localperf is a local LLM inference benchmark CLI. It runs benchmark plans against local inference servers, collects all evidence in one portable SQLite artifact per model, and renders reports that only label what the measurements actually confirm.
It is currently focused on vLLM-managed runs, with an engine-neutral benchmark spec and CLI shape:
localperf sweep plan --model <model-id> --out spec.json
localperf bench run --spec spec.json --artifact runs/models/<model-slug>.sqlite
localperf artifact render runs/models/<model-slug>.sqliteInstall with Go:
go install github.com/dutifuldev/localperf/cmd/localperf@v0.1.0Or download a prebuilt binary (linux/darwin, amd64/arm64) from the
latest release.
From a repo checkout, go run ./cmd/localperf works everywhere localperf
appears below.
- vLLM installed and available as
vllmfor real managed benchmark runs. - Enough available system memory for the model profile you run.
- Go 1.26 or newer only when installing via
go installor running from source. sqlite3if you want to inspect artifacts from the shell.
The included DiffusionGemma example targets
nvidia/diffusiongemma-26B-A4B-it-NVFP4 on a GB10/DGX Spark-class local
machine. Edit the spec before using it on a different machine or model.
Generate the default context/concurrency sweep spec instead of hand-writing the grid:
localperf sweep plan \
--model nvidia/diffusiongemma-26B-A4B-it-NVFP4 \
--contexts 8k,16k,32k --concurrency 1,4,8 \
--out spec.jsonPreview the planned runs without starting a model:
localperf bench plan --spec spec.jsonRun one dry benchmark case and validate the artifact:
localperf bench run \
--dry-run \
--spec examples/diffusiongemma-vllm-standard/spec.json \
--profile 4k-reference \
--workload claim-repro-1k-out1024 \
--concurrency 1 \
--run-dir /tmp/localperf-onecase-dry
localperf artifact check /tmp/localperf-onecase-dry.sqliteRun the full spec only when the machine is ready for it, pointing batches at one model-level artifact:
localperf bench run --spec spec.json --timeout 4h \
--artifact runs/models/<model-slug>.sqliteRender the HTML report:
localperf artifact render runs/models/<model-slug>.sqliteKeep every run of one model in a single SQLite file and render one HTML
report from it. Pointing bench run --artifact at an existing artifact
appends the new run; re-running the same run directory replaces that run.
Combine existing per-run artifacts with:
localperf artifact merge \
--into runs/models/<model-slug>.sqlite runs/batch-1.sqlite runs/batch-2.sqliteMerges are idempotent: runs already present are skipped, and a run id that collides with different provenance is refused instead of silently replaced. The report lists every run and aggregates repeated points across runs with mean ± spread.
Every workload must declare what its context number means:
"context_target": 32768,
"context_semantics": "active""active" claims the workload actually pushes ~N tokens through the KV cache
and is validated: the requested input+output must land within 90–100% of the
target, on the random dataset, with a fixed range ratio. "capacity" marks a
server-limit/concurrency point and must match the profile's max_model_len.
Specs that conflate the two are refused before any GPU time is spent, and the
report labels rows only by declared-and-measured active context or by the
measured token shape — never by max_model_len alone. See
Context Semantics for the contract.
Workloads may also declare latency targets for goodput:
"slo": {"ttft_p95_ms": 500}The report then shows the fraction of requests meeting the target and goodput in requests per second.
Each run writes:
- the SQLite artifact (
--artifactpath, orruns/<run-id>.sqlite): the canonical record — specs, engine/profile/workload definitions, measurements, per-request rows, metric stats, GPU telemetry, hardware inventory, engine identity probes, events, commands, and logs. runs/<run-id>/events.jsonl,results/*.json,logs/*.log: raw run data.runs/<run-id>/report.md,report.json,report.csv: legacy exports; the HTML report rendered from the artifact is the authoritative view.
Example inspection:
sqlite3 runs/models/<model-slug>.sqlite \
"select run_id, profile_id, workload_id, concurrency, status, aggregate_output_tok_s from measurements"Specs include a safety.min_mem_available_gib floor. localperf checks
/proc/meminfo before major steps and while subprocesses run. If available
memory drops below the floor, the current step is stopped and skipped/failed
rows are recorded.
On unified-memory systems, do not treat process/cgroup memory as total model memory. For capacity planning, compare multiple signals:
- whole-machine
MemAvailabledrop, - process/cgroup memory,
- vLLM KV-cache capacity lines,
- GPU or platform telemetry when available.
localperf samples GPU utilization and memory during measurements from every
available source (tegrastats, nvidia-smi) and names the source in the
report. See Measurement Methods for
the memory reporting policy.
The repo includes two useful examples:
examples/diffusiongemma-vllm-standard/: a reusable vLLM benchmark spec plus a completed known-run fixture.examples/gemma4-vllm-resource-sweep-20260620/: an earlier Gemma 4 resource sweep with generated tables, plots, and an HTML report.
Open the Gemma 4 report locally:
python3 -m http.server 8766 --directory examples/gemma4-vllm-resource-sweep-20260620/reportsThen visit http://127.0.0.1:8766/.