A repeatable bench for evaluating how well Claude Code performs real JetBrains MPS work with different MPS↔agent tools, using mbeddr.core as the subject project.
It is a per-tool capability assessment: every condition shares the same task list, prompts, model, and grading rubrics, and only the tool differs. The MPS version a tool runs on is recorded per result, so a reader can judge whether it matters for a given task. See docs/DESIGN.md for the full framing.
- Why it's built the way it is — docs/DESIGN.md
- Want to build your own MPS bench? The traps we hit and how we got past them — docs/BUILDING-AN-MPS-BENCH.md
Given an identical task prompt, can the agent do real MPS work — author a BaseLanguage method, manipulate the model,
keep it error-free and compiling — and how does each tool change what's possible? A first-class dimension is whether the
result is idiomatic MPS (smodel concept literals, node<> types, typed queries) or plain-Java reflection with
hardcoded concept IDs.
The output is published evidence: transcripts, artifacts, and gradings under results/, framed as "what happened when I
tried."
| Condition | What | MPS instance |
|---|---|---|
| baseline | No MCP, no MPS. Claude on the raw .mps XML | none |
| mps-mcp | MPS's bundled MCP server | MPS 2026.1 |
baseline is the control: do MCP servers actually beat Claude grepping XML with its file tools?
Each condition runs from a golden config image — a pristine CLAUDE_CONFIG_DIR into which the tool's own
skill-installation flow ran once (setup/setup-<condition>.sh). Every run gets a throwaway scratch copy, so no personal
config, memories, or cross-run state can leak in. Images are built locally and never committed;
conditions/<c>/image-manifest.txt records what your image contained so others can compare or knowingly diverge. See
Isolation for the full rationale.
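
The scratch-copy step can be pictured roughly as follows. This is a minimal Python sketch, not the bench's actual code: the helper name make_scratch_config, the temp-dir prefix, and the returned shape are assumptions made for illustration; only CLAUDE_CONFIG_DIR comes from the text above.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_scratch_config(golden_image: Path) -> tuple[Path, dict[str, str]]:
    """Copy a golden config image into a fresh temp dir and build a run environment.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    scratch_root = Path(tempfile.mkdtemp(prefix="bench-scratch-"))
    scratch = scratch_root / "claude-config"
    # Copy rather than link, so nothing the run writes can reach the golden image.
    shutil.copytree(golden_image, scratch)
    env = dict(os.environ)
    # Point the agent at the throwaway copy instead of any personal config.
    env["CLAUDE_CONFIG_DIR"] = str(scratch)
    return scratch, env
```

The returned environment would be passed to whatever launches the agent, and the scratch directory deleted once the run's outputs are collected.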
- macOS (the scripts use open/osascript; ports welcome).
- The MPS installation referenced in bench.config.sh, with its bundled MCP server enabled (a one-time, in-IDE step).
- The Claude Code CLI, plus either a Claude subscription (claude setup-token) or an Anthropic API key.
- git, python3 (stdlib only), rsync, curl, and a JDK suitable for the mbeddr gradle build.
- tmux, only for the optional interactive invocation mode (see Invocation modes).
cp bench.config.local.example.sh bench.config.local.sh # edit: auth token, app paths
./setup/provision-workspace.sh # clone mbeddr @ pinned SHA, gradle-generate
# tools/BigProject, bake task seeds, snapshot
./setup/setup-baseline.sh # empty golden image + strict settings
./setup/setup-mps-mcp.sh # unattended: launches MPS, calls the MCP
# server's init tool, installs its skills

provision-workspace.sh is the slow one (it clones mbeddr and runs the gradle build). It is keyed by MBEDDR_SHA +
PROVISIONING_VERSION; bump either to force a fresh snapshot. The setup scripts are fast and rebuild the per-condition
images on top of that snapshot.
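
One way to picture the keying: a snapshot is reused only while both inputs are unchanged. The sketch below is an assumption about the idea, not the script's implementation; the function names and the key format are hypothetical.

```python
def snapshot_key(mbeddr_sha: str, provisioning_version: int) -> str:
    """Combine the two inputs named above into one identifier (format is hypothetical)."""
    return f"{mbeddr_sha}-v{provisioning_version}"


def needs_provisioning(existing_keys: set[str], mbeddr_sha: str, provisioning_version: int) -> bool:
    """A fresh snapshot is needed when none exists under the current key."""
    return snapshot_key(mbeddr_sha, provisioning_version) not in existing_keys
```

Bumping either value changes the key, so the existing snapshot no longer matches and provisioning runs again.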
./run.sh 00-smoke baseline # validate the plumbing cheaply first
./run.sh 01-migrationscripts mps-mcp
./run-matrix.sh # all real tasks x conditions (skips done cells)
./run-matrix.sh --reps 3 # more repetitions per cell
./run-matrix.sh --conditions baseline,mps-mcp --tasks 01-migrationscripts

A run resets the workspace from the snapshot, cold-starts the condition's MPS, waits for its MCP endpoint, then runs
Claude Code with the task prompt — identical across conditions; any tool-specific guidance must come from the tool's own
installed skills. A hard wall-clock timeout (RUN_TIMEOUT_SECONDS, 45 minutes by default) bounds each run.
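
A hard wall-clock bound of this kind can be sketched with the Python standard library. This is an illustration under assumptions, not the bench's runner: run_with_timeout and the status field names are hypothetical, and the command is whatever launches the agent.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd: list[str], timeout_seconds: float) -> dict:
    """Run cmd, killing it if it exceeds the wall-clock limit.

    Returns a small status record; the field names are illustrative.
    """
    started = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_seconds)
        status = "completed" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising.
        status = "timeout"
    return {"status": status, "duration_seconds": time.monotonic() - started}


if __name__ == "__main__":
    # Stand-in command; a real run would launch the agent here.
    print(run_with_timeout([sys.executable, "-c", "print('done')"], timeout_seconds=45 * 60))
```

Note that subprocess.run only kills the direct child. A runner that starts MPS or other helpers would need to clean those up separately.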
Set INVOCATION_MODE in bench.config.sh:
- print (default): headless claude -p with stream-json output. The simplest path.
- interactive: a real Claude Code session driven inside a detached tmux server, with completion detected via an injected Stop hook. Retained as a fallback for the case where headless claude -p bills differently from a normal subscription session; flip INVOCATION_MODE to switch. See Agent invocation.
Each run writes into results/<task>/<condition>/rep<N>/:
- transcript.jsonl: the conversation with token usage (the primary evidence)
- metadata.json: model, status, turns/tokens, durations, versions, snapshot id, invocation + auth mode
- artifacts/: a complete workspace diff (changed/new files + deleted.txt)
- grading.yaml: a self-contained copy of the task rubric (id/desc/max) with empty score/notes to fill in; descriptions and max points are synced from the task's rubric.yaml (the source of truth) via ./sync-grading.sh
- stderr.log: agent stderr (print mode)
- In interactive mode also: transcript.txt (a /export-rendered conversation), stop-input.json (Stop-hook stdin), and ui.log (raw tmux pane output)
A timed-out run is killed before any rendered transcript can be produced. To read its transcript.jsonl:
python3 lib/render_transcript.py results/<task>/<condition>/rep1 [--full]

ui.log (interactive mode only) is a raw terminal byte stream, so pagers show garbage. Replay it into a real terminal of
the recorded size: tmux -L view new -x 220 -y 50, then cat ui.log inside.
Grading is manual, per rubric item, with partial credit — fill in the score: fields in each run's grading.yaml.
Partial credit keeps signal that binary pass/fail would throw away at low rep counts.
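
To see why, compare the two views on a pair of runs. The sketch below uses invented scores for two hypothetical runs (they are not bench results), and the item ids and helper names are assumptions.

```python
def fraction_earned(items: list[dict]) -> float:
    """Share of available rubric points earned in one run."""
    earned = sum(item["score"] for item in items)
    available = sum(item["max"] for item in items)
    return earned / available


def passed(items: list[dict]) -> bool:
    """Binary view: a run passes only with full marks on every item."""
    return all(item["score"] == item["max"] for item in items)


# Invented scores for two hypothetical runs; not bench results.
run_a = [{"id": "compiles", "score": 2, "max": 3}, {"id": "idiomatic", "score": 3, "max": 3}]
run_b = [{"id": "compiles", "score": 0, "max": 3}, {"id": "idiomatic", "score": 1, "max": 3}]

print(passed(run_a), passed(run_b))  # False False
print(round(fraction_earned(run_a), 2), round(fraction_earned(run_b), 2))  # 0.83 0.17
```

Under pass/fail the two runs are indistinguishable; with partial credit one is clearly closer to a solution, which matters when a cell has only one or a few repetitions.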
To inspect a result in MPS (e.g. to run the model checker or a build), rebuild the run's workspace from the snapshot plus its artifacts overlay:
./restore-run.sh results/01-migrationscripts/mps-mcp/rep1 --open

Grading is decoupled from running: any third party can restore a run and re-grade it. Aggregate all gradings into a markdown table:
./report.sh

report.sh also validates each grading against its rubric: it errors on a missing item or a score over max, and
warns on a description that no longer matches the rubric or leftover invalid: entries (a blank score is neither —
ungraded is fine). After editing a rubric.yaml, run ./sync-grading.sh to push the new descriptions/max into every
run's grading.yaml, add entries for new items, reorder to match the rubric, and park now-invalid scores (item removed,
or score over the new max) under an invalid: key for re-grading.
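
The validation rules above can be restated as a small function. This is a sketch of the rules as described, not report.sh itself: the in-memory shapes (a rubric as a list of id/desc/max dicts, a grading with items and an optional invalid list) and the message wording are assumptions.

```python
def validate_grading(rubric: list[dict], grading: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for one grading checked against its rubric.

    Assumed shapes: rubric is a list of {"id", "desc", "max"}; grading is
    {"items": [{"id", "desc", "score"}], "invalid": [...]} where score may be None.
    """
    errors: list[str] = []
    warnings: list[str] = []
    graded = {item["id"]: item for item in grading.get("items", [])}
    for entry in rubric:
        item = graded.get(entry["id"])
        if item is None:
            errors.append(f"missing item: {entry['id']}")
            continue
        score = item.get("score")
        # A blank score means "not graded yet": neither an error nor a warning.
        if score is not None and score > entry["max"]:
            errors.append(f"score over max: {entry['id']}")
        if item.get("desc") != entry["desc"]:
            warnings.append(f"stale description: {entry['id']}")
    if grading.get("invalid"):
        warnings.append("leftover invalid entries")
    return errors, warnings
```

Keeping errors and warnings separate mirrors the split described above: errors mean the grading cannot be trusted as it stands, warnings mean it should be re-synced or re-graded.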
Create tasks/<nn-name>/prompt.md (the exact prompt, with no tool-specific hints) and rubric.yaml (a flat item list of
id/desc/max). rubric.yaml is the single source of truth for descriptions and max points, which are copied into each run's
grading.yaml alongside its score/notes. Keep the schema as it is: the reporter parses it line by line. If a task needs a
pre-built starting point (e.g. an existing stub for the agent to fill in), add it as a seed baked into the snapshot at
provisioning time; see Tasks and pre-seeding.
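
To show why the flat schema matters to a line-based reader, here is a minimal sketch of such a parser. The file layout in the example (a top-level list of id/desc/max entries), the item ids, and the function name are assumptions for illustration; the real rubric.yaml layout and the reporter's parser may differ.

```python
def parse_rubric(text: str) -> list[dict]:
    """Parse a flat list of id/desc/max items one line at a time (assumed layout)."""
    items: list[dict] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            # A leading dash starts the next rubric item.
            items.append({})
            line = line[2:].strip()
        key, sep, value = line.partition(":")
        if not sep or not items:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = key.strip(), value.strip()
        items[-1][key] = int(value) if key == "max" else value
    return items


# Hypothetical rubric content, not taken from a real task.
example = """\
- id: compiles
  desc: The edited model compiles without errors
  max: 3
- id: idiomatic
  desc: Uses smodel concept literals rather than reflection
  max: 2
"""
print(parse_rubric(example))
```

A reader like this has no notion of nesting, quoting, or multi-line values, so any of those in a rubric would be misread or rejected.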
- The MPS version a tool runs on is recorded per run (mps_app in metadata); whether it matters depends on the task (see docs/DESIGN.md).
- Golden images depend on tool versions at setup time. Re-run the setup scripts after updating a tool, and commit the refreshed manifest.
- The sandbox settings key and the stream-json format track current Claude Code behavior; if an update changes them, the smoke task surfaces it.