MPS LLM Tooling Bench

A repeatable bench for evaluating how well Claude Code performs real JetBrains MPS work with different MPS↔agent tools, using mbeddr.core as the subject project.

It is a per-tool capability assessment: every condition shares the same task list, prompts, model, and grading rubrics, and only the tool differs. The MPS version a tool runs on is recorded per result, so a reader can judge whether it matters for a given task. See docs/DESIGN.md for the full framing.

Why it's built the way it is — docs/DESIGN.md
Want to build your own MPS bench? The traps we hit and how we got past them — docs/BUILDING-AN-MPS-BENCH.md

What it measures

Given an identical task prompt, can the agent do real MPS work — author a BaseLanguage method, manipulate the model, keep it error-free and compiling — and how does each tool change what's possible? A first-class dimension is whether the result is idiomatic MPS (smodel concept literals, node<> types, typed queries) or plain-Java reflection with hardcoded concept IDs.

The output is published evidence: transcripts, artifacts, and gradings under results/, framed as "what happened when I tried."

Conditions

Condition	What	MPS instance
`baseline`	No MCP, no MPS. Claude on the raw `.mps` XML	none
`mps-mcp`	MPS's bundled MCP server	MPS 2026.1

baseline is the control: do MCP servers actually beat Claude grepping XML with its file tools?

Each condition runs from a golden config image — a pristine CLAUDE_CONFIG_DIR into which the tool's own skill-installation flow ran once (setup/setup-<condition>.sh). Every run gets a throwaway scratch copy, so no personal config, memories, or cross-run state can leak in. Images are built locally and never committed; conditions/<c>/image-manifest.txt records what your image contained so others can compare or knowingly diverge. See Isolation for the full rationale.
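The scratch-copy idea can be sketched in a few lines of Python. This is an illustration only, not the bench's actual code: the function names and the `claude-config` directory name are hypothetical, and the only detail taken from this README is that the agent is pointed at the copy through CLAUDE_CONFIG_DIR.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_scratch_config(golden_image: Path) -> Path:
    """Copy the golden image into a fresh temp dir and return the copy's path."""
    scratch_root = Path(tempfile.mkdtemp(prefix="bench-config-"))
    scratch = scratch_root / "claude-config"  # hypothetical directory name
    # symlinks=True copies links as links instead of following them.
    shutil.copytree(golden_image, scratch, symlinks=True)
    return scratch


def env_for_run(scratch: Path) -> dict:
    """Environment for the agent process, pointing it at the scratch copy."""
    env = dict(os.environ)
    env["CLAUDE_CONFIG_DIR"] = str(scratch)
    return env
```

Because the agent only ever writes to the copy, anything it stores during a run disappears with the scratch directory and the golden image stays pristine.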

Prerequisites

macOS (the scripts use open/osascript; ports welcome).
The MPS installation referenced in bench.config.sh, with its bundled MCP server enabled (a one-time, in-IDE step).
The Claude Code CLI, plus either a Claude subscription (claude setup-token) or an Anthropic API key.
git, python3 (stdlib only), rsync, curl, and a JDK suitable for the mbeddr gradle build.
tmux — only for the optional interactive invocation mode (see Invocation modes).

Setup

cp bench.config.local.example.sh bench.config.local.sh   # edit: auth token, app paths
./setup/provision-workspace.sh    # clone mbeddr @ pinned SHA, gradle-generate
                                  # tools/BigProject, bake task seeds, snapshot
./setup/setup-baseline.sh         # empty golden image + strict settings
./setup/setup-mps-mcp.sh          # unattended: launches MPS, calls the MCP
                                  # server's init tool, installs its skills

provision-workspace.sh is the slow one (it clones mbeddr and runs the gradle build). It is keyed by MBEDDR_SHA + PROVISIONING_VERSION; bump either to force a fresh snapshot. The setup scripts are fast and rebuild the per-condition images on top of that snapshot.
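As an illustration of why bumping either value forces a fresh snapshot, here is a minimal sketch of a cache key built from the two inputs. How provision-workspace.sh actually derives its snapshot id is not shown in this README; the hashing scheme and the `snapshot_id` name below are assumptions.

```python
import hashlib


def snapshot_id(mbeddr_sha: str, provisioning_version: str) -> str:
    """Derive a short, stable id from the two inputs that key the snapshot.

    Hypothetical scheme: any change to either input yields a different id,
    so an existing snapshot under the old id is no longer found.
    """
    raw = f"{mbeddr_sha}:{provisioning_version}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]
```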

Running

./run.sh 00-smoke baseline        # validate the plumbing cheaply first
./run.sh 01-migrationscripts mps-mcp
./run-matrix.sh                   # all real tasks x conditions (skips done cells)
./run-matrix.sh --reps 3          # more repetitions per cell
./run-matrix.sh --conditions baseline,mps-mcp --tasks 01-migrationscripts

A run resets the workspace from the snapshot, cold-starts the condition's MPS and waits for its MCP endpoint (baseline has neither, so there is nothing to start), then runs Claude Code with the task prompt. The prompt is identical across conditions; any tool-specific guidance must come from the tool's own installed skills. A hard wall-clock timeout (RUN_TIMEOUT_SECONDS, 45 minutes by default) bounds each run.
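The wall-clock bound can be sketched with the Python standard library. This is not how run.sh implements it; `run_with_timeout` and its status strings are hypothetical, and the 45-minute default would correspond to `timeout_seconds=2700`.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return (status, returncode).

    status is "completed" or "timeout". On timeout, subprocess.run kills
    the direct child and waits for it before raising TimeoutExpired;
    processes that the child spawned itself may survive.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return "timeout", None
    return "completed", proc.returncode
```

Recording the status separately from the return code lets a timed-out run be told apart from one that finished with an error.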

Invocation modes

Set INVOCATION_MODE in bench.config.sh:

print (default): headless claude -p with stream-json output. The simplest path.
interactive: a real Claude Code session driven inside a detached tmux server, with completion detected via an injected Stop hook. Retained as a fallback for the case where headless claude -p bills differently from a normal subscription session; flip INVOCATION_MODE to switch. See Agent invocation.

What a run captures

Into results/<task>/<condition>/rep<N>/:

transcript.jsonl — the conversation with token usage (the primary evidence)
metadata.json — model, status, turns/tokens, durations, versions, snapshot id, invocation + auth mode
artifacts/ — a complete workspace diff (changed/new files + deleted.txt)
grading.yaml — a self-contained copy of the task rubric (id/desc/max) with empty score/notes to fill in; descriptions and max points are synced from the task's rubric.yaml (the source of truth) via ./sync-grading.sh
stderr.log — agent stderr (print mode)
In interactive mode also: transcript.txt (a /export-rendered conversation), stop-input.json (Stop-hook stdin), and ui.log (raw tmux pane output)
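To make the artifacts/ layout concrete, here is a minimal sketch of a workspace diff that copies changed and new files and lists removed ones in deleted.txt. It is an assumption about the shape of the result, not the bench's implementation: the function name is hypothetical, and the one-relative-path-per-line format of deleted.txt is a guess.

```python
import filecmp
import shutil
from pathlib import Path


def _files(root: Path) -> set:
    """All regular files under root, as paths relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def capture_artifacts(snapshot: Path, workspace: Path, artifacts: Path) -> None:
    """Write changed/new files and deleted.txt into artifacts.

    artifacts must live outside workspace, or it would be diffed too.
    """
    before, after = _files(snapshot), _files(workspace)
    artifacts.mkdir(parents=True, exist_ok=True)
    for rel in sorted(after):
        src = workspace / rel
        # shallow=False compares file contents, not just size and mtime.
        if rel in before and filecmp.cmp(snapshot / rel, src, shallow=False):
            continue  # unchanged since the snapshot
        dest = artifacts / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    deleted = sorted(rel.as_posix() for rel in before - after)
    (artifacts / "deleted.txt").write_text("".join(f"{d}\n" for d in deleted))
```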

A timed-out run is killed before any rendered transcript can be produced. To read its transcript.jsonl:

python3 lib/render_transcript.py results/<task>/<condition>/rep1 [--full]

ui.log (interactive mode only) is a raw terminal byte stream — pagers show garbage. Replay it into a real terminal of the recorded size: tmux -L view new -x 220 -y 50, then cat ui.log inside.

Grading and reporting

Grading is manual, per rubric item, with partial credit — fill in the score: fields in each run's grading.yaml. Partial credit keeps signal that binary pass/fail would throw away at low rep counts.

To inspect a result in MPS (e.g. to run the model checker or a build), rebuild the run's workspace from the snapshot plus its artifacts overlay:

./restore-run.sh results/01-migrationscripts/mps-mcp/rep1 --open
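Conceptually, "snapshot plus artifacts overlay" means three steps: copy the snapshot, copy the run's changed and new files over it, then remove the files the run deleted. The sketch below illustrates that idea; it is not the code of restore-run.sh, the function name is hypothetical, and it assumes deleted.txt holds one workspace-relative path per line.

```python
import shutil
from pathlib import Path


def restore_run(snapshot: Path, artifacts: Path, dest: Path) -> None:
    """Rebuild a run's workspace at dest (which must not exist yet)."""
    # 1. Start from the pristine snapshot.
    shutil.copytree(snapshot, dest)

    # 2. Overlay the files the run changed or created.
    deleted_list = artifacts / "deleted.txt"
    for src in artifacts.rglob("*"):
        if src.is_file() and src != deleted_list:
            target = dest / src.relative_to(artifacts)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)

    # 3. Remove the files the run deleted (assumed format: one path per line).
    if deleted_list.exists():
        for line in deleted_list.read_text().splitlines():
            if line.strip():
                (dest / line.strip()).unlink(missing_ok=True)
```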

Grading is decoupled from running: any third party can restore a run and re-grade it. Aggregate all gradings into a markdown table:

./report.sh

report.sh also validates each grading against its rubric. It errors on a missing item or a score over max, and warns on a description that no longer matches the rubric or on leftover invalid: entries. A blank score triggers neither, since ungraded is fine. After editing a rubric.yaml, run ./sync-grading.sh: it pushes the new descriptions/max into every run's grading.yaml, adds entries for new items, reorders to match the rubric, and parks now-invalid scores (item removed, or score over the new max) under an invalid: key for re-grading.
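The validation rules can be written down as a small function over already-parsed data. This is a sketch of the rules as described above, not the reporter's code: the function name and the in-memory shapes (a list of rubric items, and a grading dict with `items` and `invalid` keys) are assumptions.

```python
def validate_grading(rubric, grading):
    """Return (errors, warnings) for one grading checked against its rubric.

    rubric:  list of {"id", "desc", "max"} dicts
    grading: {"items": [{"id", "desc", "max", "score"}, ...], "invalid": [...]}
    A score of None means "not graded yet" and is neither error nor warning.
    """
    errors, warnings = [], []
    graded = {item["id"]: item for item in grading.get("items", [])}
    for r in rubric:
        g = graded.get(r["id"])
        if g is None:
            errors.append(f"{r['id']}: missing from grading")
            continue
        score = g.get("score")
        if score is not None and score > r["max"]:
            errors.append(f"{r['id']}: score {score} exceeds max {r['max']}")
        if g.get("desc") != r["desc"]:
            warnings.append(f"{r['id']}: description differs from rubric")
    if grading.get("invalid"):
        warnings.append("leftover invalid: entries need re-grading")
    return errors, warnings
```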

Adding tasks

Create tasks/<nn-name>/prompt.md (the exact prompt, with no tool-specific hints) and rubric.yaml, a flat item list of id/desc/max. The rubric is the single source of truth for descriptions and max points, which are copied into each run's grading.yaml alongside its score/notes. Keep the schema flat: the reporter parses it line by line. If a task needs a pre-built starting point (e.g. an existing stub for the agent to fill in), add it as a seed baked into the snapshot at provisioning time; see Tasks and pre-seeding.
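To show why the rubric schema has to stay flat, here is a minimal sketch of a line-based parser. The exact layout of rubric.yaml is not shown in this README, so the format in the docstring (a `- id:` line starting each item, followed by `desc:` and `max:` lines, with an integer max) is an assumption, as is the `parse_rubric` name.

```python
def parse_rubric(text):
    """Parse a flat item list, one `key: value` pair per line.

    Assumed layout (hypothetical):
        - id: compiles
          desc: Model is error-free and compiles
          max: 3
    """
    items = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("- "):
            items.append({})  # a leading dash starts a new item
            line = line[2:].strip()
        if not items:
            raise ValueError(f"field outside of an item: {raw!r}")
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if not sep or key not in ("id", "desc", "max"):
            raise ValueError(f"unsupported line: {raw!r}")
        items[-1][key] = int(value) if key == "max" else value.strip("\"'")
    return items
```

A parser like this has no notion of nesting, multi-line strings, or anchors, so any YAML feature beyond one key and value per line would either raise or be silently misread.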

Known caveats

The MPS version a tool runs on is recorded per run (mps_app in metadata); whether it matters depends on the task — see docs/DESIGN.md.
Golden images depend on tool versions at setup time. Re-run the setup scripts after updating a tool, and commit the refreshed manifest.
The sandbox settings key and the stream-json format track current Claude Code behavior; if an update changes them, the smoke task surfaces it.
