specificlanguages/mps-ai-benchmarks

MPS LLM Tooling Bench

A repeatable bench for evaluating how well Claude Code performs real JetBrains MPS work with different MPS↔agent tools, using mbeddr.core as the subject project.

It is a per-tool capability assessment: every condition shares the same task list, prompts, model, and grading rubrics, and only the tool differs. The MPS version a tool runs on is recorded per result, so a reader can judge whether it matters for a given task. See docs/DESIGN.md for the full framing.

What it measures

Given an identical task prompt, can the agent do real MPS work — author a BaseLanguage method, manipulate the model, keep it error-free and compiling — and how does each tool change what's possible? A first-class dimension is whether the result is idiomatic MPS (smodel concept literals, node<> types, typed queries) or plain-Java reflection with hardcoded concept IDs.

The output is published evidence: transcripts, artifacts, and gradings under results/, framed as "what happened when I tried."

Conditions

| Condition | What | MPS instance |
|-----------|------|--------------|
| baseline | No MCP, no MPS. Claude on the raw .mps XML | none |
| mps-mcp | MPS's bundled MCP server | MPS 2026.1 |

baseline is the control: do MCP servers actually beat Claude grepping XML with its file tools?

Each condition runs from a golden config image — a pristine CLAUDE_CONFIG_DIR into which the tool's own skill-installation flow ran once (setup/setup-<condition>.sh). Every run gets a throwaway scratch copy, so no personal config, memories, or cross-run state can leak in. Images are built locally and never committed; conditions/<c>/image-manifest.txt records what your image contained so others can compare or knowingly diverge. See Isolation for the full rationale.
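
The scratch-copy idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the harness's actual code: the function names and the `bench-scratch-` prefix are hypothetical, and only `CLAUDE_CONFIG_DIR` comes from the description above.

```python
import os
import shutil
import tempfile
from pathlib import Path


def make_scratch_config(golden_image: Path) -> Path:
    """Copy a golden config image into a fresh throwaway directory."""
    scratch_root = Path(tempfile.mkdtemp(prefix="bench-scratch-"))
    scratch = scratch_root / "claude-config"
    # copytree requires that the destination does not exist yet.
    shutil.copytree(golden_image, scratch)
    return scratch


def agent_environment(scratch: Path) -> dict:
    """Environment for the agent process, pointing at the scratch copy."""
    env = dict(os.environ)
    env["CLAUDE_CONFIG_DIR"] = str(scratch)
    return env
```

Because the agent only ever sees the copy, anything it writes (memories, settings changes) disappears with the scratch directory and the golden image stays pristine.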

Prerequisites

  • macOS (the scripts use open/osascript; ports welcome).
  • The MPS installation referenced in bench.config.sh, with its bundled MCP server enabled (a one-time, in-IDE step).
  • The Claude Code CLI, plus either a Claude subscription (claude setup-token) or an Anthropic API key.
  • git, python3 (stdlib only), rsync, curl, and a JDK suitable for the mbeddr gradle build.
  • tmux — only for the optional interactive invocation mode (see Invocation modes).

Setup

cp bench.config.local.example.sh bench.config.local.sh   # edit: auth token, app paths
./setup/provision-workspace.sh    # clone mbeddr @ pinned SHA, gradle-generate
                                  # tools/BigProject, bake task seeds, snapshot
./setup/setup-baseline.sh         # empty golden image + strict settings
./setup/setup-mps-mcp.sh          # unattended: launches MPS, calls the MCP
                                  # server's init tool, installs its skills

provision-workspace.sh is the slow one (it clones mbeddr and runs the gradle build). It is keyed by MBEDDR_SHA + PROVISIONING_VERSION; bump either to force a fresh snapshot. The setup scripts are fast and rebuild the per-condition images on top of that snapshot.
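
One way to picture the keying is a snapshot identifier derived from the two inputs. The README does not say how the harness forms its key, so the hashing below is an assumption and both function names are hypothetical.

```python
import hashlib


def snapshot_key(mbeddr_sha: str, provisioning_version: str) -> str:
    """Derive a short, stable identifier from the two key inputs."""
    material = f"{mbeddr_sha}\n{provisioning_version}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()[:12]


def needs_provisioning(existing_key, mbeddr_sha: str,
                       provisioning_version: str) -> bool:
    """True when no snapshot exists yet or either input has changed."""
    return existing_key != snapshot_key(mbeddr_sha, provisioning_version)
```

With a scheme like this, bumping either value yields a different key, which is what forces the fresh snapshot.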

Running

./run.sh 00-smoke baseline        # validate the plumbing cheaply first
./run.sh 01-migrationscripts mps-mcp
./run-matrix.sh                   # all real tasks x conditions (skips done cells)
./run-matrix.sh --reps 3          # more repetitions per cell
./run-matrix.sh --conditions baseline,mps-mcp --tasks 01-migrationscripts
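
The "skips done cells" behaviour can be sketched as below. This is an illustration only: the function name is hypothetical, and treating the presence of metadata.json as the marker of a finished run is an assumption, not something the harness documents.

```python
from itertools import product
from pathlib import Path


def pending_cells(results: Path, tasks, conditions, reps: int):
    """Yield (task, condition, rep) for every cell without a result yet."""
    for task, condition in product(tasks, conditions):
        for rep in range(1, reps + 1):
            run_dir = results / task / condition / f"rep{rep}"
            # Assumption: metadata.json marks a finished run.
            if not (run_dir / "metadata.json").exists():
                yield task, condition, rep
```

Raising the repetition count on a later invocation then only schedules the repetitions that are still missing.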

A run resets the workspace from the snapshot, cold-starts the condition's MPS, waits for its MCP endpoint, then runs Claude Code with the task prompt. The prompt is identical across conditions; any tool-specific guidance must come from the tool's own installed skills. A hard wall-clock timeout (RUN_TIMEOUT_SECONDS, 45 minutes by default) bounds each run.
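
A hard wall-clock bound of this kind can be sketched with Python's standard library. The function name and the three status strings are hypothetical, and the harness's own scripts may enforce the limit differently.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds: float) -> str:
    """Run a command; return 'completed', 'failed', or 'timeout'."""
    try:
        proc = subprocess.run(argv, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout"
    return "completed" if proc.returncode == 0 else "failed"
```

Note that this only kills the direct child. A real harness that launches an IDE alongside the agent would also need to clean up those other processes.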

Invocation modes

Set INVOCATION_MODE in bench.config.sh:

  • print (default): headless claude -p with stream-json output. The simplest path.
  • interactive: a real Claude Code session driven inside a detached tmux server, with completion detected via an injected Stop hook. Retained as a fallback for the case where headless claude -p bills differently from a normal subscription session; flip INVOCATION_MODE to switch. See Agent invocation.

What a run captures

Into results/<task>/<condition>/rep<N>/:

  • transcript.jsonl — the conversation with token usage (the primary evidence)
  • metadata.json — model, status, turns/tokens, durations, versions, snapshot id, invocation + auth mode
  • artifacts/ — a complete workspace diff (changed/new files + deleted.txt)
  • grading.yaml — a self-contained copy of the task rubric (id/desc/max) with empty score/notes to fill in; descriptions and max points are synced from the task's rubric.yaml (the source of truth) via ./sync-grading.sh
  • stderr.log — agent stderr (print mode)
  • In interactive mode also: transcript.txt (a /export-rendered conversation), stop-input.json (Stop-hook stdin), and ui.log (raw tmux pane output)
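
To make the artifacts/ entry concrete, here is a minimal sketch of computing a workspace diff against the snapshot. It is not the harness's implementation: the function names are hypothetical, and it only considers regular files (empty directories and file modes are ignored).

```python
import filecmp
from pathlib import Path


def relative_files(root: Path) -> set:
    """All regular files under root, as paths relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def workspace_diff(snapshot: Path, workspace: Path):
    """Return (changed_or_new, deleted) lists of relative paths."""
    before = relative_files(snapshot)
    after = relative_files(workspace)
    changed_or_new = sorted(
        rel for rel in after
        if rel not in before
        # shallow=False compares file contents, not just size and mtime.
        or not filecmp.cmp(snapshot / rel, workspace / rel, shallow=False)
    )
    deleted = sorted(before - after)
    return changed_or_new, deleted
```

The first list corresponds to the files copied into artifacts/, the second to the entries of deleted.txt.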

A timed-out run is killed before any rendered transcript can be produced. To read its transcript.jsonl:

python3 lib/render_transcript.py results/<task>/<condition>/rep1 [--full]

ui.log (interactive mode only) is a raw terminal byte stream — pagers show garbage. Replay it into a real terminal of the recorded size: tmux -L view new -x 220 -y 50, then cat ui.log inside.

Grading and reporting

Grading is manual, per rubric item, with partial credit — fill in the score: fields in each run's grading.yaml. Partial credit keeps signal that binary pass/fail would throw away at low rep counts.

To inspect a result in MPS (e.g. to run the model checker or a build), rebuild the run's workspace from the snapshot plus its artifacts overlay:

./restore-run.sh results/01-migrationscripts/mps-mcp/rep1 --open
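
Conceptually, the restore is "snapshot, then overlay, then deletions". The sketch below illustrates that order in Python. It assumes deleted.txt sits at the top of artifacts/ and lists one workspace-relative path per line, which the README does not spell out, and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def restore_run(snapshot: Path, artifacts: Path, target: Path) -> None:
    """Rebuild a run's workspace: snapshot, then overlay, then deletions."""
    shutil.copytree(snapshot, target)  # target must not exist yet
    deleted_list = artifacts / "deleted.txt"
    for src in artifacts.rglob("*"):
        if src.is_file() and src != deleted_list:
            dest = target / src.relative_to(artifacts)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # changed and new files win
    if deleted_list.exists():
        # Assumption: one workspace-relative path per line.
        for line in deleted_list.read_text().splitlines():
            if line.strip():
                (target / line.strip()).unlink(missing_ok=True)
```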

Grading is decoupled from running: any third party can restore a run and re-grade it. Aggregate all gradings into a markdown table:

./report.sh

report.sh also validates each grading against its rubric. It errors on a missing item or a score over max, and warns on a description that no longer matches the rubric or on leftover invalid: entries. A blank score triggers neither, since ungraded is fine. After editing a rubric.yaml, run ./sync-grading.sh to push the new descriptions/max into every run's grading.yaml, add entries for new items, reorder them to match the rubric, and park now-invalid scores (item removed, or score over the new max) under an invalid: key for re-grading.
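
The validation rules can be restated as a small Python function. This is a sketch of the rules as described here, not a port of report.sh: the function name and the in-memory shapes of `rubric` and `grading` are hypothetical.

```python
def validate_grading(rubric: list, grading: dict):
    """Check one grading against its rubric.

    rubric:  list of {"id", "desc", "max"} dicts (hypothetical shape).
    grading: {"items": {id: {"desc", "score"}}, "invalid": [...]}.
    Returns (errors, warnings) as lists of strings.
    """
    errors, warnings = [], []
    items = grading.get("items", {})
    for entry in rubric:
        graded = items.get(entry["id"])
        if graded is None:
            errors.append(f"{entry['id']}: missing from grading")
            continue
        score = graded.get("score")
        # A blank score means "not graded yet" and is deliberately allowed.
        if score is not None and score > entry["max"]:
            errors.append(
                f"{entry['id']}: score {score} exceeds max {entry['max']}")
        if graded.get("desc") != entry["desc"]:
            warnings.append(f"{entry['id']}: description differs from rubric")
    if grading.get("invalid"):
        warnings.append("leftover invalid: entries await re-grading")
    return errors, warnings
```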

Adding tasks

Create tasks/<nn-name>/prompt.md (the exact prompt, with no tool-specific hints) and rubric.yaml, a flat item list of id/desc/max. The rubric is the single source of truth for descriptions and max points, which are copied into each run's grading.yaml alongside its score/notes. Keep the schema as it is, because the reporter parses it line by line. If a task needs a pre-built starting point (e.g. an existing stub for the agent to fill in), add it as a seed baked into the snapshot at provisioning time; see Tasks and pre-seeding.
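
To show why a flat schema matters for a line-based reader, here is a sketch of such a parser. The exact file layout is an assumption (one `- id:` line opening each item, followed by `desc:` and `max:` lines), and `parse_rubric` is a hypothetical name; consult an existing rubric.yaml for the real format.

```python
def parse_rubric(text: str) -> list:
    """Parse a flat rubric item list line by line (no YAML library)."""
    items = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            items.append({})  # a leading dash starts a new item
            line = line[2:].strip()
        if not items or ":" not in line:
            continue
        # Split on the first colon only, so descriptions may contain colons.
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip().strip('"')
        items[-1][key] = int(value) if key == "max" else value
    return items
```

A reader like this has no notion of nesting or multi-line values, so YAML that is valid but more elaborate than the flat list would be misread silently.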

Known caveats

  • The MPS version a tool runs on is recorded per run (mps_app in metadata); whether it matters depends on the task — see docs/DESIGN.md.
  • Golden images depend on tool versions at setup time. Re-run the setup scripts after updating a tool, and commit the refreshed manifest.
  • The sandbox settings key and the stream-json format track current Claude Code behavior; if an update changes them, the smoke task surfaces it.
