Parsewave experiments on the ProgramBench reverse-engineering benchmark.
A fork of facebookresearch/programbench with our agent doctrine and rollout traces overlaid. Upstream README continues below.
The headline experiment in this repo is a mean-of-5 evaluation of Claude Opus 4.7 (max effort) on the easy-10 ProgramBench subset, using a clean-room reverse-engineering doctrine.
- Scoreboard + methodology: exp-traces/README-best-of-5.md
- Per-rollout traces (10 tasks × 5 rollouts = 50 task-traces): under exp-traces/cc-opus-easy10-v2-C-{1..5}/, each task dir has the full trajectory.jsonl, submission.tar.gz, and eval.json.
- Doctrine: opus-experiment/CLAUDE.md, the clean-room RE prompt used for all rollouts.
- Charts: exp-traces/figures/ (per-task scores, leaderboard comparison, per-rollout variance).
Headlines (filtered pass rate per programbench info):
- Mean across 5 rollouts × 10 tasks: 96.9 (vs 95.8 for Claude Opus 4.7-xhigh on the same 10 under mini-swe-agent, single rollout)
- Beats the Opus 4.7-xhigh per-task mean on 8 of 10 tasks
- Per-rollout aggregate band: 95.8 – 98.0
- 2 confirmed full ✅ solves across 50 task-rollouts, both on cmatrix (v2-C-1 and v2-C-4, both 508/508)
- Best-of-5 aggregate: 98
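The headline figures above combine per-task scores in three ways. The sketch below shows one plausible reading of them, assuming the mean is taken over per-rollout averages, the band is the minimum and maximum of those averages, and best-of-5 takes each task's best rollout before averaging. The `summarize` helper and the toy scores are hypothetical and are not results from this repo; the actual aggregation is documented in exp-traces/README-best-of-5.md.

```python
from statistics import mean


def summarize(scores):
    """Aggregate per-task rollout scores.

    scores maps a task name to a list of per-rollout scores; every task
    must have the same number of rollouts.
    """
    rollout_counts = {len(v) for v in scores.values()}
    if len(rollout_counts) != 1:
        raise ValueError("every task needs the same number of rollouts")
    k = rollout_counts.pop()

    # Average over tasks within each rollout, giving one aggregate per rollout.
    per_rollout = [mean(s[i] for s in scores.values()) for i in range(k)]

    return {
        "mean": mean(per_rollout),                    # mean across rollouts x tasks
        "band": (min(per_rollout), max(per_rollout)), # per-rollout aggregate band
        "best_of_k": mean(max(s) for s in scores.values()),  # best rollout per task
    }


# Toy numbers for illustration only (two tasks, two rollouts).
toy = {"task_a": [100.0, 90.0], "task_b": [80.0, 100.0]}
summary = summarize(toy)
```

Because every task has the same number of rollouts, the mean of the per-rollout averages equals the plain mean over all task-rollouts.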
Two paths. Both produce a submission.tar.gz you can score with uv run programbench eval output/.
Spawns a Claude Code session inside the task's :task_cleanroom container with opus-experiment/CLAUDE.md mounted as project instructions, runs non-interactively to completion, and tars the result.
Prerequisites: docker; Node.js 20+; npm install -g @anthropic-ai/claude-code; an active local Claude Code session (run claude once on the host).
uv pip install programbench
bash scripts/run_claude_code.sh abishekvashok__cmatrix.5c082c6 output/my-run
uv run programbench eval output/my-run --branch-workers 4 --docker-cpus 4
uv run programbench info output/my-run
The script bind-mounts the host's node binary, the host's @anthropic-ai/claude-code package, and ~/.claude (your session credentials) into the task container, then runs claude -p with --permission-mode bypassPermissions and --output-format stream-json. The network sandbox is intentionally loose for the reproducer; the doctrine prompt prohibits source-finding and the agent is expected to comply.
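As a rough illustration of the bind-mount-and-run step described above, the sketch below assembles a `docker run` argument list without executing it. It is not a transcription of scripts/run_claude_code.sh: the `build_docker_argv` helper, the image name, every container-side path, the way `claude` is invoked inside the container, and the prompt text are all hypothetical. Only the three mounted items and the `-p`, `--permission-mode bypassPermissions`, and `--output-format stream-json` flags come from the description above.

```python
def build_docker_argv(image, mounts, claude_cmd, prompt):
    """Return an argv list for `docker run`; nothing is executed here.

    mounts is a list of (host_path, container_path, mode) tuples, where
    mode is "ro" or "rw". claude_cmd is the command that starts Claude
    Code inside the container (assumed; the real script may differ).
    """
    argv = ["docker", "run", "--rm"]
    for host_path, container_path, mode in mounts:
        argv += ["-v", f"{host_path}:{container_path}:{mode}"]
    argv.append(image)
    argv += list(claude_cmd)
    argv += [
        "-p", prompt,                               # non-interactive run
        "--permission-mode", "bypassPermissions",
        "--output-format", "stream-json",
    ]
    return argv


# All paths and names below are placeholders, not the script's real values.
example = build_docker_argv(
    image="example/image:task_cleanroom",
    mounts=[
        ("/usr/local/bin/node", "/opt/host/node", "ro"),
        ("/usr/local/lib/node_modules/@anthropic-ai/claude-code",
         "/opt/host/claude-code", "ro"),
        ("/home/user/.claude", "/root/.claude", "rw"),
    ],
    claude_cmd=["claude"],
    prompt="Follow the project instructions.",
)
```

Building the command as a list rather than a shell string avoids quoting problems if a path or the prompt contains spaces.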
A single-instance / batch runner that bridges mini-SWE-agent to the access token from ~/.claude/.credentials.json. See scripts/README.md.
bash scripts/setup.sh
export CLAUDE_CODE_OAUTH_TOKEN=$(python -c \
'import json,os; print(json.load(open(os.path.expanduser("~/.claude/.credentials.json")))["claudeAiOauth"]["accessToken"])')
uv run python scripts/programbench_mini.py \
--instance-id abishekvashok__cmatrix.5c082c6 \
--output-dir output/my-run \
--model claude-opus-4-7
uv run programbench eval output/my-run
uv run programbench info output/my-run
--doctrine is on by default and appends opus-experiment/CLAUDE.md to the paper's anti-cheat system prompt; pass --no-doctrine for the plain paper baseline.
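The `python -c` one-liner in the token export above fails with a raw traceback if the credentials file is missing or has a different shape. A slightly more defensive version might look like the sketch below. The file path and the `claudeAiOauth` / `accessToken` keys are taken from that one-liner; the `load_access_token` helper and its error messages are hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def load_access_token(path="~/.claude/.credentials.json"):
    """Read the Claude Code OAuth access token from the credentials file."""
    cred_path = Path(path).expanduser()
    try:
        data = json.loads(cred_path.read_text())
        token = data["claudeAiOauth"]["accessToken"]
    except FileNotFoundError as exc:
        raise RuntimeError(
            f"{cred_path} not found; run `claude` once on the host first"
        ) from exc
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        raise RuntimeError(
            f"{cred_path} does not contain claudeAiOauth.accessToken"
        ) from exc
    if not isinstance(token, str) or not token:
        raise RuntimeError(f"{cred_path} holds an empty or non-string token")
    return token
```

To use it in place of the one-liner, print the return value of `load_access_token()` from a small script and capture that output in `CLAUDE_CODE_OAUTH_TOKEN`.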
Upstream README:
Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior.
We recommend uv for managing Python environments.
# Run without installing
uvx programbench --help
# Or install into a project
uv pip install programbench
# Or with pip
pip install programbench
For development:
git clone https://github.com/facebookresearch/programbench.git
cd programbench
uv sync  # installs editable + dev dependencies
Note: For more details, please refer to the Usage Guide.
If our work was useful for you, please cite it:
@misc{yang2026programbenchlanguagemodelsrebuild,
title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
year={2026},
eprint={2605.03546},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2605.03546},
}
ProgramBench is licensed under the terms of the license found in LICENSE.
