ProgramBench-experiments

Parsewave experiments on the ProgramBench reverse-engineering benchmark.

A fork of facebookresearch/programbench with our agent doctrine and rollout traces overlaid. Upstream README continues below.

Experiment results

The headline experiment in this repo is a mean-of-5 evaluation of Claude Opus 4.7 (max effort) on the easy-10 ProgramBench subset, using a clean-room reverse-engineering doctrine.

Scoreboard + methodology: exp-traces/README-best-of-5.md
Per-rollout traces (10 tasks × 5 rollouts = 50 task-traces): stored under exp-traces/cc-opus-easy10-v2-C-{1..5}/; each task directory contains the full trajectory.jsonl, submission.tar.gz, and eval.json.
Doctrine: opus-experiment/CLAUDE.md — the clean-room RE prompt used for all rollouts.
Charts: exp-traces/figures/ — per-task scores, leaderboard comparison, per-rollout variance.
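The trace layout described above can be walked with a few lines of Python. This is a minimal sketch, not a script shipped in the repo: the rollout directory pattern and the three file names come from the list above, while the assumption that each task directory is named after the task's instance ID, and the helper names `trace_paths` and `missing_traces`, are ours.

```python
from pathlib import Path

# Taken from the list above: five rollout directories, three files per task.
ROLLOUTS = range(1, 6)  # expands the {1..5} in the directory name
TRACE_FILES = ("trajectory.jsonl", "submission.tar.gz", "eval.json")


def trace_paths(root: Path, task_id: str) -> list:
    """Return the expected trace files for one task across all five rollouts.

    Assumption: the task directory is named after the instance ID.
    """
    return [
        root / f"cc-opus-easy10-v2-C-{n}" / task_id / name
        for n in ROLLOUTS
        for name in TRACE_FILES
    ]


def missing_traces(root: Path, task_id: str) -> list:
    """Return the expected files that are not present on disk."""
    return [p for p in trace_paths(root, task_id) if not p.is_file()]
```

For example, `missing_traces(Path("exp-traces"), "abishekvashok__cmatrix.5c082c6")` should return an empty list on a complete checkout if the naming assumption holds.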

Headlines (filtered pass rate per programbench info):

Mean across 5 rollouts × 10 tasks: 96.9 (vs. 95.8 for Claude Opus 4.7-xhigh on the same 10 tasks under mini-SWE-agent, single rollout)
Beats the Opus 4.7-xhigh per-task mean on 8 of 10 tasks
Per-rollout aggregate band: 95.8 – 98.0
2 confirmed full ✅ solves across 50 task-rollouts, both on cmatrix (v2-C-1 and v2-C-4, both 508/508)
Best-of-5 aggregate: 98
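To make the aggregate terms above concrete, here is a small sketch of how such numbers could be derived from a rollout × task score matrix. It assumes every task-rollout yields one pass rate in percent and that aggregates are unweighted means; the scores are illustrative placeholders, not the repo's results. The headline list does not say how the best-of-5 aggregate is formed, so the sketch shows one plausible reading (best score per task, then averaged) alongside the per-rollout band, whose upper end is the best single rollout.

```python
from statistics import mean

# Illustrative scores only (NOT the repo's results): SCORES[rollout][task],
# each value a pass rate in percent.
SCORES = [
    [100.0, 90.0, 80.0],
    [90.0, 100.0, 86.0],
]


def rollout_aggregates(scores):
    """Unweighted mean over tasks, one value per rollout."""
    return [mean(rollout) for rollout in scores]


def mean_of_rollouts(scores):
    """Mean of the per-rollout aggregates (equal task counts assumed)."""
    return mean(rollout_aggregates(scores))


def rollout_band(scores):
    """Lowest and highest per-rollout aggregate."""
    aggregates = rollout_aggregates(scores)
    return min(aggregates), max(aggregates)


def best_per_task_aggregate(scores):
    """One reading of 'best-of-N': best score per task, then the mean."""
    return mean([max(task_scores) for task_scores in zip(*scores)])
```

Under the per-task reading, the best-of-N aggregate can never fall below the top of the per-rollout band, which is a quick sanity check when recomputing the scoreboard.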

Reproducing a single rollout

Two paths. Both produce a submission.tar.gz you can score with uv run programbench eval output/.

A. Via Claude Code (matches our setup)

Spawns a Claude Code session inside the task's :task_cleanroom container with opus-experiment/CLAUDE.md mounted as project instructions, runs non-interactively to completion, and tars the result.

Prerequisites: docker; Node.js 20+; npm install -g @anthropic-ai/claude-code; an active local Claude Code session (run claude once on the host).

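# Install the ProgramBench CLI
uv pip install programbench

# Run one rollout on the cmatrix task; output goes to output/my-run
bash scripts/run_claude_code.sh abishekvashok__cmatrix.5c082c6 output/my-run

# Score the submission, then report the pass rate
uv run programbench eval output/my-run --branch-workers 4 --docker-cpus 4
uv run programbench info output/my-run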

The script bind-mounts the host's node binary, the host's @anthropic-ai/claude-code package, and ~/.claude (your session credentials) into the task container, then runs claude -p with --permission-mode bypassPermissions and --output-format stream-json. The network sandbox is intentionally loose for the reproducer; the doctrine prompt prohibits source-finding and the agent is expected to comply.
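The mounting scheme described above can be pictured as the argument list of a `docker run` call. The sketch below only assembles that list and is not the contents of scripts/run_claude_code.sh: the container-side paths, the `cli.js` entry point, and the function name `build_docker_argv` are hypothetical, and the real script may pass further flags. Only the three mounts and the `claude` flags are taken from the description.

```python
def build_docker_argv(image, node_bin, claude_pkg, claude_home, prompt):
    """Assemble a `docker run` argv mirroring the mounts described above.

    All container-side paths below are hypothetical placeholders.
    """
    mounts = [
        (node_bin, "/usr/local/bin/node"),  # host node binary
        (claude_pkg, "/opt/claude-code"),   # host @anthropic-ai/claude-code package
        (claude_home, "/root/.claude"),     # session credentials
    ]
    argv = ["docker", "run", "--rm"]
    for host_path, container_path in mounts:
        argv += ["-v", f"{host_path}:{container_path}"]
    argv += [
        image,  # the task's image with the :task_cleanroom tag
        "node", "/opt/claude-code/cli.js",  # hypothetical entry point
        "-p", prompt,  # non-interactive (print) mode
        "--permission-mode", "bypassPermissions",
        "--output-format", "stream-json",
    ]
    return argv
```

Building the list separately from executing it keeps the mounts easy to inspect before anything touches your credentials directory; passing the list to `subprocess.run` would start the container.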

B. Via mini-SWE-agent (uses your Claude Code session token, no Claude Code CLI required)

A single-instance / batch runner that bridges mini-SWE-agent to the access token from ~/.claude/.credentials.json. See scripts/README.md.

# Run the setup script
bash scripts/setup.sh

# Export the access token from your local Claude Code credentials
export CLAUDE_CODE_OAUTH_TOKEN=$(python -c \
  'import json,os; print(json.load(open(os.path.expanduser("~/.claude/.credentials.json")))["claudeAiOauth"]["accessToken"])')

# Run a single instance
uv run python scripts/programbench_mini.py \
  --instance-id abishekvashok__cmatrix.5c082c6 \
  --output-dir output/my-run \
  --model claude-opus-4-7

# Score the submission, then report the pass rate
uv run programbench eval output/my-run
uv run programbench info output/my-run

--doctrine is on by default and appends opus-experiment/CLAUDE.md to the paper's anti-cheat system prompt; pass --no-doctrine for the plain paper baseline.
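The token export step in the commands above is a one-liner that fails with a bare traceback if the credentials file is missing or has a different shape. A slightly more defensive version might look like the following sketch; the file path and the `claudeAiOauth` / `accessToken` keys are taken from that one-liner, while the function name `read_access_token` and the error messages are our own.

```python
import json
from pathlib import Path
from typing import Optional


def read_access_token(path: Optional[Path] = None) -> str:
    """Return the access token stored by a local Claude Code session."""
    path = path or Path.home() / ".claude" / ".credentials.json"
    try:
        data = json.loads(path.read_text())
    except FileNotFoundError:
        # The prerequisites ask you to run `claude` once on the host first.
        raise SystemExit(f"{path} not found; run `claude` once on the host")
    try:
        return data["claudeAiOauth"]["accessToken"]
    except (KeyError, TypeError):
        raise SystemExit(f"{path} has no claudeAiOauth.accessToken entry")


if __name__ == "__main__":
    print(read_access_token())
```

Saved as a script, its output can be captured into `CLAUDE_CODE_OAUTH_TOKEN` the same way the inline `python -c` call is.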

Upstream README:

Given only a compiled binary and its documentation, AI agents must architect and implement a complete codebase that reproduces the original program's behavior.

Links

Quickstart

We recommend uv for managing Python environments.

# Run without installing
uvx programbench --help

# Or install into a project
uv pip install programbench

# Or with pip
pip install programbench

For development:

git clone https://github.com/facebookresearch/programbench.git
cd programbench
uv sync  # installs editable + dev dependencies

Note: For more details, please refer to the Usage Guide.

Citation

If our work was useful for you, please cite it:

@misc{yang2026programbenchlanguagemodelsrebuild,
    title={ProgramBench: Can Language Models Rebuild Programs From Scratch?},
    author={John Yang and Kilian Lieret and Jeffrey Ma and Parth Thakkar and Dmitrii Pedchenko and Sten Sootla and Emily McMilin and Pengcheng Yin and Rui Hou and Gabriel Synnaeve and Diyi Yang and Ofir Press},
    year={2026},
    eprint={2605.03546},
    archivePrefix={arXiv},
    primaryClass={cs.SE},
    url={https://arxiv.org/abs/2605.03546},
}

License

ProgramBench is licensed under the terms of the license found in LICENSE.
