Two ways to reproduce a single ProgramBench rollout against our doctrine.
- `run_claude_code.sh`: spawns a non-interactive Claude Code session inside the task's `:task_cleanroom` container, with `opus-experiment/CLAUDE.md` mounted as project instructions. Tars `/work/` into `submission.tar.gz`. Matches our actual setup.
- `programbench_mini.py`: bridges mini-SWE-agent to the access token from your local Claude Code session. Single-instance or batch; runs `DefaultAgent` with the paper's anti-cheat system prompt + (by default) our doctrine appended.
- `anthropic_oauth.py`: mini-SWE-agent model class that reads the access token from `~/.claude/.credentials.json` and talks to the Anthropic API directly.
- `setup.sh`: one-shot bootstrap (installs `uv` if missing, clones `mini-swe-agent`, installs it into the venv).
uv pip install programbench
bash scripts/run_claude_code.sh abishekvashok__cmatrix.5c082c6 output/my-run
uv run programbench eval output/my-run --branch-workers 4 --docker-cpus 4
uv run programbench info output/my-run

Prerequisites:

- `docker`
- Node.js 20+ on the host (the script bind-mounts the host's `node` binary into the container)
- `npm install -g @anthropic-ai/claude-code` on the host (bind-mounted in too)
- An active Claude Code session locally (`~/.claude/` must exist; run `claude` once if it doesn't)
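
The prerequisites above can be checked before launching a run. The following is a minimal sketch, not part of the repository: the function names `parse_node_major` and `missing_prerequisites` are hypothetical, and it assumes the Claude Code CLI is installed on `PATH` as `claude`.

```python
import re
import shutil
import subprocess
from pathlib import Path


def parse_node_major(version_output: str) -> int:
    """Extract the major version from `node --version` output such as 'v20.11.0'."""
    match = re.match(r"v?(\d+)\.", version_output.strip())
    if match is None:
        raise ValueError(f"unrecognized Node.js version string: {version_output!r}")
    return int(match.group(1))


def missing_prerequisites(which=shutil.which, home=None, node_version=None):
    """Return a list of human-readable problems; an empty list means all checks passed.

    `which`, `home`, and `node_version` are injectable for testing. They are
    hypothetical parameters of this sketch, not something the run script exposes.
    """
    home = Path(home) if home is not None else Path.home()
    problems = []
    # Binaries the host needs: docker, node, and (assumed name) the claude CLI.
    for binary in ("docker", "node", "claude"):
        if which(binary) is None:
            problems.append(f"{binary} not found on PATH")
    # Only query the Node.js version when a node binary was found.
    if which("node") is not None:
        if node_version is None:
            node_version = subprocess.run(
                ["node", "--version"], capture_output=True, text=True, check=True
            ).stdout
        if parse_node_major(node_version) < 20:
            problems.append("Node.js 20+ is required")
    if not (home / ".claude").is_dir():
        problems.append("~/.claude/ does not exist; run `claude` once to create it")
    return problems
```

Calling `missing_prerequisites()` with no arguments inspects the real host; printing each returned string before invoking `run_claude_code.sh` gives a quick failure summary.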
Override the model with `CLAUDE_MODEL=claude-opus-4-7 bash scripts/run_claude_code.sh ...`.
Path B (`programbench_mini.py`) is faster to set up if you don't have Claude Code installed but do have a session token.
bash scripts/setup.sh
export CLAUDE_CODE_OAUTH_TOKEN=$(python -c \
'import json,os; print(json.load(open(os.path.expanduser("~/.claude/.credentials.json")))["claudeAiOauth"]["accessToken"])')
uv run python scripts/programbench_mini.py \
--instance-id abishekvashok__cmatrix.5c082c6 \
--output-dir output/my-run \
--model claude-opus-4-7
uv run programbench eval output/my-run
uv run programbench info output/my-run

`--doctrine` is on by default. Pass `--no-doctrine` for the plain paper baseline. Use `--config <file.yaml>` + `--workers N` for batch.
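
A default-on flag with a `--no-` counterpart, as described above, is commonly declared with `argparse.BooleanOptionalAction` (Python 3.9+). The sketch below is an assumption about how such a CLI could be wired, not the actual parser in `programbench_mini.py`; `build_parser`, `build_system_prompt`, and the defaults shown are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the flags named in this README."""
    parser = argparse.ArgumentParser(description="Illustrative CLI sketch")
    parser.add_argument("--instance-id", help="single task instance to run")
    parser.add_argument("--output-dir", required=True)
    parser.add_argument("--model", default=None)
    # BooleanOptionalAction generates both --doctrine and --no-doctrine.
    parser.add_argument(
        "--doctrine", action=argparse.BooleanOptionalAction, default=True
    )
    parser.add_argument("--config", help="batch configuration file (contents assumed)")
    parser.add_argument("--workers", type=int, default=1)
    return parser


def build_system_prompt(base_prompt: str, doctrine: str, use_doctrine: bool) -> str:
    """Append the doctrine to the base prompt only when it is enabled."""
    if not use_doctrine:
        return base_prompt
    return f"{base_prompt}\n\n{doctrine}"
```

With this shape, omitting the flag leaves `args.doctrine` as `True`, and `--no-doctrine` sets it to `False`, so the baseline prompt is returned unchanged.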
- Session tokens expire roughly every 8 hours. If a run errors with `401`, run `claude -p ok` locally to refresh and re-export the env var (path B), or just re-run the script (path A, which reads the live credentials file).
- Both paths run the agent inside `:task_cleanroom` (no source / no dev headers). Eval uses `:task` (which has the dev packages installed) so submissions that link system libraries work at scoring time.
- The network sandbox in path A is loose for simplicity; the doctrine prompt prohibits source-finding and the agent is expected to comply.
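
The 401 case in the first note can be anticipated by inspecting the credentials file before a run. This is a minimal sketch: the `claudeAiOauth` / `accessToken` keys come from the export command shown earlier, while the `expiresAt` field (taken here to be epoch milliseconds) and the helper names are assumptions that may not match your credentials file.

```python
import json
import os
import time

DEFAULT_CREDENTIALS = os.path.expanduser("~/.claude/.credentials.json")


def load_oauth_entry(path=DEFAULT_CREDENTIALS) -> dict:
    """Return the claudeAiOauth section of the Claude Code credentials file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["claudeAiOauth"]


def token_is_fresh(entry: dict, now=None, margin_seconds=300) -> bool:
    """Best-effort freshness check.

    ASSUMPTION: `expiresAt` holds an epoch timestamp in milliseconds. If the
    field is absent we cannot tell, so the token is reported as usable.
    """
    expires_at_ms = entry.get("expiresAt")
    if expires_at_ms is None:
        return True
    now = time.time() if now is None else now
    # Require at least `margin_seconds` of remaining lifetime.
    return expires_at_ms / 1000 - now > margin_seconds
```

If `token_is_fresh(load_oauth_entry())` returns `False`, refresh the session as described in the note before exporting `CLAUDE_CODE_OAUTH_TOKEN`.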