ProgramBench-experiments/exp-traces/figures/writeup.md at main · parsewave/ProgramBench-experiments

In a series of experiments at Parsewave, we evaluated Claude Opus 4.7 (max effort), with Claude Code as the harness, on the 10 easiest ProgramBench tasks under a clean-room reverse-engineering doctrine. We ran 5 independent rollouts per task and report the mean over those 5. Our mean filtered pass rate was 96.9, versus 95.8 for Opus 4.7-xhigh on the same 10 tasks (mini-swe-agent harness, single rollout). With our doctrine, Opus 4.7 fully solved the cmatrix task in 2 of 5 rollouts and beat the Opus 4.7-xhigh mini-swe-agent baseline on 8 of 10 tasks on the strict per-task mean, not just on best-of-5. The failure classes that recurred across the 10 tasks were CLI argument edge cases in most tasks and TUI-based behaviour in some tasks, which the agent struggles to audit and replicate before packaging the submission.
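To make the difference between the strict per-task mean and best-of-5 concrete, here is a minimal Python sketch of how the two comparisons could be computed. The task names, scores, and baseline values are hypothetical placeholders, not our results, and the helper names are illustrative. The sketch also does not model how the filtered pass rate is filtered; it assumes each rollout already has a pass rate between 0 and 100.

```python
from statistics import mean


def per_task_mean(rollouts):
    """Strict aggregate: average each task's pass rate over all its rollouts."""
    return {task: mean(scores) for task, scores in rollouts.items()}


def per_task_best(rollouts):
    """Lenient aggregate: keep only each task's best rollout (best-of-n)."""
    return {task: max(scores) for task, scores in rollouts.items()}


def tasks_beating_baseline(per_task, baseline):
    """Names of tasks whose aggregated score is strictly above the baseline."""
    return sorted(t for t, score in per_task.items() if score > baseline[t])


def full_solves(scores, full=100.0):
    """Number of rollouts that reached a 100% pass rate."""
    return sum(1 for s in scores if s == full)


# Hypothetical data: 5 rollouts per task, plus a single-rollout baseline.
rollouts = {
    "task_a": [100.0, 90.0, 100.0, 95.0, 100.0],
    "task_b": [80.0, 85.0, 90.0, 75.0, 70.0],
}
baseline = {"task_a": 96.0, "task_b": 82.0}

means = per_task_mean(rollouts)
print(means)
print(tasks_beating_baseline(means, baseline))
print(tasks_beating_baseline(per_task_best(rollouts), baseline))
print(full_solves(rollouts["task_a"]))
```

In this toy data, `task_b` beats the baseline on best-of-5 but not on the per-task mean, which is why the per-task mean is the stricter of the two comparisons.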
