| pretty_name | ERP-Bench |
|---|---|

ERP-Bench is the Odoo 19 benchmark used in the Anchor paper, "Preventing Artifact Drift in Agent Benchmark Generation." It contains 300 long-horizon procurement and manufacturing tasks generated from a single solved specification.
Anchor's central claim is that benchmark tasks should not be hand-assembled from separate instructions, environments, oracle solutions, and verifiers. In this repo, each task is compiled from one CP-SAT-backed procurement specification into:
- instruction.md: the natural-language business task.
- environment/: a seeded Odoo database and runtime.
- solution/: the solver-certified reference plan.
- tests/: a terminal-state verifier over Odoo records.
The result is a benchmark where rewards are tied to end-state business correctness, not to a particular action trace.
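To make the end-state idea concrete, here is a minimal sketch in plain Python. It is not the benchmark's verifier: `PurchaseLine`, `terminal_state`, and `reward` are hypothetical names, and real checks would read Odoo records rather than in-memory objects. It only illustrates that two different action orders reaching the same final state earn the same reward, while an empty state earns none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseLine:
    """Hypothetical stand-in for a confirmed purchase order line."""
    product: str
    vendor: str
    quantity: int


def terminal_state(lines):
    """Collapse confirmed lines into total quantity per product."""
    totals = {}
    for line in lines:
        totals[line.product] = totals.get(line.product, 0) + line.quantity
    return totals


def reward(lines, demand):
    """Return 1.0 if every demanded product is covered, else 0.0."""
    totals = terminal_state(lines)
    covered = all(totals.get(p, 0) >= q for p, q in demand.items())
    return 1.0 if covered else 0.0


demand = {"bolt": 100, "panel": 20}

# Two runs reach the same end state through different action sequences.
trace_a = [PurchaseLine("bolt", "V1", 100), PurchaseLine("panel", "V2", 20)]
trace_b = [
    PurchaseLine("panel", "V2", 20),
    PurchaseLine("bolt", "V1", 60),
    PurchaseLine("bolt", "V1", 40),
]

print(reward(trace_a, demand), reward(trace_b, demand))  # 1.0 1.0
print(reward([], demand))  # 0.0, a no-op leaves demand uncovered
```

The reward function never inspects the order or number of actions, only the aggregated final quantities.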
Extended Croissant metadata is provided in croissant.json. It preserves the Hugging Face dataset identity and adds Croissant RAI fields for data collection, intended uses, limitations, known biases, sensitive-information handling, social impact, and release maintenance. The RAI-only overlay is also available in croissant_rai.json.
- 300 generated Harbor tasks in tasks/.
- One Odoo 19 environment per task.
- Procurement and manufacturing workflows spanning 29 task patterns.
- Known optimal solutions generated with OR-Tools CP-SAT.
- Verifiers that score constraint satisfaction, optimality, and traceability.
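The three verifier dimensions named above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in tests/checks.py: the plan layout, the field names (`qty`, `unit_price`, `source_doc`), and the reading of traceability as "every line carries a source document reference" are all hypothetical.

```python
def score_plan(plan, demand, optimal_cost, tolerance=1e-6):
    """Score a plan on three hypothetical components.

    plan: {"lines": [{"product", "qty", "unit_price", "source_doc"}, ...]}
    demand: {product: required quantity}
    optimal_cost: cost of the solver-certified reference plan
    """
    lines = plan["lines"]

    # Constraint satisfaction: total ordered quantity covers demand.
    totals = {}
    for line in lines:
        totals[line["product"]] = totals.get(line["product"], 0) + line["qty"]
    constraints_ok = all(totals.get(p, 0) >= q for p, q in demand.items())

    # Optimality: feasible and no more expensive than the known optimum.
    cost = sum(line["qty"] * line["unit_price"] for line in lines)
    optimal = constraints_ok and cost <= optimal_cost + tolerance

    # Traceability (assumed meaning): each line references a source document.
    traceable = bool(lines) and all(line.get("source_doc") for line in lines)

    return {
        "constraints": constraints_ok,
        "optimality": optimal,
        "traceability": traceable,
    }


demand = {"bolt": 10}
good = {"lines": [{"product": "bolt", "qty": 10, "unit_price": 2.0,
                   "source_doc": "PO0001"}]}
pricey = {"lines": [{"product": "bolt", "qty": 10, "unit_price": 3.0,
                     "source_doc": "PO0002"}]}

print(score_plan(good, demand, optimal_cost=20.0))
print(score_plan(pricey, demand, optimal_cost=20.0))
```

The second plan satisfies demand but costs more than the reference optimum, so only its optimality component fails.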
Example tasks:
tasks/2000_easy_01_buy_only_baseline/
tasks/2053_medium_07_screened_buy_only_mixed_seeded_invoicing/
tasks/2299_hard_repair_plan_hard/
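For selecting subsets of tasks, the directory names can be split into parts. The sketch below assumes a `<number>_<difficulty>_<pattern>` naming scheme inferred only from the three examples above; the scheme is not documented here, and `parse_task_name` and `group_by_difficulty` are hypothetical helpers.

```python
import re
from collections import defaultdict

# Assumed scheme: leading task number, then difficulty, then a free-form pattern.
TASK_NAME = re.compile(
    r"^(?P<number>\d+)_(?P<difficulty>easy|medium|hard)_(?P<pattern>.+)$"
)


def parse_task_name(path):
    """Return (number, difficulty, pattern) for a task directory path."""
    name = path.strip("/").split("/")[-1]
    match = TASK_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized task name: {name!r}")
    return int(match["number"]), match["difficulty"], match["pattern"]


def group_by_difficulty(paths):
    """Map each difficulty to the task numbers found under it."""
    groups = defaultdict(list)
    for path in paths:
        number, difficulty, _ = parse_task_name(path)
        groups[difficulty].append(number)
    return dict(groups)


examples = [
    "tasks/2000_easy_01_buy_only_baseline/",
    "tasks/2053_medium_07_screened_buy_only_mixed_seeded_invoicing/",
    "tasks/2299_hard_repair_plan_hard/",
]

print(parse_task_name(examples[2]))   # (2299, 'hard', 'repair_plan_hard')
print(group_by_difficulty(examples))  # {'easy': [2000], 'medium': [2053], 'hard': [2299]}
```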
uv sync
uv tool install harbor

For Daytona runs:

export DAYTONA_API_KEY=...

For model-backed agents, also set the relevant provider key, such as ANTHROPIC_API_KEY.
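Before launching a long run, it can help to confirm that the keys are actually set. The following is a small preflight sketch; `missing_keys` is a hypothetical helper, not part of Harbor or this repo, and the demo uses a fake environment mapping so it does not depend on the real shell.

```python
import os


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Demo with a fake environment where only the Daytona key is present.
fake_env = {"DAYTONA_API_KEY": "placeholder"}
needed = ["DAYTONA_API_KEY", "ANTHROPIC_API_KEY"]

print(missing_keys(needed, env=fake_env))  # ['ANTHROPIC_API_KEY']
```

Calling `missing_keys(needed)` with no `env` argument checks the current process environment instead.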
Use Harbor path mode against the checked-in tasks.
# Reference solution should solve the task.
harbor run -p tasks/2000_easy_01_buy_only_baseline -a oracle --env daytona
# No-op should receive zero reward.
harbor run -p tasks/2000_easy_01_buy_only_baseline -a nop --env daytona
# Run the full benchmark with parallelism.
harbor run -p tasks -a oracle --env daytona -n 10

Local Docker fallback:

harbor run -p tasks/2000_easy_01_buy_only_baseline -a oracle --env docker

Run a model-backed agent:

harbor run -p tasks \
  -a claude-code \
  -m claude-sonnet-4-5-20250929 \
  --env daytona \
  -n 10

The current 300-task dataset is defined by erp_bench/procurement/examples/diverse_300_dataset.toml.

uv run generate-tasks \
  --category procurement \
  --dataset-config erp_bench/procurement/examples/diverse_300_dataset.toml \
  --output tasks \
  --force

Generate a small local batch:

uv run generate-tasks \
  --category procurement \
  --difficulty easy \
  --count 3 \
  --start-number 9000 \
  --output tasks

tasks/<task_name>/
├── task.toml
├── instruction.md
├── environment/
│   ├── Dockerfile
│   ├── entrypoint.sh
│   ├── odoo.conf
│   ├── scenario_data.json
│   └── setup_scenario.py
├── tests/
│   ├── test.sh
│   └── checks.py
└── solution/
    ├── solve.sh
    ├── solver.py
    └── optimal_plan.json
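
The layout above lends itself to a quick structural check after generation. This is a minimal sketch: `EXPECTED_FILES` simply mirrors the tree, and `missing_files` is a hypothetical helper rather than something shipped in the repo. The demo builds a throwaway task directory that deliberately omits one file.

```python
import tempfile
from pathlib import Path

# Relative paths copied from the task layout tree.
EXPECTED_FILES = [
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/entrypoint.sh",
    "environment/odoo.conf",
    "environment/scenario_data.json",
    "environment/setup_scenario.py",
    "tests/test.sh",
    "tests/checks.py",
    "solution/solve.sh",
    "solution/solver.py",
    "solution/optimal_plan.json",
]


def missing_files(task_dir):
    """List expected files that are absent from a task directory."""
    root = Path(task_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Demo: create every expected file except the last one, then check.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "9000_easy_example"
    for rel in EXPECTED_FILES[:-1]:
        path = task / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    report = missing_files(task)

print(report)  # ['solution/optimal_plan.json']
```
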
- erp_bench/procurement/: procurement configs, sampler, solver, prompts, and objectives.
- erp_bench/generation/: generic generation CLI and rendering pipeline.
- erp_bench/templates/procurement/supply_planning/: task artifact templates.
- schemas/: Pydantic schemas and shared validation.
- agents/: lightweight pi-based agent harness helpers used for ERP-Bench experiments.
uv run ruff check .
uv run ty check

After changing schemas, solver logic, or templates, regenerate affected tasks and validate with both nop and oracle.
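
The nop and oracle validation pass can be scripted. The sketch below only builds and prints command lines, using the `harbor run` flags that appear earlier in this README (`-p`, `-a`, `-m`, `--env`, `-n`); `harbor_command` and `validation_commands` are hypothetical helpers, and nothing is executed.

```python
import shlex


def harbor_command(task_path, agent, env="daytona", concurrency=None, model=None):
    """Build a `harbor run` argument list from flags shown in this README."""
    cmd = ["harbor", "run", "-p", str(task_path), "-a", agent]
    if model:
        cmd += ["-m", model]
    cmd += ["--env", env]
    if concurrency is not None:
        cmd += ["-n", str(concurrency)]
    return cmd


def validation_commands(task_paths, env="daytona"):
    """Pair every task with a nop run and an oracle run."""
    return [
        harbor_command(path, agent, env=env)
        for path in task_paths
        for agent in ("nop", "oracle")
    ]


commands = validation_commands(["tasks/2000_easy_01_buy_only_baseline"], env="docker")
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to actually launch the run; the expectation stated earlier is zero reward for nop and a solved task for oracle.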