A benchmark for evaluating coding agents on real-world Business Central (AL) development tasks, inspired by SWE-Bench.
BC-Bench provides a reproducible evaluation framework for coding agents working on real-world Business Central development tasks:
- Measure performance of different models on authentic AL issues
- Quantify the impact of tooling changes (MCP servers, custom instructions, custom agents, etc.)
- Track progress with transparent, comparable metrics over time
- Rapidly iterate on agent configurations and setups
We follow the SWE-Bench schema with BC-specific adjustments:
- `environment_setup_commit` and `version` are combined into `environment_setup_version`
- `project_paths` enumerates the AL project roots touched by the fix
- `problem_statement` and `hints_text` are not included in the jsonl file; they are stored under `problemstatement/` so repro steps can include screenshots
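To make the adjustments concrete, here is a sketch of what a single task record in the jsonl file could look like under this schema. All repo names, commits, versions, and paths below are hypothetical placeholders, not actual dataset entries:

```python
import json

# Hypothetical BC-Bench task record following the adjusted SWE-Bench schema.
# Every value below is illustrative; only the field names come from the schema notes.
record = {
    "instance_id": "example-org__example-bc-app-42",   # hypothetical id
    "repo": "example-org/example-bc-app",              # hypothetical repo
    "base_commit": "abc1234",                          # hypothetical commit
    "environment_setup_version": "26.0",               # replaces environment_setup_commit + version
    "project_paths": ["App", "Test"],                  # AL project roots touched by the fix
    "patch": "...",                                    # gold patch (elided)
    "test_patch": "...",                               # tests introduced by the fix (elided)
    # problem_statement / hints_text are deliberately absent from the jsonl;
    # they live under problemstatement/ so repro steps can include screenshots.
}

line = json.dumps(record)      # one task per line in the .jsonl file
parsed = json.loads(line)
print(parsed["environment_setup_version"])  # → 26.0
print(parsed["project_paths"])              # → ['App', 'Test']
```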
A minimal agent loop based on mini-swe-agent. Its simplicity makes it ideal for establishing baseline performance. See mini-bc-agent.
The GitHub Copilot CLI supports MCP servers, tools, and agent mode. It closely mirrors real developers' workflows (both VS Code and Coding Agent), making it a strong candidate for evaluating automated workflows.
Claude Code is Anthropic's agentic coding tool. It supports MCP servers, custom system prompts, and agent mode. BC-Bench integrates with Claude Code using the same shared configuration as Copilot.
BC-Bench is open source, and you're welcome to fork and adapt it for your own use. We are not accepting external contributions in this repository at this time. You can run evaluations locally and replace the dataset under dataset/ with tasks from your own codebase.
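As a sketch of what swapping in your own tasks involves, assuming a SWE-Bench-style jsonl layout (one JSON object per line; the file name and helper names below are illustrative, not part of BC-Bench):

```python
import json
from pathlib import Path

def load_tasks(path):
    """Read a SWE-Bench-style .jsonl file: one JSON task object per line."""
    tasks = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if line.strip():  # skip blank lines
            tasks.append(json.loads(line))
    return tasks

def write_tasks(tasks, path):
    """Write tasks back out, one JSON object per line."""
    Path(path).write_text(
        "\n".join(json.dumps(t) for t in tasks) + "\n",
        encoding="utf-8",
    )
```

With helpers like these, replacing the dataset is just writing records for your own codebase's issues to a jsonl file under dataset/.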