11 changes: 10 additions & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -32,7 +32,16 @@ runs:
# Mask the password in GitHub Actions logs
Write-Output "::add-mask::$password"

"BC_CONTAINER_NAME=bcbench-$("${{ inputs.instance-id }}".Split('-')[1])" | Out-File -FilePath $env:GITHUB_ENV -Append
# Extract numeric ticket ID from instance-id, ignoring __cf-N suffix for counterfactual entries
# e.g. "microsoftInternal__NAV-210528__cf-1" -> "210528", "microsoft__BCApps-4699" -> "4699"
$instanceId = "${{ inputs.instance-id }}"
if ($instanceId -match '[A-Za-z]+-(\d+)') {
$ticketNumber = $Matches[1]
} else {
$ticketNumber = $instanceId.Split('-')[1]
}

"BC_CONTAINER_NAME=bcbench-$ticketNumber" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_USERNAME=admin" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_PASSWORD=$password" | Out-File -FilePath $env:GITHUB_ENV -Append
shell: pwsh
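For reference, the extraction logic in this hunk can be mirrored outside PowerShell (a hedged Python sketch; `extract_ticket_number` is a hypothetical helper name, not part of the action):

```python
import re

def extract_ticket_number(instance_id: str) -> str:
    """Extract the numeric ticket ID, ignoring any __cf-N suffix."""
    # First alphabetic run followed by "-<digits>", e.g. "NAV-210528" -> "210528".
    m = re.search(r"[A-Za-z]+-(\d+)", instance_id)
    if m:
        return m.group(1)
    # Fallback mirrors the original Split('-')[1] behaviour.
    return instance_id.split("-")[1]
```

Because `re.search` returns the first match, a trailing `__cf-1` segment never wins over the ticket segment earlier in the ID.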
87 changes: 87 additions & 0 deletions .github/prompts/create-counterfactual.prompt.md
@@ -0,0 +1,87 @@
---
description: "Create counterfactual (CF) dataset entries for BC-Bench. Provide the base instance_id and describe the code changes for each variant."
mode: agent
---

# Create Counterfactual Dataset Entries

You are helping create counterfactual (CF) entries for the BC-Bench benchmark dataset.

## Context

Read these files first to understand the workflow:
- `COUNTERFACTUAL.md` — authoring guide
- `dataset/bcbench.jsonl` — find the base entry by instance_id
- `dataset/counterfactual.jsonl` — existing CF entries (match format/key ordering)

## Input Required from User

The user will provide:
1. **Base instance_id** — e.g. `microsoftInternal__NAV-224009`
2. **CF variants** — for each variant:
- What code changes to make in `test/after/` (test modifications)
- What code changes to make in `fix/after/` (fix modifications, often unchanged)
- A short variant description
- The intervention type (`test-spec-change`, `fix-scope-change`, etc.)
3. **Problem statement** — either a pre-written README path or content to generate

## Workflow (per variant)

### Step 1: Analyze the base entry
```bash
python -c "import json; [print(json.dumps(json.loads(l), indent=2)) for l in open('dataset/bcbench.jsonl') if '<BASE_ID>' in l]"
```
- Understand the patch (fix) and test_patch (test) diffs
- Read the base problem statement from `dataset/problemstatement/<instance_id>/README.md`

### Step 2: Extract workspace
```bash
uv run bcbench dataset cf-extract <base_instance_id> -o cf-<short-name>
```
- Patch-only mode creates padded files — use `Get-Content ... | Where-Object { $_.Trim() }` to view content

### Step 3: Edit the after/ files
- Apply the user's described code changes to `test/after/` and/or `fix/after/`
- If the fix needs to be **reversed** (e.g. CF removes a filter instead of adding one), swap fix/before and fix/after contents:
```powershell
$before = Get-Content "fix\before\<path>" -Raw
$after = Get-Content "fix\after\<path>" -Raw
Set-Content "fix\before\<path>" -Value $after -NoNewline
Set-Content "fix\after\<path>" -Value $before -NoNewline
```
- Verify edits with `Get-Content ... | Where-Object { $_.Trim() }`

### Step 4: Create the CF entry
```bash
uv run bcbench dataset cf-create ./cf-<short-name> \
-d "<variant description>" \
-t "<intervention-type>"
```

**This command automatically handles:**
- Patch regeneration from before/after files
- `FAIL_TO_PASS` auto-detection from [Test] procedures in test patch
- `PASS_TO_PASS` auto-population from the base entry
- Canonical key ordering in counterfactual.jsonl
- Problem statement directory scaffolding (copies base README **and all image/asset files** as template)

### Step 5: Edit problem statement README
- If user provided a pre-written README, copy it to the scaffolded directory at `dataset/problemstatement/<cf_instance_id>/README.md`
- Otherwise, edit the scaffolded README to describe the variant
- **Images & assets are copied automatically** by `cf-create`. Verify with `Get-ChildItem dataset/problemstatement/<cf_instance_id>/` that all referenced images are present.

### Step 6: Verify
```bash
uv run pytest tests/test_dataset_integrity.py tests/test_counterfactual.py -q
```
Confirm all tests pass. Then briefly show the created entry's key fields.

## Key Rules
- Fix patch is usually **unchanged** from base (same bug fix, different test scenario)
- If the CF requires a **different** fix, the fix/after file should contain the CF's gold fix code
- Test patch is the primary thing that changes between variants
- **No manual key reordering needed** — cf-create handles this automatically
- **No manual PASS_TO_PASS needed** — cf-create copies from base entry automatically
- Problem statement directory naming: `<base_id>__cf-N` (double underscore + hyphen)

{{{ input }}}
1 change: 1 addition & 0 deletions .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
1 change: 1 addition & 0 deletions .github/workflows/copilot-evaluation.yml
@@ -31,6 +31,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
9 changes: 9 additions & 0 deletions .github/workflows/get-entries.yml
@@ -15,6 +15,11 @@ on:
required: false
type: boolean
default: false
include-counterfactual:
description: Include counterfactual entries from counterfactual.jsonl
required: false
type: boolean
default: true
outputs:
entries:
description: JSON array of dataset entries
@@ -45,4 +50,8 @@ jobs:
cmd="$cmd --test-run"
fi

if [[ "${{ inputs.include-counterfactual }}" == "false" ]]; then
cmd="$cmd --no-include-counterfactual"
fi

eval "$cmd"
187 changes: 187 additions & 0 deletions COUNTERFACTUAL.md
@@ -0,0 +1,187 @@
# Counterfactual Dataset Authoring

This guide explains **what counterfactual (CF) entries are** and how to create them using the `bcbench dataset` CLI commands.

## What Are Counterfactual Entries?

A counterfactual entry is a **variant** of an existing base benchmark entry. It reuses the same repository state (repo, base commit, project paths) but provides a **different fix and test pair** — testing whether an agent can solve a related-but-different version of the same bug.

Each CF entry lives in [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) and references a base entry from [`dataset/bcbench.jsonl`](dataset/bcbench.jsonl).

**Example:** Base entry tests that all 4 emission fields are enabled. A CF variant tests that only 3 of 4 fields are required.

### Naming Convention

CF entries follow the pattern: `<base_instance_id>__cf-<N>`

```
microsoftInternal__NAV-210528 ← base entry
microsoftInternal__NAV-210528__cf-1 ← first counterfactual variant
microsoftInternal__NAV-210528__cf-2 ← second variant
```
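The pattern can be parsed mechanically; a minimal Python sketch (the `split_cf_id` helper is hypothetical, shown only to pin down the convention):

```python
import re

# One double underscore, then "cf-", then the variant number.
CF_ID = re.compile(r"^(?P<base>.+?)__cf-(?P<n>\d+)$")

def split_cf_id(instance_id: str):
    """Return (base_instance_id, variant_number), or None for a base entry."""
    m = CF_ID.match(instance_id)
    return (m.group("base"), int(m.group("n"))) if m else None
```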

## Authoring Workflow

The workflow has three steps: **extract** a workspace, **edit** the code, then **create** the CF entry.

### Step 1: Extract a workspace

```bash
uv run bcbench dataset cf-extract <base_entry_id> --output-dir ./my-cf-workspace
```

This creates a workspace directory with editable AL files:

```
my-cf-workspace/
├── fix/
│ ├── before/ # Original code before the fix
│ │ └── <path>.al
│ └── after/ # Fixed code — EDIT THIS
│ └── <path>.al
├── test/
│ ├── before/ # Original test code before the fix
│ │ └── <path>.al
│ └── after/ # Test code — EDIT THIS
│ └── <path>.al
└── workspace.json # Metadata (entry ID, file list, mode)
```

**Options:**

| Flag | Description |
| -------------------- | ------------------------------------------------------------------------------------------ |
| `--output-dir`, `-o` | Directory to create workspace in (default: `cf-workspace`) |
| `--repo-path`, `-r` | Path to cloned repo for full-fidelity extraction (extracts complete files, not just hunks) |

**Modes:**

- **Patch-only** (default, no `--repo-path`): Reconstructs files from patch hunks only. Files are padded with empty lines to preserve original line numbers. Fast, no repo needed.
- **Repo-based** (`--repo-path` provided): Checks out the base commit, copies full before/after files. Full fidelity, but requires a local clone.
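The padding behaviour of patch-only mode can be illustrated with a short sketch (assumed behaviour only; the actual reconstruction lives in `src/bcbench/dataset/cf_workspace.py`):

```python
def reconstruct_padded(hunks):
    """Rebuild a file from patch hunks, padding with blank lines so each
    hunk keeps its original line numbers.

    `hunks` is a list of (start_line, [lines]) pairs, 1-indexed.
    """
    end = max(start + len(lines) - 1 for start, lines in hunks)
    out = [""] * end  # blank padding everywhere a hunk did not cover
    for start, lines in hunks:
        out[start - 1:start - 1 + len(lines)] = lines
    return "\n".join(out)
```

A hunk starting at line 3 therefore lands on line 3 of the reconstructed file, with lines 1-2 left empty, which is why filtering with `Where-Object { $_.Trim() }` shows only real content.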

### Step 2: Edit the code

Open the workspace and modify the `after/` files:

- **`fix/after/`** — Change the fix (the code the agent needs to produce)
- **`test/after/`** — Change the tests (what defines success/failure)

Leave the `before/` files unchanged — they represent the original state.

### Step 3: Create the CF entry

```bash
uv run bcbench dataset cf-create ./my-cf-workspace \
--variant-description "Only 3 of 4 emission fields required" \
--intervention-type "test-spec-change"
```

This command:
1. Regenerates patches from your edited `before/` and `after/` files
2. Auto-detects `FAIL_TO_PASS` test procedures from the test patch
3. Assigns the next available `__cf-N` ID
4. Scaffolds a problem statement directory (copies base entry's README.md as template)
5. Appends the new entry to `dataset/counterfactual.jsonl`
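The `FAIL_TO_PASS` auto-detection in step 2 can be approximated as follows (a sketch assuming AL tests are declared as a `[Test]` attribute followed by a `procedure Name()`; the real detection is in `cf_workspace.py` and may differ):

```python
import re

def detect_fail_to_pass(test_patch: str):
    """Collect procedure names that follow a [Test] attribute among the
    lines added by the test patch."""
    added = [l[1:] for l in test_patch.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    names, pending = [], False
    for line in added:
        if line.strip() == "[Test]":
            pending = True  # next procedure declaration is a test
        elif pending:
            m = re.match(r"\s*(?:local\s+)?procedure\s+(\w+)", line)
            if m:
                names.append(m.group(1))
            pending = False
    return names
```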

**Options:**

| Flag | Description |
| ----------------------------- | ----------------------------------------------------------------------------- |
| `--variant-description`, `-d` | **Required.** Description of what this variant changes |
| `--intervention-type`, `-t` | Optional. Type of intervention (e.g., `test-spec-change`, `fix-scope-change`) |

### Step 4: Edit the problem statement

After creation, edit the scaffolded problem statement:

```
dataset/problemstatement/<entry_id>/README.md
```

This is copied from the base entry — update it to describe the counterfactual variant's specific requirements.

### Step 5: Commit and PR

```bash
git add dataset/counterfactual.jsonl dataset/problemstatement/
git commit -m "Add counterfactual variant: <description>"
```

## Full Example

```bash
# 1. Extract workspace from a base entry
uv run bcbench dataset cf-extract microsoftInternal__NAV-210528 --output-dir ./cf-sustainability

# 2. Edit the test to only check 3 emission fields instead of 4
# (open cf-sustainability/test/after/...SustCertificateTest.Codeunit.al and edit)

# 3. Edit the fix to only enable 3 fields
# (open cf-sustainability/fix/after/...SustainabilitySetup.Table.al and edit)

# 4. Create the CF entry
uv run bcbench dataset cf-create ./cf-sustainability \
-d "Only 3 of 4 emission fields required: omits Work/Machine Center Emissions" \
-t "test-spec-change"

# 5. Edit the problem statement
# (edit dataset/problemstatement/microsoftInternal__NAV-210528__cf-1/README.md)

# 6. Commit
git add dataset/ && git commit -m "Add CF variant for NAV-210528"
```

## Evaluating CF Entries

CF entries are evaluated using the same pipeline as bug-fix entries:

```bash
# Run agent on a CF entry
uv run bcbench run copilot microsoftInternal__NAV-210528__cf-1 \
--category counterfactual-evaluation \
--repo-path /path/to/NAV

# Full evaluation (build + test)
uv run bcbench evaluate copilot microsoftInternal__NAV-210528__cf-1 \
--category counterfactual-evaluation \
--repo-path /path/to/NAV
```

The `--category counterfactual-evaluation` flag tells BC-Bench to use the CF entry's patches and tests for evaluation. The system auto-detects CF entries by their `__cf-N` suffix.

## Listing CF Entries

```bash
# List all entries (includes CF entries by default)
uv run bcbench dataset list

# List without CF entries
uv run bcbench dataset list --no-include-counterfactual
```

## File Reference

| File | Purpose |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) | All CF entries (one JSON per line) |
| [`dataset/problemstatement/<id>/`](dataset/problemstatement/) | Problem statement for each CF entry |
| [`src/bcbench/dataset/cf_workspace.py`](src/bcbench/dataset/cf_workspace.py) | Core logic: extraction, patch regeneration, entry creation |
| [`src/bcbench/dataset/counterfactual_entry.py`](src/bcbench/dataset/counterfactual_entry.py) | CF entry Pydantic model |
| [`src/bcbench/dataset/counterfactual_loader.py`](src/bcbench/dataset/counterfactual_loader.py) | Loader for CF entries |
| [`src/bcbench/commands/dataset.py`](src/bcbench/commands/dataset.py) | CLI commands (`cf-extract`, `cf-create`) |

## CF Entry Schema

Each line in `counterfactual.jsonl` contains:

| Field | Description |
| ---------------------------- | ------------------------------------------------- |
| `instance_id` | `<base_id>__cf-<N>` — unique identifier |
| `base_instance_id` | ID of the base entry this variant is derived from |
| `variant_description` | Human-readable description of the variant |
| `intervention_type` | Optional categorization of the change type |
| `patch` | The counterfactual fix patch |
| `test_patch` | The counterfactual test patch |
| `FAIL_TO_PASS` | Tests that must fail before fix, pass after |
| `PASS_TO_PASS` | Tests that must pass both before and after |
| `problem_statement_override` | Path to the CF-specific problem statement |
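Put together, a CF entry line might look like this sketch (all field values are hypothetical placeholders; real entries carry full diff text in `patch` and `test_patch`):

```python
import json

# Illustrative only -- every value below is a made-up placeholder.
entry = {
    "instance_id": "microsoftInternal__NAV-210528__cf-1",
    "base_instance_id": "microsoftInternal__NAV-210528",
    "variant_description": "Only 3 of 4 emission fields required",
    "intervention_type": "test-spec-change",
    "patch": "diff --git ...",
    "test_patch": "diff --git ...",
    "FAIL_TO_PASS": ["VerifyThreeEmissionFields"],
    "PASS_TO_PASS": ["ExistingSmokeTest"],
    "problem_statement_override": "dataset/problemstatement/microsoftInternal__NAV-210528__cf-1/README.md",
}
line = json.dumps(entry)  # one JSON object per line of counterfactual.jsonl
```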