11 changes: 10 additions & 1 deletion .github/actions/setup-bc-container/action.yml
@@ -32,7 +32,16 @@ runs:
# Mask the password in GitHub Actions logs
Write-Output "::add-mask::$password"

"BC_CONTAINER_NAME=bcbench-$("${{ inputs.instance-id }}".Split('-')[1])" | Out-File -FilePath $env:GITHUB_ENV -Append
# Extract numeric ticket ID from instance-id, ignoring __cf-N suffix for counterfactual entries
# e.g. "microsoftInternal__NAV-210528__cf-1" -> "210528", "microsoft__BCApps-4699" -> "4699"
$instanceId = "${{ inputs.instance-id }}"
if ($instanceId -match '[A-Za-z]+-(\d+)') {
$ticketNumber = $Matches[1]
} else {
$ticketNumber = $instanceId.Split('-')[1]
}

"BC_CONTAINER_NAME=bcbench-$ticketNumber" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_USERNAME=admin" | Out-File -FilePath $env:GITHUB_ENV -Append
"BC_CONTAINER_PASSWORD=$password" | Out-File -FilePath $env:GITHUB_ENV -Append
shell: pwsh
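For reference, the extraction logic in this hunk can be mirrored outside PowerShell (a hedged Python sketch; `extract_ticket_number` is a hypothetical helper name, not part of the action):

```python
import re

def extract_ticket_number(instance_id: str) -> str:
    """Extract the numeric ticket ID, ignoring any __cf-N suffix."""
    # First alphabetic run followed by "-<digits>", e.g. "NAV-210528" -> "210528".
    m = re.search(r"[A-Za-z]+-(\d+)", instance_id)
    if m:
        return m.group(1)
    # Fallback mirrors the original Split('-')[1] behaviour.
    return instance_id.split("-")[1]
```

Because `re.search` returns the first match, a trailing `__cf-1` segment never wins over the ticket segment earlier in the ID.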
87 changes: 87 additions & 0 deletions .github/prompts/create-counterfactual.prompt.md
@@ -0,0 +1,87 @@
---
description: "Create counterfactual (CF) dataset entries for BC-Bench. Provide the base instance_id and describe the code changes for each variant."
mode: agent
---

# Create Counterfactual Dataset Entries

You are helping create counterfactual (CF) entries for the BC-Bench benchmark dataset.

## Context

Read these files first to understand the workflow:
- `COUNTERFACTUAL.md` — authoring guide
- `dataset/bcbench.jsonl` — find the base entry by instance_id
- `dataset/counterfactual.jsonl` — existing CF entries (match format/key ordering)

## Input Required from User

The user will provide:
1. **Base instance_id** — e.g. `microsoftInternal__NAV-224009`
2. **CF variants** — for each variant:
- What code changes to make in `test/after/` (test modifications)
- What code changes to make in `fix/after/` (fix modifications, often unchanged)
- A short variant description
- The intervention type (`test-spec-change`, `fix-scope-change`, etc.)
3. **Problem statement** — either a pre-written README path or content to generate

## Workflow (per variant)

### Step 1: Analyze the base entry
```bash
python -c "import json; [print(json.dumps(json.loads(l), indent=2)) for l in open('dataset/bcbench.jsonl') if '<BASE_ID>' in l]"
```
- Understand the patch (fix) and test_patch (test) diffs
- Read the base problem statement from `dataset/problemstatement/<instance_id>/README.md`

### Step 2: Extract workspace
```bash
uv run bcbench dataset cf-extract <base_instance_id> -o cf-<short-name>
```
- Patch-only mode creates padded files — use `Get-Content ... | Where-Object { $_.Trim() }` to view content

### Step 3: Edit the after/ files
- Apply the user's described code changes to `test/after/` and/or `fix/after/`
- If the fix needs to be **reversed** (e.g. CF removes a filter instead of adding one), swap fix/before and fix/after contents:
```powershell
$before = Get-Content "fix\before\<path>" -Raw
$after = Get-Content "fix\after\<path>" -Raw
Set-Content "fix\before\<path>" -Value $after -NoNewline
Set-Content "fix\after\<path>" -Value $before -NoNewline
```
- Verify edits with `Get-Content ... | Where-Object { $_.Trim() }`

### Step 4: Create the CF entry
```bash
uv run bcbench dataset cf-create ./cf-<short-name> \
-d "<variant description>" \
-t "<intervention-type>"
```

**This command automatically handles:**
- Patch regeneration from before/after files
- `FAIL_TO_PASS` auto-detection from [Test] procedures in test patch
- `PASS_TO_PASS` auto-population from the base entry
- Canonical key ordering in counterfactual.jsonl
- Problem statement directory scaffolding (copies base README **and all image/asset files** as template)

### Step 5: Edit problem statement README
- If user provided a pre-written README, copy it to the scaffolded directory at `dataset/problemstatement/<cf_instance_id>/README.md`
- Otherwise, edit the scaffolded README to describe the variant
- **Images & assets are copied automatically** by `cf-create`. Verify with `Get-ChildItem dataset/problemstatement/<cf_instance_id>/` that all referenced images are present.

### Step 6: Verify
```bash
uv run pytest tests/test_dataset_integrity.py tests/test_counterfactual.py -q
```
Confirm all tests pass. Then briefly show the created entry's key fields.

## Key Rules
- Fix patch is usually **unchanged** from base (same bug fix, different test scenario)
- If the CF requires a **different** fix, the fix/after file should contain the CF's gold fix code
- Test patch is the primary thing that changes between variants
- **No manual key reordering needed** — cf-create handles this automatically
- **No manual PASS_TO_PASS needed** — cf-create copies from base entry automatically
- Problem statement directory naming: `<base_id>__cf-N` (double underscore + hyphen)

{{{ input }}}
1 change: 1 addition & 0 deletions .github/workflows/claude-evaluation.yml
@@ -23,6 +23,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
1 change: 1 addition & 0 deletions .github/workflows/copilot-evaluation.yml
@@ -31,6 +31,7 @@ on:
options:
- "bug-fix"
- "test-generation"
- "counterfactual-evaluation"
test-run:
description: "Indicate this is a test run (with few entries)"
required: false
9 changes: 9 additions & 0 deletions .github/workflows/get-entries.yml
@@ -15,6 +15,11 @@ on:
required: false
type: boolean
default: false
include-counterfactual:
description: Include counterfactual entries from counterfactual.jsonl
required: false
type: boolean
default: true
outputs:
entries:
description: JSON array of dataset entries
@@ -45,4 +50,8 @@ jobs:
cmd="$cmd --test-run"
fi

if [[ "${{ inputs.include-counterfactual }}" == "false" ]]; then
cmd="$cmd --no-include-counterfactual"
fi

eval "$cmd"
187 changes: 187 additions & 0 deletions COUNTERFACTUAL.md
@@ -0,0 +1,187 @@
# Counterfactual Dataset Authoring

This guide explains **what counterfactual (CF) entries are** and how to create them using the `bcbench dataset` CLI commands.

## What Are Counterfactual Entries?

A counterfactual entry is a **variant** of an existing base benchmark entry. It reuses the same repository state (repo, base commit, project paths) but provides a **different fix and test pair** — testing whether an agent can solve a related-but-different version of the same bug.

Each CF entry lives in [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) and references a base entry from [`dataset/bcbench.jsonl`](dataset/bcbench.jsonl).

**Example:** Base entry tests that all 4 emission fields are enabled. A CF variant tests that only 3 of 4 fields are required.

### Naming Convention

CF entries follow the pattern: `<base_instance_id>__cf-<N>`

```
microsoftInternal__NAV-210528 ← base entry
microsoftInternal__NAV-210528__cf-1 ← first counterfactual variant
microsoftInternal__NAV-210528__cf-2 ← second variant
```
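The pattern can be parsed mechanically; a minimal Python sketch (the `split_cf_id` helper is hypothetical, shown only to pin down the convention):

```python
import re

# One double underscore, then "cf-", then the variant number.
CF_ID = re.compile(r"^(?P<base>.+?)__cf-(?P<n>\d+)$")

def split_cf_id(instance_id: str):
    """Return (base_instance_id, variant_number), or None for a base entry."""
    m = CF_ID.match(instance_id)
    return (m.group("base"), int(m.group("n"))) if m else None
```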

## Authoring Workflow

The workflow has three steps: **extract** a workspace, **edit** the code, then **create** the CF entry.

### Step 1: Extract a workspace

```bash
uv run bcbench dataset cf-extract <base_entry_id> --output-dir ./my-cf-workspace
```

This creates a workspace directory with editable AL files:

```
my-cf-workspace/
├── fix/
│ ├── before/ # Original code before the fix
│ │ └── <path>.al
│ └── after/ # Fixed code — EDIT THIS
│ └── <path>.al
├── test/
│ ├── before/ # Original test code before the fix
│ │ └── <path>.al
│ └── after/ # Test code — EDIT THIS
│ └── <path>.al
└── workspace.json # Metadata (entry ID, file list, mode)
```

**Options:**

| Flag | Description |
| -------------------- | ------------------------------------------------------------------------------------------ |
| `--output-dir`, `-o` | Directory to create workspace in (default: `cf-workspace`) |
| `--repo-path`, `-r` | Path to cloned repo for full-fidelity extraction (extracts complete files, not just hunks) |

**Modes:**

- **Patch-only** (default, no `--repo-path`): Reconstructs files from patch hunks only. Files are padded with empty lines to preserve original line numbers. Fast, no repo needed.
- **Repo-based** (`--repo-path` provided): Checks out the base commit, copies full before/after files. Full fidelity, but requires a local clone.
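The padding behaviour of patch-only mode can be illustrated with a short sketch (assumed behaviour only; the actual reconstruction lives in `src/bcbench/dataset/cf_workspace.py`):

```python
def reconstruct_padded(hunks):
    """Rebuild a file from patch hunks, padding with blank lines so each
    hunk keeps its original line numbers.

    `hunks` is a list of (start_line, [lines]) pairs, 1-indexed.
    """
    end = max(start + len(lines) - 1 for start, lines in hunks)
    out = [""] * end  # blank padding everywhere a hunk did not cover
    for start, lines in hunks:
        out[start - 1:start - 1 + len(lines)] = lines
    return "\n".join(out)
```

A hunk starting at line 3 therefore lands on line 3 of the reconstructed file, with lines 1-2 left empty, which is why filtering with `Where-Object { $_.Trim() }` shows only real content.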

### Step 2: Edit the code

Open the workspace and modify the `after/` files:

- **`fix/after/`** — Change the fix (the code the agent needs to produce)
- **`test/after/`** — Change the tests (what defines success/failure)

Leave the `before/` files unchanged — they represent the original state.

### Step 3: Create the CF entry

```bash
uv run bcbench dataset cf-create ./my-cf-workspace \
--variant-description "Only 3 of 4 emission fields required" \
--intervention-type "test-spec-change"
```

This command:
1. Regenerates patches from your edited `before/` and `after/` files
2. Auto-detects `FAIL_TO_PASS` test procedures from the test patch
3. Assigns the next available `__cf-N` ID
4. Scaffolds a problem statement directory (copies base entry's README.md as template)
5. Appends the new entry to `dataset/counterfactual.jsonl`
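The `FAIL_TO_PASS` auto-detection in step 2 can be approximated as follows (a sketch assuming AL tests are declared as a `[Test]` attribute followed by a `procedure Name()`; the real detection is in `cf_workspace.py` and may differ):

```python
import re

def detect_fail_to_pass(test_patch: str):
    """Collect procedure names that follow a [Test] attribute among the
    lines added by the test patch."""
    added = [l[1:] for l in test_patch.splitlines()
             if l.startswith("+") and not l.startswith("+++")]
    names, pending = [], False
    for line in added:
        if line.strip() == "[Test]":
            pending = True  # next procedure declaration is a test
        elif pending:
            m = re.match(r"\s*(?:local\s+)?procedure\s+(\w+)", line)
            if m:
                names.append(m.group(1))
            pending = False
    return names
```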

**Options:**

| Flag | Description |
| ----------------------------- | ----------------------------------------------------------------------------- |
| `--variant-description`, `-d` | **Required.** Description of what this variant changes |
| `--intervention-type`, `-t` | Optional. Type of intervention (e.g., `test-spec-change`, `fix-scope-change`) |

### Step 4: Edit the problem statement

After creation, edit the scaffolded problem statement:

```
dataset/problemstatement/<entry_id>/README.md
```

This is copied from the base entry — update it to describe the counterfactual variant's specific requirements.

### Step 5: Commit and PR

```bash
git add dataset/counterfactual.jsonl dataset/problemstatement/
git commit -m "Add counterfactual variant: <description>"
```

## Full Example

```bash
# 1. Extract workspace from a base entry
uv run bcbench dataset cf-extract microsoftInternal__NAV-210528 --output-dir ./cf-sustainability

# 2. Edit the test to only check 3 emission fields instead of 4
# (open cf-sustainability/test/after/...SustCertificateTest.Codeunit.al and edit)

# 3. Edit the fix to only enable 3 fields
# (open cf-sustainability/fix/after/...SustainabilitySetup.Table.al and edit)

# 4. Create the CF entry
uv run bcbench dataset cf-create ./cf-sustainability \
-d "Only 3 of 4 emission fields required: omits Work/Machine Center Emissions" \
-t "test-spec-change"

# 5. Edit the problem statement
# (edit dataset/problemstatement/microsoftInternal__NAV-210528__cf-1/README.md)

# 6. Commit
git add dataset/ && git commit -m "Add CF variant for NAV-210528"
```

## Evaluating CF Entries

CF entries are evaluated using the same pipeline as bug-fix entries:

```bash
# Run agent on a CF entry
uv run bcbench run copilot microsoftInternal__NAV-210528__cf-1 \
--category counterfactual-evaluation \
--repo-path /path/to/NAV

# Full evaluation (build + test)
uv run bcbench evaluate copilot microsoftInternal__NAV-210528__cf-1 \
--category counterfactual-evaluation \
--repo-path /path/to/NAV
```

The `--category counterfactual-evaluation` flag tells BC-Bench to use the CF entry's patches and tests for evaluation. The system auto-detects CF entries by their `__cf-N` suffix.

## Listing CF Entries

```bash
# List all entries (includes CF entries by default)
uv run bcbench dataset list

# List without CF entries
uv run bcbench dataset list --no-include-counterfactual
```

## File Reference

| File | Purpose |
| ---------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| [`dataset/counterfactual.jsonl`](dataset/counterfactual.jsonl) | All CF entries (one JSON per line) |
| [`dataset/problemstatement/<id>/`](dataset/problemstatement/) | Problem statement for each CF entry |
| [`src/bcbench/dataset/cf_workspace.py`](src/bcbench/dataset/cf_workspace.py) | Core logic: extraction, patch regeneration, entry creation |
| [`src/bcbench/dataset/counterfactual_entry.py`](src/bcbench/dataset/counterfactual_entry.py) | CF entry Pydantic model |
| [`src/bcbench/dataset/counterfactual_loader.py`](src/bcbench/dataset/counterfactual_loader.py) | Loader for CF entries |
| [`src/bcbench/commands/dataset.py`](src/bcbench/commands/dataset.py) | CLI commands (`cf-extract`, `cf-create`) |

## CF Entry Schema

Each line in `counterfactual.jsonl` contains:

| Field | Description |
| ---------------------------- | ------------------------------------------------- |
| `instance_id` | `<base_id>__cf-<N>` — unique identifier |
| `base_instance_id` | ID of the base entry this variant is derived from |
| `variant_description` | Human-readable description of the variant |
| `intervention_type` | Optional categorization of the change type |
| `patch` | The counterfactual fix patch |
| `test_patch` | The counterfactual test patch |
| `FAIL_TO_PASS` | Tests that must fail before fix, pass after |
| `PASS_TO_PASS` | Tests that must pass both before and after |
| `problem_statement_override` | Path to the CF-specific problem statement |
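Put together, a CF entry line might look like this sketch (all field values are hypothetical placeholders; real entries carry full diff text in `patch` and `test_patch`):

```python
import json

# Illustrative only -- every value below is a made-up placeholder.
entry = {
    "instance_id": "microsoftInternal__NAV-210528__cf-1",
    "base_instance_id": "microsoftInternal__NAV-210528",
    "variant_description": "Only 3 of 4 emission fields required",
    "intervention_type": "test-spec-change",
    "patch": "diff --git ...",
    "test_patch": "diff --git ...",
    "FAIL_TO_PASS": ["VerifyThreeEmissionFields"],
    "PASS_TO_PASS": ["ExistingSmokeTest"],
    "problem_statement_override": "dataset/problemstatement/microsoftInternal__NAV-210528__cf-1/README.md",
}
line = json.dumps(entry)  # one JSON object per line of counterfactual.jsonl
```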