bc1plainview · Mar 13, 2026
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "buidl",
-  "version": "6.0.0",
-  "description": "Full dev lifecycle for OP_NET Bitcoin L1 projects: idea → challenge → spec → build → review → ship. Self-learning across sessions with pattern extraction, agent performance scoring, score-based finding routing, project-type profiles, cross-layer validation, and starter templates. Includes shell-enforced E2E testing gates, frontend runtime smoke checks, PUA problem-solving methodology, the OP_NET Bible (2000+ lines), cross-agent critique, adversarial auditing, adversarial E2E testing, ABI-lock checkpoints, findings ledger with regression tracking, acceptance test generation, chain probe, hard gate enforcement, incremental audits, dry-run mode, execution tracing, dynamic re-planning from learned patterns, dynamic knowledge slice loading, property-based fuzz testing, and stale pattern pruning. Agents get smarter with every project.",
+  "version": "7.0.0",
+  "description": "Full dev lifecycle for OP_NET Bitcoin L1 projects: idea → challenge → spec → build → review → ship. Self-learning across sessions with pattern extraction, agent performance scoring, score-based finding routing, project-type profiles, cross-layer validation, and starter templates. Includes shell-enforced E2E testing gates, frontend runtime smoke checks, PUA problem-solving methodology, the OP_NET Bible (2000+ lines), cross-agent critique, adversarial auditing, adversarial E2E testing, ABI-lock checkpoints, findings ledger with regression tracking, acceptance test generation, chain probe, hard gate enforcement, incremental audits, dry-run mode, execution tracing, dynamic re-planning from learned patterns, dynamic knowledge slice loading, property-based fuzz testing, stale pattern pruning, mutation testing as loop exit gate, structured repair phases (R1/R2/R3), goal-oriented build evaluation, hierarchical repo map, and autoresearch optimize mode. Agents get smarter with every project.",
   "author": {
     "name": "dannyplainview + bob"
   }

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,30 @@
 # Changelog
 
+## [7.0.0] - 2026-03-13
+
+### Added
+- **Mutation testing as loop exit gate** (`scripts/mutate-contract.sh`): Applies 20 sed-level mutation operators to contract source files. For each mutant: creates a temp copy, applies the mutation, compiles, runs tests. If tests fail, the mutant is killed (good). If tests pass or compilation fails, the mutant survived (bad). Outputs `artifacts/testing/mutation-score.json` with total_mutants, killed, survived, compile_errors, mutation_score (0-1), threshold (0.70), verdict (PASS/FAIL), and survivors list. Phase 5 runs this gate before the reviewer -- score below 0.70 routes back to contract-dev with the survivors list.
+- **Structured repair phases** (Agentless R1/R2/R3 pattern): Replaces "re-run agent with failure context" with three targeted phases. Phase R1 (LOCALIZE): max_turns 5, READ-ONLY, reviewer in localize mode produces localization.json. Phase R2 (PATCH): max_turns 10, domain agent receives localized context only, generates up to 3 candidate patches. Phase R3 (VALIDATE): automated; runs tests and mutation testing on each candidate and picks the best.
+- **Failure localization script** (`scripts/localize-failure.sh`): Parses failure logs to extract file, function, line_range, suspected_cause, confidence, and failure_category. Outputs `artifacts/localization.json`.
+- **Localize Mode** (`agents/loop-reviewer.md`): New reviewer mode for Phase R1 -- strict 5-turn READ-ONLY process. Produces localization.json only. Code generation is FORBIDDEN.
+- **Goal-oriented build evaluation** (`scripts/score-build.sh`): Evaluates builds across 4 dimensions: spec_coverage (requirements with tests / total, threshold 90%), security_delta (open findings count, threshold 0), mutation_score (from mutation-score.json, threshold 70%), code_health (100 minus weighted penalties, threshold 60%). Outputs `artifacts/evaluation/progress-tracker.yaml`. All thresholds must be met. Failed dimensions route to responsible agents.
+- **Requirements extraction** (`scripts/extract-requirements.sh`): Parses requirements.md and extracts individual requirements into `artifacts/evaluation/spec-requirements.yaml` with id, description, has_test, and priority fields.
+- **Hierarchical cross-layer repo map** (`scripts/build-repo-map.sh`): Generates `artifacts/repo-map.md` with Contract Layer (from abi.json: methods, events, storage slots), Frontend Layer (components, hooks, services, contract calls), Backend Layer (routes, services, contract calls), and Cross-Layer Integrity Checks (missing methods, uncalled methods). Targets a map of under 300 lines.
+- **Autoresearch optimize mode** (`commands/buidl-optimize.md`): New `/buidl-optimize` command for automated metric optimization. Supports gas, bundle_size, test_time, and throughput metrics. Runs a hypothesis-implement-benchmark-keep/revert cycle up to 10 times. Outputs summary.md, best-result.json, and auto-creates a PR with kept changes.
+
+### Changed
+- **Orchestrator Phase 5** (`commands/buidl.md`): Mutation gate runs before reviewer dispatch. If mutation score < 0.70, routes back to contract-dev with survivors. Score-build runs after each review cycle, displaying a compact 4-dimension score table. All thresholds must be met for build completion.
+- **Orchestrator agent failure handling** (`commands/buidl.md`): Agent failures now go through R1/R2/R3 structured repair before falling back to manual options. Localization produces targeted context, domain agents generate candidate patches, and validation picks the best one automatically.
+- **Orchestrator Phase 4** (`commands/buidl.md`): Repo map generated after ABI lock (contract layer only), regenerated after all builders complete (all layers populated).
+- **Orchestrator FAIL routing** (`commands/buidl.md`): Uses R1/R2/R3 structured repair for targeted fixes instead of raw agent re-dispatch with full failure context.
+- **All 12 domain agent files**: Updated Step 0 / knowledge loading to reference `artifacts/repo-map.md` for cross-layer context. Agents: cross-layer-validator, loop-builder, loop-explorer, loop-researcher, loop-reviewer, opnet-auditor, opnet-backend-dev, opnet-contract-dev, opnet-deployer, opnet-e2e-tester, opnet-frontend-dev, opnet-ui-tester.
+- **buidl-status** (`commands/buidl-status.md`): Shows mutation score ("Mutation: 83% (15/18 killed)") and 4-dimension build score card when available. Steps renumbered from 7-10 to 9-12.
+- **loop-reviewer** (`agents/loop-reviewer.md`): Added Localize Mode section after Critique Mode.
+- **Plugin version**: 6.0.0 -> 7.0.0
+
+### Why
+Four gaps identified in the build verification and repair systems. (1) The loop had no way to measure test quality -- tests could pass while missing entire categories of bugs. Mutation testing quantifies test effectiveness by checking whether tests detect deliberate code changes. (2) When agents failed, the entire failure context was re-injected, leading to unfocused repair attempts. Structured R1/R2/R3 phases localize the failure first, then generate targeted patches, then validate them automatically. (3) The reviewer produced a single PASS/FAIL verdict with no multi-dimensional visibility. Goal-oriented evaluation scores across 4 dimensions (spec coverage, security, mutation, code health) with clear thresholds and routing for each. (4) Agents had no shared map of how contract methods connected to frontend calls and backend routes. The hierarchical repo map provides cross-layer visibility, and integrity checks automatically detect missing or extra method calls.
+
 ## [6.0.0] - 2026-03-13
 
 ### Added

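The mutation gate in this changelog entry reduces to a kill ratio compared against a fixed threshold. The sketch below is a hypothetical illustration, not the contents of `scripts/mutate-contract.sh`. `mutate_add_to_sub` and `mutation_verdict` are invented names. The single `sed` operator, which turns the first ` + ` on one line into ` - `, stands in for one of the 20 arithmetic-swap operators, on the assumption that each mutant changes one line of a temp copy.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; not the real scripts/mutate-contract.sh.

# Create one arithmetic-swap mutant: on line $2 of $1, replace the first
# " + " with " - " and write the result to $3. The original file is untouched.
mutate_add_to_sub() {
  local src=$1 line=$2 out=$3
  sed "${line}s/ + / - /" "$src" > "$out"
}

# Mutation score = killed / total. Prints the score (2 decimals) followed by
# PASS if it meets the threshold (0.70 by default, as in the changelog), else FAIL.
mutation_verdict() {
  local killed=$1 total=$2 threshold=${3:-0.70}
  awk -v k="$killed" -v t="$total" -v th="$threshold" 'BEGIN {
    score = (t > 0) ? k / t : 0
    verdict = (score >= th + 0) ? "PASS" : "FAIL"
    printf "%.2f %s\n", score, verdict
  }'
}
```

With the figures `/buidl-status` shows (15 of 18 mutants killed), `mutation_verdict 15 18` prints `0.83 PASS`. With 12 of 18 killed, it prints `0.67 FAIL`, which falls below the gate.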
diff --git a/README.md b/README.md
@@ -69,6 +69,7 @@ alias claudeyproj="claude --dangerously-skip-permissions --plugin-dir /path/to/b
 | `/buidl-clean` | Cancel + remove worktree and branch |
 | `/buidl-trace` | Show agent execution trace timeline for the current session |
 | `/buidl-learning` | Show learning system health report (patterns, scores, profiles) |
+| `/buidl-optimize <metric>` | Optimize gas, bundle_size, test_time, or throughput via automated experimentation |
 
 ### Flags
 
@@ -143,6 +144,21 @@ Agents no longer load the full 2000-line bible regardless of role. `scripts/load
 #### Property-Based Fuzz Testing
 `scripts/fuzz-contract.sh` reads a contract ABI, extracts method signatures and parameter types, and generates boundary test cases: u256 values [0, 1, 2^128, 2^256-1, 2^256-2], address values [zero, contract, caller], bool values [true, false]. Produces all single-param boundary combinations plus 10 random combinations per method. Output goes to `artifacts/testing/fuzz-cases.json` and feeds into both the adversarial auditor and adversarial E2E tester. Does not send transactions.
 
+#### Mutation Testing Gate
+`scripts/mutate-contract.sh` applies 20 mutation operators to contract source and checks whether tests catch each mutation. Mutations include arithmetic swaps (add/sub, mul/div), comparison inversions, boolean flips, revert removal, constant swaps, and event removal. The mutation score (killed/total) must be >= 70% to proceed to review. Surviving mutants are routed back to contract-dev with specific details of what the tests missed.
+
+#### Structured Repair Phases (R1/R2/R3)
+When agents fail, repair follows three targeted phases instead of blindly re-running with full context. R1 (LOCALIZE): the reviewer in localize mode identifies the exact file, function, and line range. R2 (PATCH): the domain agent receives only the localized context and generates up to 3 candidate patches. R3 (VALIDATE): patches are tested and scored automatically, and the best one is applied.
+
+#### Goal-Oriented Build Evaluation
+`scripts/score-build.sh` evaluates builds across 4 dimensions: spec coverage (requirements with tests), security delta (open findings), mutation score, and code health. Each dimension has a threshold and routes to the responsible agent on failure. The compact score table is displayed after every review cycle.
+
+#### Hierarchical Repo Map
+`scripts/build-repo-map.sh` generates a cross-layer map from the ABI, frontend source, and backend source. Shows contract methods with signatures, frontend components and their contract calls, backend routes and their contract calls, and auto-detects missing methods (called but not in ABI) and uncalled methods (in ABI but never referenced).
+
+#### Autoresearch Optimize Mode
+`/buidl-optimize gas` runs an automated optimization loop: hypothesize, implement, benchmark, keep/revert. Supports gas, bundle_size, test_time, and throughput metrics. Default 10 cycles. No test regressions allowed. Produces a summary and auto-creates a PR with kept changes.
+
 #### Dynamic Re-Planning
 When an agent fails after retry, the orchestrator queries `learning/patterns.yaml` for known fix patterns matching the failure category. If a match is found, it presents a 5th option ("Apply known fix: [description]") alongside the standard 4 error-handling options. Lessons from past sessions are applied automatically instead of requiring manual intervention.
 
@@ -301,6 +317,7 @@ The auditor and reviewer check for 27 confirmed vulnerability patterns extracted
         |  checkpoint after each agent + cost log
         v
    Phase 5: REVIEW
+   Mutation gate (>= 70% required) + 4-dim score card
    Reviewer checks PR against spec + 27 patterns
         |  checkpoint
         v
@@ -317,9 +334,9 @@ The auditor and reviewer check for 27 confirmed vulnerability patterns extracted
 ```
 buidl/
 +-- .claude-plugin/
-|   +-- plugin.json              # Plugin manifest (v6.0.0)
+|   +-- plugin.json              # Plugin manifest (v7.0.0)
 +-- agents/                      # 14 agent definitions (incl. adversarial auditor + tester)
-+-- commands/                    # 10 slash commands (incl. buidl-trace, buidl-learning)
++-- commands/                    # 11 slash commands (incl. buidl-optimize)
 +-- hooks/                       # Stop hook + state guards
 |   +-- scripts/
 +-- knowledge/                   # OPNet reference + domain slices
@@ -328,14 +345,14 @@ buidl/
 |   +-- patterns.yaml            # Structured pattern store (auto-updated)
 |   +-- agent-scores.yaml        # Agent performance metrics (auto-updated)
 |   +-- profiles/                # Auto-generated project-type profiles
-+-- scripts/                     # Setup + state writer + learning + routing + tracing + fuzz + knowledge scripts
++-- scripts/                     # Setup + state + learning + routing + tracing + fuzz + mutation + scoring + repo-map scripts
 +-- skills/                      # 3 triggerable skills
 |   +-- audit-from-bugs/
 |   +-- loop-guide/
 |   +-- pua/
 +-- templates/                   # Domain agent, knowledge slice, starter templates
 |   +-- starters/                # Project scaffolds (op20-token, more planned)
-+-- tests/                       # 419 structural + functional + integration tests
++-- tests/                       # 450+ structural + functional + integration tests
 ```
 
 ## Testing
@@ -344,14 +361,17 @@ buidl/
 bash tests/plugin-tests.sh
 ```
 
-434+ tests across 53 categories covering shell syntax, agent structure, FORBIDDEN blocks, knowledge references, issue bus schema, version consistency, state guards, resume logic, learning system, templates, cost tracking, wall-clock timeout, max_turns, integration tests, transaction simulation, Playwright E2E, adaptive learning, cross-layer validation, starter templates, score-based routing, project-type profiles, cross-agent critique, incremental audit, dry-run mode, agent tracing, dynamic re-planning, acceptance test locking, ABI-lock, adversarial auditing, adversarial E2E testing, failure diagnosis, findings ledger, chain probe, hard gate enforcement, and regression tracking.
+450+ tests across 57 categories covering shell syntax, agent structure, FORBIDDEN blocks, knowledge references, issue bus schema, version consistency, state guards, resume logic, learning system, templates, cost tracking, wall-clock timeout, max_turns, integration tests, transaction simulation, Playwright E2E, adaptive learning, cross-layer validation, starter templates, score-based routing, project-type profiles, cross-agent critique, incremental audit, dry-run mode, agent tracing, dynamic re-planning, acceptance test locking, ABI-lock, adversarial auditing, adversarial E2E testing, failure diagnosis, findings ledger, chain probe, hard gate enforcement, regression tracking, mutation testing, structured repair phases, goal-oriented evaluation, repo map, and autoresearch optimize.
 
 Tests run automatically on every push and PR via GitHub Actions.
 
 ---
 
 ## Version History
 
+### v7.0.0 — Mutation + Repair + Scoring (2026-03-13)
+Four verification and repair improvements: **Mutation testing gate** applies 20 operators to contract source, requiring >= 70% kill rate before review. **Structured repair phases** (R1/R2/R3) localize failures, generate targeted patches, and validate automatically. **Goal-oriented build evaluation** scores builds across 4 dimensions (spec coverage, security, mutation, code health) with routing for each failed dimension. **Hierarchical repo map** provides cross-layer visibility from ABI to frontend/backend calls. Plus **autoresearch optimize mode** for automated metric improvement.
+
 ### v6.0.0 — Dynamic Knowledge (2026-03-13)
 Three knowledge and learning system improvements: **Dynamic knowledge slice loading** assembles role-specific knowledge payloads per agent, filtering the 2000-line bible to only role-relevant sections and keeping payloads under 400 lines. **Property-based fuzz case generation** creates structured boundary test cases from ABI signatures for adversarial auditing. **Stale pattern pruning** with version-based staleness tracking and a `/buidl-learning` health report.
 

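The README describes a rule for `scripts/score-build.sh`: every threshold must be met. That rule can be sketched as a single gate over the four dimensions. This is a hypothetical illustration, and `score_gate` is an invented name. It assumes each dimension has already been computed as an integer percentage, except security_delta, which is a count of open findings. It only names the failed dimensions and does not route them to agents.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; not the real scripts/score-build.sh.
# Args: spec_coverage% open_findings mutation_score% code_health%
score_gate() {
  local spec=$1 findings=$2 mutation=$3 health=$4
  local failed=()
  (( spec >= 90 ))     || failed+=("spec_coverage")
  (( findings <= 0 ))  || failed+=("security_delta")
  (( mutation >= 70 )) || failed+=("mutation_score")
  (( health >= 60 ))   || failed+=("code_health")
  # The build passes only if every threshold holds; otherwise list the failures.
  if (( ${#failed[@]} == 0 )); then
    echo "PASS"
  else
    echo "FAIL ${failed[*]}"
  fi
}
```

For example, `score_gate 80 2 83 75` prints `FAIL spec_coverage security_delta`. Those two dimensions would then be routed to their responsible agents.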
diff --git a/agents/cross-layer-validator.md b/agents/cross-layer-validator.md
@@ -47,6 +47,7 @@ You are the **Cross-Layer Validator** agent. You check integration correctness a
 Before any validation:
 1. Load your knowledge payload via `bash ${CLAUDE_PLUGIN_ROOT}/scripts/load-knowledge.sh cross-layer-validator <project-type>` — this assembles your domain slice (cross-layer-validation.md), troubleshooting guide, and learned patterns.
 2. If you encounter issues, check [knowledge/opnet-troubleshooting.md](knowledge/opnet-troubleshooting.md).
+3. If `artifacts/repo-map.md` exists, read it for cross-layer method mapping and integrity checks.
 
 ## Process
 

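The validator now reads two integrity checks from `artifacts/repo-map.md`. Missing methods are called but absent from the ABI. Uncalled methods are in the ABI but never referenced. Both are set differences. The minimal sketch below assumes the method names have already been extracted into one-name-per-line files. `missing_methods` and `uncalled_methods` are hypothetical helpers, not part of `scripts/build-repo-map.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the repo-map integrity checks.
# $1: ABI method names; $2: method names called by frontend/backend code
# (one per line). LC_ALL=C keeps sort and comm on the same collation order.

# Called somewhere but absent from the ABI.
missing_methods() {
  LC_ALL=C comm -13 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2")
}

# Declared in the ABI but never referenced.
uncalled_methods() {
  LC_ALL=C comm -23 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2")
}
```

In a real map, the inputs would come from `abi.json` and from scanning the frontend and backend sources. How the script actually extracts those names is not shown here.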
diff --git a/agents/loop-builder.md b/agents/loop-builder.md
@@ -153,6 +153,8 @@ Check in this order:
 - Use `deriveOPWallet()` not `derive()` for OPWallet-compatible keys
 - `Buffer` is gone — use `BufferHelper` from `@btc-vision/transaction`
 
+4. If `artifacts/repo-map.md` exists, read it for cross-layer context (contract methods, frontend components, backend routes, integrity checks).
+
 ---
 
 ## Step 0.5: Load PUA Methodology (MANDATORY)

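The changelog's R3 (VALIDATE) phase tests each of up to 3 candidate patches and keeps the best one. It does not spell out the selection criterion, so this sketch assumes one. Candidates whose tests fail are discarded, and the highest mutation score wins among the rest, with ties going to the earlier candidate. `pick_best_patch` and its `id status score` input format are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of R3 candidate selection (criterion is an assumption).
# Input lines: <candidate_id> <pass|fail> <mutation_score>
# Prints the winning id, or prints "none" and exits 1 if no candidate passed.
pick_best_patch() {
  awk '$2 == "pass" && (!found || $3 > best) { best = $3; id = $1; found = 1 }
       END { if (found) print id; else { print "none"; exit 1 } }' "$@"
}
```

Given `p1 pass 0.75`, `p2 fail 0.90` and `p3 pass 0.83`, this picks `p3`. `p2` has the higher score, but its failing tests rule it out.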
diff --git a/agents/loop-explorer.md b/agents/loop-explorer.md
@@ -48,6 +48,7 @@ Before starting analysis, check if this is an OPNet project:
 1. Read `package.json` — look for `@btc-vision/*` or `opnet` in dependencies.
 2. Check for `asconfig.json` (contract project), `vite.config.ts` (frontend), or `@btc-vision/hyper-express` (backend).
 3. If OPNet detected: load knowledge via `bash ${CLAUDE_PLUGIN_ROOT}/scripts/load-knowledge.sh loop-explorer <project-type>` — this assembles the project-setup.md slice, troubleshooting guide, and learned patterns. This informs what patterns to look for.
+4. If `artifacts/repo-map.md` exists, read it for cross-layer context (contract methods, frontend components, backend routes, integrity checks).
 
 ## Process
 

diff --git a/agents/loop-researcher.md b/agents/loop-researcher.md
@@ -36,6 +36,8 @@ Before searching, read the feature description you were given carefully. Identif
 
 For Bitcoin/OPNet projects: prioritize searching the OPNet ecosystem first — btc-vision GitHub repos (github.com/btc-vision/*), OPNet docs, and existing OPNet dApps. Most OPNet patterns already have reference implementations (MotoSwap for DEX, NativeSwap for BTC-token swaps, etc.).
 
+If `artifacts/repo-map.md` exists, read it for cross-layer context (contract methods, frontend components, backend routes, integrity checks).
+
 ## Process
 
 ### Step 1: Search for Existing Solutions