ulab-uiuc · ConsultingFuture4200 · Jun 15, 2026
diff --git a/custom_routers/fusion_gate/.gitignore b/custom_routers/fusion_gate/.gitignore
@@ -0,0 +1,15 @@
+# Compiled Python artifacts must not be tracked. These are build output, not
+# source, and were committed by mistake. To purge ones already tracked:
+#   git rm -r --cached custom_routers/fusion_gate/**/__pycache__
+__pycache__/
+*.pyc
+*.pyo
+
+# Eval harness runtime output. The harness writes results.csv / results.md here
+# on every run; this is build output, not source, and must never be tracked. The
+# committed, intentional report lives at eval/RESULTS.md instead.
+eval/out/
+
+# The repo root .gitignore ignores *.jsonl globally. Re-include the committed
+# eval fixtures, which are source (the offline --mock harness depends on them).
+!eval/fixtures/*.jsonl
diff --git a/custom_routers/fusion_gate/PR_BODY.md b/custom_routers/fusion_gate/PR_BODY.md
@@ -0,0 +1,98 @@
+# Add FusionGateRouter — a route-vs-fuse meta-router
+
+## Summary
+
+Adds `FusionGateRouter`, a self-contained custom router plugin under
+`custom_routers/fusion_gate/` that gates each query between the cheap
+single-model path and a multi-model **fusion** path, with fusion delegated to
+OpenRouter's `openrouter:fusion` server tool. **Zero edits to core `llmrouter/`
+code** — the plugin is auto-discovered via the existing `custom_routers/`
+mechanism, exactly like `randomrouter` and `thresholdrouter`.
+
+## Motivation
+
+LLMRouter today picks *which single model* answers a query. The interesting
+lever for hard queries is a different one: **route vs. fuse** — decide whether a
+query is worth running a panel of models and synthesizing their answers. This PR
+makes route-vs-fuse the **primary per-query dial**, expressed as a three-tier
+escalation driven by estimated difficulty:
+
+```
+single  ->  budget_fusion (cheap panel)  ->  fusion (full Quality panel)
+```
+
+Cheap queries stay cheap; only the hard ones escalate, and the middle tier lets
+mid-difficulty queries fuse on a budget panel instead of jumping straight to the
+full Quality panel.
+
+## What's included
+
+**In scope:**
+- `FusionGateRouter` — the route-vs-fuse gate (difficulty + confidence) plus capability-scored panel selection with a Quality/Budget preset fallback.
+- An `openrouter:fusion` adapter (`executor.py`): the single, isolated point of contact with the beta server-tool API.
+- A configurable surface (`threshold`, `budget_threshold`, `k`, `judge`, `provider`/`base_url`, `panel_preset`, `cost_ceiling`, `est_completion_tokens`, `log_sink_path`) and a `--route-only` spend-free preview that returns the decision + intended panel/judge without any API call.
+- A per-query **dollar** cost guard (`cost_ceiling`) that downgrades fusion → single when the projected spend exceeds the cap.
+- Secret-scrubbed fusion-call logging (`fusion_log.py`) producing FusionFactory-style `(query, model, response, performance)` training rows.
+- A three-arm offline eval harness + bundled fixtures (`eval/`) and an offline retrain step.
+- Self-contained: **ONE optional provider** (OpenRouter), **ZERO core edits**.
+
+**Out of scope (follow-ups):**
+- **Local fan-out fallback is OUT of this PR.** Without an OpenRouter key only `--route-only` is exercisable. The executor interface is the seam a provider-agnostic local fan-out path would slot behind later — happy to add it if maintainers want it.
+- A learned gate (the gate currently uses a duck-typed difficulty estimator with a deterministic lexical fallback so it runs with no trained model).
+
+## Eval results
+
+> **All committed numbers are from MOCK fixtures** (deterministic stub executor,
+> zero spend, no network). They validate harness wiring and metric math, **not**
+> real model quality. **Real numbers require a keyed live run**
+> (`OPENROUTER_API_KEY` / `API_KEYS` set) against a real benchmark slice — that
+> path is documented but intentionally not wired into the offline harness so a
+> stray run cannot spend. See `eval/RESULTS.md`.
+
+Dataset: 16 held-out queries (6 easy + 10 hard; GSM8K / MATH / GPQA / MBPP).
+Quality / blended cost / escalation `p` are over the full 16-query dataset; **gate
+precision is computed over the same fixed 10-query hard slice for every arm** so the
+arms are comparable (`always_route` makes no escalation decision → N/A). Slice
+definitions are documented in `eval/RESULTS.md`. Blended cost is an estimated
+**per-query dollar** amount.
+
+| Arm | n | Quality | Blended cost ($/query) | Escalation p | Gate-precision (hard slice) |
+|-----|---|---------|------------------------|--------------|------------------------------|
+| always_route | 16 | 0.3750 | 0.000650 | 0.0000 | n/a |
+| always_fuse | 16 | 1.0000 | 0.001137 | 1.0000 | 1.0000 |
+| fusion_gate | 16 | 1.0000 | 0.000767 | 0.6250 | 1.0000 |
+
+- **Quality target** — gate ≥ 95% of always-fuse quality: 1.0000 vs target 0.9500 → **PASS** (mock).
+- **Cost target** — blended cost ≤ 1.6× always-route: ratio 1.18 → **PASS** (mock).
+- **Gate precision** — escalated answers beating best single, over the hard slice: fusion_gate 10/10, always_fuse 10/10 → **measured** (mock).
+- **Retrain delta** — offline log→retrain holds gate-precision at 1.0000 (threshold refit 0.400 → 0.520, budget_threshold 0.100 → 0.180). **Real delta pending a keyed live run.**
+
+## FusionFactory & continual learning
+
+Each fusion call yields a panel of per-model responses plus a judge synthesis —
+exactly the `(query, model, response, performance)` observations FusionFactory
+needs. `fusion_log.to_training_rows` decomposes them into rows shaped for
+`llmrouter/data/api_calling_evaluation.py`, and the retrain step replays the
+logged sink to refit the gate thresholds offline. This directly serves the
+repo's **continual-learning TODO**: the router's own fusion traffic becomes the
+training signal that sharpens the route-vs-fuse gate over time, with no separate
+labeling pass required.
+
+## Beta server-tool caveat
+
+`openrouter:fusion` is an OpenRouter **BETA** server tool; its request/response
+shape may change. All OpenRouter HTTP specifics are confined to `executor.py`
+(request body, tool type, key resolution, transport, payload parsing), so an
+upstream beta change touches one file. The executor degrades gracefully on judge
+failure (synthesizes from panel responses). No API keys, auth headers, or raw
+provider payloads are ever logged.
+
+## Testing
+
+Torch-free, fully offline (HTTP mocked):
+
+```bash
+pytest custom_routers/fusion_gate/tests/
+python -m custom_routers.fusion_gate.eval.eval_harness --mock --with-retrain \
+  --out custom_routers/fusion_gate/eval/out
+```
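The PASS lines under "Eval results" in PR_BODY.md follow from two ratios over the table rows. A minimal sketch of that arithmetic, using only the mock numbers from the table; `ARMS` and `check_targets` are illustrative names and are not taken from the eval harness:

```python
# Mock numbers copied from the PR_BODY.md eval table; they are not live results.
ARMS = {
    "always_route": {"quality": 0.3750, "cost": 0.000650},
    "always_fuse": {"quality": 1.0000, "cost": 0.001137},
    "fusion_gate": {"quality": 1.0000, "cost": 0.000767},
}


def check_targets(arms, quality_floor=0.95, cost_cap=1.6):
    """Recompute the quality and cost targets from per-arm aggregates."""
    gate = arms["fusion_gate"]
    # Quality target: gate quality must reach quality_floor x always_fuse quality.
    quality_target = quality_floor * arms["always_fuse"]["quality"]
    # Cost target: gate blended cost relative to always_route blended cost.
    cost_ratio = gate["cost"] / arms["always_route"]["cost"]
    return {
        "quality_target": quality_target,
        "quality_pass": gate["quality"] >= quality_target,
        "cost_ratio": cost_ratio,
        "cost_pass": cost_ratio <= cost_cap,
    }


if __name__ == "__main__":
    report = check_targets(ARMS)
    print(f"quality target {report['quality_target']:.4f} pass={report['quality_pass']}")
    print(f"cost ratio {report['cost_ratio']:.2f} pass={report['cost_pass']}")
```

With the mock rows, the quality target comes out at 0.95 and the cost ratio at about 1.18, matching the figures quoted in the bullets.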
diff --git a/custom_routers/fusion_gate/README.md b/custom_routers/fusion_gate/README.md
@@ -0,0 +1,176 @@
+# FusionGateRouter
+
+**Type:** Meta-router (route-vs-fuse gate). No training required to run; an optional offline retrain step refits the gate from logged fusion calls.
+
+**Description:** A per-query gate that decides between the cheap **single-model**
+path (classic LLMRouter routing) and a **fusion** path that runs a panel of
+models and synthesizes their answers. Fusion is delegated to the OpenRouter
+`openrouter:fusion` server tool (BETA — see the caveat below). Routing is
+spend-free: the decision is computed locally and only `fuse()` ever calls the
+provider.
+
+The primary per-query dial is **route vs. fuse**, expressed as three tiers:
+
+```
+difficulty < budget_threshold               ->  single         (cheapest single model)
+budget_threshold <= difficulty < threshold  ->  budget_fusion  (cheap Budget panel)
+difficulty >= threshold                     ->  fusion         (full Quality panel)
+```
+
+Set `budget_threshold: null` (or `>= threshold`) to disable the middle tier and
+collapse to plain single/fusion. A `high_stakes: true` flag on a query forces
+the full Quality `fusion` tier regardless of difficulty.
+
+## Usage
+
+```bash
+# Inference (routes, then fuses via openrouter:fusion if the gate escalates)
+llmrouter infer --router fusion_gate \
+  --config custom_routers/fusion_gate/config.yaml \
+  --query "Prove that the square root of 2 is irrational."
+
+# Route-only — compute the decision with ZERO spend / no network call
+llmrouter infer --router fusion_gate \
+  --config custom_routers/fusion_gate/config.yaml \
+  --query "What is the capital of France?" \
+  --route-only
+```
+
+`--route-only` returns the decision dict (tier, panel, judge, projected cost)
+without ever calling OpenRouter. Spend happens only when `fuse()` is invoked.
+
+## Decision contract
+
+`route_single` returns one of two shapes (both carry `strategy`, `tier`, and
+`model_name` for drop-in CLI compatibility):
+
+- **single:** `{query, strategy="single", tier="single", model_name, predicted_llm, difficulty, confidence}`
+- **fusion:** `{query, strategy="fusion", tier="budget_fusion"|"fusion", panel[], judge, model_name, predicted_llm, difficulty, confidence, projected_cost}`
+
+When the cost guard fires, a fusion decision is **downgraded** to single and the
+result carries `downgraded_from`, `projected_cost`, and `cost_ceiling`.
+
+## Configuration
+
+All keys live under `hparam:` in `config.yaml` unless noted.
+
+| Key | Default | Purpose |
+|-----|---------|---------|
+| `threshold` | `0.5` | Difficulty cutoff to escalate to the full Quality `fusion` tier. |
+| `budget_threshold` | `0.3` | Lower boundary of the middle `budget_fusion` tier. `null` (or `>= threshold`) disables it. |
+| `k` | `3` | Panel size — maps to the tool's `analysis_models`. |
+| `judge` | `null` | Judge model slug — maps to the tool's `model`. `null` = use the outer model. |
+| `panel_preset` | `Quality` | Fallback preset (`Quality` / `Budget`) when capability data is unavailable for a query. |
+| `cost_ceiling` | `null` | Hard per-query **dollar** cap on the projected `Σ(panel)+judge` cost. `null` = off. See the cost-unit note. |
+| `est_completion_tokens` | `512` | Per-completion output-token estimate feeding the dollar cost projection. |
+| `provider` | `OpenRouter` | Informational; drives credential resolution. |
+| `base_url` | `https://openrouter.ai/api/v1` | OpenRouter endpoint hosting the beta server tool. Overrides the top-level `api_endpoint`. |
+| `log_sink_path` | `null` | JSONL sink for fusion-call logging. `null` = `fusion_log` default (`~/.llmrouter/openclaw_memory.jsonl`). |
+
+Top-level `data_path` / `metric` keys mirror the other custom routers
+(`randomrouter`, `thresholdrouter`); see `config.yaml` for the loaded candidate
+and routing-data paths.
+
+### Cost-unit note (important)
+
+`cost_ceiling` is compared against `project_cost`, which estimates the **per-query
+dollar cost** of the panel + judge. Each member (panel model or judge) contributes
+`(input_price · prompt_tokens + output_price · completion_tokens) / 1e6`, where
+`input_price` / `output_price` are the per-million-token prices from `llm_data`,
+`prompt_tokens ≈ len(query) // 4`, and `completion_tokens = est_completion_tokens`
+(default `512`); the projection is the sum over members. Set `cost_ceiling` in
+**dollars per query** (e.g. `0.05` = five cents per query).
+
+## Panel selection
+
+Panels are chosen by `CapabilityScorer`, which scores candidates per **query
+category** (code / math / reasoning / general) from the LLMRouter routing-data
+tables, lightly cost-penalized. When no usable capability data exists for a
+query's category, selection falls back to a preset panel resolved by tier:
+`budget_fusion` -> `Budget`, anything else -> the configured `panel_preset`
+(`Quality` by default). The tier->preset mapping (`gate.resolve_preset`) is the
+single source of truth shared with the eval harness.
+
+## OpenRouter `openrouter:fusion` — BETA caveat
+
+The fusion path depends on OpenRouter's `openrouter:fusion` **server tool, which
+is BETA**: its request/response shape may change without notice. To contain that
+risk, **every OpenRouter HTTP specific lives in `executor.py` and nowhere else**
+— request body construction, the `openrouter:fusion` tool type, key resolution,
+transport, and payload parsing. An upstream beta change should touch that one
+file only. The executor also tolerates judge failure (status `ok` with
+`analysis` omitted): it synthesizes the answer from the panel responses rather
+than crashing.
+
+OpenRouter is the **one optional provider**. There is no local fan-out fallback
+(deferred to a follow-up); without a key, only `--route-only` is exercisable.
+
+## Logging
+
+Every `fuse()` call is appended (best-effort, append-only) to the JSONL sink via
+`fusion_log.log_fusion`. The sink is **secret-scrubbed**: API keys, auth
+headers, cookies, and the untouched provider payload are never written; only an
+enumerated set of fields (query, panel, judge, normalized responses, analysis,
+token/cost) is emitted. These rows are the FusionFactory-style training signal
+consumed by the offline retrain step.
+
+## Offline evaluation (`--mock`, zero spend)
+
+The three-arm harness compares `always_route`, `always_fuse`, and `fusion_gate`
+over a bundled 16-query dataset (6 easy + 10 hard; GSM8K / MATH / GPQA / MBPP). It is
+**offline by default**: a deterministic stub executor reads canned answers from
+fixtures; no network call is made and nothing is spent.
+
+```bash
+# Run the offline harness (mock is the default)
+python -m custom_routers.fusion_gate.eval.eval_harness --mock \
+  --out custom_routers/fusion_gate/eval/out
+
+# Include the mock retrain (M3 before/after) delta in results.md
+python -m custom_routers.fusion_gate.eval.eval_harness --mock --with-retrain \
+  --out custom_routers/fusion_gate/eval/out
+```
+
+Tunable flags: `--threshold` (0.5), `--budget-threshold` (0.3), `--k` (2 in the
+harness — kept cost-bounded for the M2 target; the plugin config uses `k=3`),
+`--judge`, `--panel-preset`, `--dataset`, `--llm`, `--routing`, `--out`.
+Outputs: `<out>/results.csv` and `<out>/results.md` (the `--out` dir defaults to
+`eval/out/`, which is **gitignored** — runtime output, not source). The committed,
+intentional report lives at [`eval/RESULTS.md`](eval/RESULTS.md), which also documents
+the full-dataset vs hard-slice definitions used by the metrics.
+
+`--live` is intentionally **not** wired into this harness, so a stray run cannot
+spend; passing it errors out with a pointer to the keyed live-run path.
+
+Run the unit tests (torch-free, fully offline, HTTP mocked):
+
+```bash
+pytest custom_routers/fusion_gate/tests/
+```
+
+## Live run (keyed, real spend)
+
+The committed eval numbers are from MOCK fixtures. To produce real M1–M4 numbers
+you must run keyed against real models:
+
+```bash
+# Provide an OpenRouter key (never commit it):
+export OPENROUTER_API_KEY=sk-...           # or: export API_KEYS='{"OpenRouter": "sk-..."}'
+
+# Then build the real FusionGateRouter from config.yaml and route+fuse a real
+# benchmark slice; the executor makes the openrouter:fusion calls. The offline
+# eval harness does NOT make live calls by design — see eval/RESULTS.md.
+```
+
+Keys are resolved (in order) from an explicit `api_keys={"OpenRouter": "..."}`
+dict, then `OPENROUTER_API_KEY`, then an `API_KEYS` JSON env var. Keys are never logged.
+
+## Files
+
+- `router.py` — `FusionGateRouter` entry point (MetaRouter contract).
+- `gate.py` — `RouteGate`, `GateDecision`, the three-tier dial, `resolve_preset`.
+- `capability.py` — `CapabilityScorer` panel selection.
+- `executor.py` — **the only** OpenRouter `openrouter:fusion` blast point.
+- `fusion_log.py` — secret-scrubbed JSONL logging + training-row decomposition.
+- `eval/` — three-arm offline harness, fixtures, retrain, and `RESULTS.md` (the committed report; `eval/out/` is gitignored runtime output).
+- `tests/` — torch-free offline unit tests.
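The three-tier dial, the cost projection from the cost-unit note, and the cost guard described in the README can be sketched together as follows. This is a sketch under stated assumptions, not the code in `gate.py` or `router.py`: the function names `select_tier` and `apply_cost_guard`, the member dict shape, and the prices in the usage note are hypothetical; only the thresholds, the formula, and the result field names come from the README.

```python
from typing import Optional


def select_tier(difficulty: float, threshold: float = 0.5,
                budget_threshold: Optional[float] = 0.3,
                high_stakes: bool = False) -> str:
    """Map a difficulty estimate to one of the three tiers."""
    # high_stakes forces the full Quality panel regardless of difficulty.
    if high_stakes or difficulty >= threshold:
        return "fusion"
    # A None budget_threshold (or one >= threshold) leaves the middle tier empty.
    if budget_threshold is not None and difficulty >= budget_threshold:
        return "budget_fusion"
    return "single"


def project_cost(query: str, members: list, est_completion_tokens: int = 512) -> float:
    """Estimate per-query dollars; prices are per million tokens (assumed dict keys)."""
    prompt_tokens = len(query) // 4
    return sum(
        (m["input_price"] * prompt_tokens + m["output_price"] * est_completion_tokens) / 1e6
        for m in members
    )


def apply_cost_guard(tier: str, projected_cost: float,
                     cost_ceiling: Optional[float]) -> dict:
    """Downgrade a fusion tier to single when the projection exceeds the cap."""
    if tier != "single" and cost_ceiling is not None and projected_cost > cost_ceiling:
        return {"strategy": "single", "tier": "single", "downgraded_from": tier,
                "projected_cost": projected_cost, "cost_ceiling": cost_ceiling}
    result = {"strategy": "single" if tier == "single" else "fusion", "tier": tier}
    if tier != "single":
        result["projected_cost"] = projected_cost
    return result
```

For example, with a 400-character query, two hypothetical panel members priced at 1.0 / 2.0 dollars per million input / output tokens, and a judge at 3.0 / 15.0, the projection is about 0.0102 dollars, so a `cost_ceiling` of `0.005` would downgrade the decision to `single`.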
diff --git a/custom_routers/fusion_gate/__init__.py b/custom_routers/fusion_gate/__init__.py
@@ -0,0 +1,34 @@
+"""fusion_gate — route-vs-fuse meta-router plugin for LLMRouter.
+
+Auto-discovered from ./custom_routers/ . See router.py for the entry point.
+
+``FusionGateRouter`` is imported LAZILY (PEP 562 ``__getattr__``) rather than
+eagerly: ``router.py`` pulls in torch (MetaRouter subclasses ``nn.Module``), and
+an eager import here would force torch to load whenever this package is merely
+*resolved* — which pytest does for every test module under ``tests/`` while
+walking the package hierarchy. That made the four torch-free test modules
+uncollectable under the standard ``pytest custom_routers/fusion_gate/tests/``
+invocation (ModuleNotFoundError: No module named 'torch'). Deferring the import
+to first attribute access keeps package resolution torch-free while still
+exposing ``FusionGateRouter`` as a top-level name when it is actually used.
+"""
+
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:  # import for type-checkers only; not executed at runtime
+    from .router import FusionGateRouter
+
+__all__ = ["FusionGateRouter"]
+
+
+def __getattr__(name: str) -> Any:
+    """Lazily import ``FusionGateRouter`` on first access (PEP 562).
+
+    torch (a transitive dependency of ``router.py``) is loaded only when the
+    router is actually requested, not at package-collection time.
+    """
+    if name == "FusionGateRouter":
+        from .router import FusionGateRouter
+
+        return FusionGateRouter
+    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
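The lazy-import behavior in `__init__.py` can be observed without torch or the real package. The sketch below builds an in-memory module with a PEP 562 `__getattr__`; `demo_pkg`, `HeavyRouter`, and `build_lazy_package` are hypothetical stand-ins, and the caching step is an optional variation that the PR's `__init__.py` does not include.

```python
import types


def build_lazy_package():
    """Build an in-memory module whose ``HeavyRouter`` attribute resolves lazily."""
    pkg = types.ModuleType("demo_pkg")
    load_log = []  # records each time the "expensive" import path runs

    def _lazy_getattr(name):
        # Called only when normal attribute lookup on the module fails (PEP 562).
        if name == "HeavyRouter":
            load_log.append(name)  # stand-in for `from .router import FusionGateRouter`
            heavy = type("HeavyRouter", (), {})
            # Optional caching (not in the PR): later lookups skip __getattr__.
            setattr(pkg, name, heavy)
            return heavy
        raise AttributeError(f"module {pkg.__name__!r} has no attribute {name!r}")

    # A function stored under "__getattr__" in the module dict acts as the hook.
    pkg.__getattr__ = _lazy_getattr
    return pkg, load_log


if __name__ == "__main__":
    pkg, load_log = build_lazy_package()
    print("loads before access:", len(load_log))
    pkg.HeavyRouter
    pkg.HeavyRouter
    print("loads after two accesses:", len(load_log))
```

Creating the module runs nothing expensive; the stand-in import fires on the first attribute access only, which is the property that keeps test collection torch-free.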