15 changes: 15 additions & 0 deletions custom_routers/fusion_gate/.gitignore
@@ -0,0 +1,15 @@
# Compiled Python artifacts must not be tracked. These are build output, not
# source, and were committed by mistake. To purge ones already tracked:
# git rm -r --cached custom_routers/fusion_gate/**/__pycache__
__pycache__/
*.pyc
*.pyo

# Eval harness runtime output. The harness writes results.csv / results.md here
# on every run; this is build output, not source, and must never be tracked. The
# committed, intentional report lives at eval/RESULTS.md instead.
eval/out/

# The repo root .gitignore ignores *.jsonl globally. Re-include the committed
# eval fixtures, which are source (the offline --mock harness depends on them).
!eval/fixtures/*.jsonl
98 changes: 98 additions & 0 deletions custom_routers/fusion_gate/PR_BODY.md
@@ -0,0 +1,98 @@
# Add FusionGateRouter — a route-vs-fuse meta-router

## Summary

Adds `FusionGateRouter`, a self-contained custom router plugin under
`custom_routers/fusion_gate/` that gates each query between the cheap
single-model path and a multi-model **fusion** path, with fusion delegated to
OpenRouter's `openrouter:fusion` server tool. **Zero edits to core `llmrouter/`
code** — the plugin is auto-discovered via the existing `custom_routers/`
mechanism, exactly like `randomrouter` and `thresholdrouter`.

## Motivation

LLMRouter today picks *which single model* answers a query. The interesting
lever for hard queries is a different one: **route vs. fuse** — decide whether a
query is worth running a panel of models and synthesizing their answers. This PR
makes route-vs-fuse the **primary per-query dial**, expressed as a three-tier
escalation driven by estimated difficulty:

```
single -> budget_fusion (cheap panel) -> fusion (full Quality panel)
```

Cheap queries stay cheap; only the hard ones escalate, and the middle tier lets
mid-difficulty queries fuse on a budget panel instead of jumping straight to the
full Quality panel.

## What's included

**In scope:**
- `FusionGateRouter` — the route-vs-fuse gate (difficulty + confidence) plus capability-scored panel selection with a Quality/Budget preset fallback.
- An `openrouter:fusion` adapter (`executor.py`) — the single, isolated blast point for the beta server-tool API.
- A configurable surface (`threshold`, `k`, `judge`, `provider`/`base_url`, `panel_preset`, `cost_ceiling`, `est_completion_tokens`) and a `--route-only` spend-free preview that returns the decision + intended panel/judge without any API call.
- A per-query **dollar** cost guard (`cost_ceiling`) that downgrades fusion → single when the projected spend exceeds the cap.
- Secret-scrubbed fusion-call logging (`fusion_log.py`) producing FusionFactory-style `(query, model, response, performance)` training rows.
- A three-arm offline eval harness + bundled fixtures (`eval/`) and an offline retrain step.
- Self-contained: **ONE optional provider** (OpenRouter), **ZERO core edits**.

**Out of scope (follow-ups):**
- **Local fan-out fallback is OUT of this PR.** Without an OpenRouter key only `--route-only` is exercisable. The executor interface is the seam a provider-agnostic local fan-out path would slot behind later — happy to add it if maintainers want it.
- A learned gate (the gate currently uses a duck-typed difficulty estimator with a deterministic lexical fallback so it runs with no trained model).

## Eval results

> **All committed numbers are from MOCK fixtures** (deterministic stub executor,
> zero spend, no network). They validate harness wiring and metric math, **not**
> real model quality. **Real numbers require a keyed live run**
> (`OPENROUTER_API_KEY` / `API_KEYS` set) against a real benchmark slice — that
> path is documented but intentionally not wired into the offline harness so a
> stray run cannot spend. See `eval/RESULTS.md`.

Dataset: 16 held-out queries (6 easy + 10 hard; GSM8K / MATH / GPQA / MBPP).
Quality / blended cost / escalation `p` are over the full 16-query dataset; **gate
precision is computed over the same fixed 10-query hard slice for every arm** so the
arms are comparable (`always_route` makes no escalation decision → N/A). Slice
definitions are documented in `eval/RESULTS.md`. Blended cost is an estimated
**per-query dollar** amount.

| Arm | n | Quality | Blended cost ($/query) | Escalation p | Gate-precision (hard slice) |
|-----|---|---------|------------------------|--------------|------------------------------|
| always_route | 16 | 0.3750 | 0.000650 | 0.0000 | n/a |
| always_fuse | 16 | 1.0000 | 0.001137 | 1.0000 | 1.0000 |
| fusion_gate | 16 | 1.0000 | 0.000767 | 0.6250 | 1.0000 |

- **Quality target** — gate ≥ 95% of always-fuse quality: 1.0000 vs target 0.9500 → **PASS** (mock).
- **Cost target** — blended cost ≤ 1.6× always-route: ratio 1.18 → **PASS** (mock).
- **Gate precision** — escalated answers beating best single, over the hard slice: fusion_gate 10/10, always_fuse 10/10 → **measured** (mock).
- **Retrain delta** — offline log→retrain holds gate-precision at 1.0000 (threshold refit 0.400 → 0.520, budget_threshold 0.100 → 0.180). **Real delta pending a keyed live run.**
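
The quality and cost targets above reduce to simple arithmetic on the table. The sketch below reproduces that check using only the mock numbers from the table; the helper name `check_targets` and its parameters are illustrative and are not claimed to match the harness code.

```python
# Mock numbers copied from the results table above (not real model quality).
always_route = {"quality": 0.3750, "cost": 0.000650}
always_fuse = {"quality": 1.0000, "cost": 0.001137}
fusion_gate = {"quality": 1.0000, "cost": 0.000767}


def check_targets(gate, route, fuse, quality_frac=0.95, cost_mult=1.6):
    """Evaluate the two pass/fail targets for the gated arm.

    Hypothetical helper: quality must reach `quality_frac` of always-fuse,
    and blended cost must stay within `cost_mult` times always-route.
    """
    quality_target = quality_frac * fuse["quality"]
    cost_ratio = gate["cost"] / route["cost"]
    return {
        "quality_target": quality_target,
        "quality_pass": gate["quality"] >= quality_target,
        "cost_ratio": round(cost_ratio, 2),
        "cost_pass": cost_ratio <= cost_mult,
    }


report = check_targets(fusion_gate, always_route, always_fuse)
print(report)
```

The same ratio test applies unchanged to numbers from a keyed live run once those exist.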

## FusionFactory & continual learning

Each fusion call yields a panel of per-model responses plus a judge synthesis —
exactly the `(query, model, response, performance)` observations FusionFactory
needs. `fusion_log.to_training_rows` decomposes them into rows shaped for
`llmrouter/data/api_calling_evaluation.py`, and the retrain step replays the
logged sink to refit the gate thresholds offline. This directly serves the
repo's **continual-learning TODO**: the router's own fusion traffic becomes the
training signal that sharpens the route-vs-fuse gate over time, with no separate
labeling pass required.

## Beta server-tool caveat

`openrouter:fusion` is an OpenRouter **BETA** server tool; its request/response
shape may change. All OpenRouter HTTP specifics are confined to `executor.py`
(request body, tool type, key resolution, transport, payload parsing), so an
upstream beta change touches one file. The executor degrades gracefully on judge
failure (synthesizes from panel responses). No API keys, auth headers, or raw
provider payloads are ever logged.

## Testing

Torch-free, fully offline (HTTP mocked):

```bash
pytest custom_routers/fusion_gate/tests/
python -m custom_routers.fusion_gate.eval.eval_harness --mock --with-retrain \
--out custom_routers/fusion_gate/eval/out
```
176 changes: 176 additions & 0 deletions custom_routers/fusion_gate/README.md
@@ -0,0 +1,176 @@
# FusionGateRouter

**Type:** Meta-router (route-vs-fuse gate). No training required to run; an optional offline retrain step refits the gate from logged fusion calls.

**Description:** A per-query gate that decides between the cheap **single-model**
path (classic LLMRouter routing) and a **fusion** path that runs a panel of
models and synthesizes their answers. Fusion is delegated to the OpenRouter
`openrouter:fusion` server tool (BETA — see the caveat below). Routing is
spend-free: the decision is computed locally and only `fuse()` ever calls the
provider.

The primary per-query dial is **route vs. fuse**, expressed as three tiers:

```
difficulty < budget_threshold -> single (cheapest single model)
budget_threshold <= difficulty < threshold -> budget_fusion (cheap Budget panel)
difficulty >= threshold -> fusion (full Quality panel)
```

Set `budget_threshold: null` (or any value `>= threshold`) to disable the middle
tier and collapse to plain single/fusion. A `high_stakes: true` flag on a query
forces the full Quality `fusion` tier regardless of difficulty.
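
The tier rules above can be sketched as a small pure function. This is a minimal illustration of the documented thresholds, not the `RouteGate` implementation; the function name `resolve_tier` and its signature are assumptions.

```python
from typing import Optional


def resolve_tier(
    difficulty: float,
    threshold: float = 0.5,
    budget_threshold: Optional[float] = 0.3,
    high_stakes: bool = False,
) -> str:
    """Map an estimated difficulty to one of the three documented tiers.

    Hypothetical sketch: defaults mirror the README's config table.
    """
    # high_stakes forces the full Quality panel regardless of difficulty.
    if high_stakes or difficulty >= threshold:
        return "fusion"
    # The middle tier exists only when budget_threshold is set below threshold.
    middle_enabled = budget_threshold is not None and budget_threshold < threshold
    if middle_enabled and difficulty >= budget_threshold:
        return "budget_fusion"
    return "single"


for d in (0.1, 0.3, 0.5):
    print(d, resolve_tier(d))
```

With the middle tier disabled (`budget_threshold=None`), a difficulty of `0.4` falls through to `single` rather than `budget_fusion`.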

## Usage

```bash
# Inference (routes, then fuses via openrouter:fusion if the gate escalates)
llmrouter infer --router fusion_gate \
--config custom_routers/fusion_gate/config.yaml \
--query "Prove that the square root of 2 is irrational."

# Route-only — compute the decision with ZERO spend / no network call
llmrouter infer --router fusion_gate \
--config custom_routers/fusion_gate/config.yaml \
--query "What is the capital of France?" \
--route-only
```

`--route-only` returns the decision dict (tier, panel, judge, projected cost)
without ever calling OpenRouter. Spend happens only when `fuse()` is invoked.

## Decision contract

`route_single` returns one of two shapes (both carry `strategy`, `tier`, and
`model_name` for drop-in CLI compatibility):

- **single:** `{query, strategy="single", tier="single", model_name, predicted_llm, difficulty, confidence}`
- **fusion:** `{query, strategy="fusion", tier="budget_fusion"|"fusion", panel[], judge, model_name, predicted_llm, difficulty, confidence, projected_cost}`

When the cost guard fires, a fusion decision is **downgraded** to single and the
result carries `downgraded_from`, `projected_cost`, and `cost_ceiling`.
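
As a rough illustration of the downgrade, the sketch below rewrites a fusion decision into the single shape when the projected cost exceeds the cap. The function name `apply_cost_guard`, the `single_model` argument, and the choice to store the original tier name in `downgraded_from` are assumptions; only the field names come from the contract above.

```python
from typing import Any, Dict, Optional


def apply_cost_guard(
    decision: Dict[str, Any],
    cost_ceiling: Optional[float],
    single_model: str,
) -> Dict[str, Any]:
    """Downgrade a fusion decision to single when it would exceed the cap.

    Hypothetical sketch: `single_model` is whichever model the single path
    would have picked for this query.
    """
    # Guard is off, or the decision is already single: nothing to do.
    if cost_ceiling is None or decision.get("strategy") != "fusion":
        return decision
    projected = decision["projected_cost"]
    if projected <= cost_ceiling:
        return decision
    return {
        "query": decision["query"],
        "strategy": "single",
        "tier": "single",
        "model_name": single_model,
        "predicted_llm": single_model,
        "difficulty": decision["difficulty"],
        "confidence": decision["confidence"],
        # Assumption: record which tier the query was downgraded from.
        "downgraded_from": decision["tier"],
        "projected_cost": projected,
        "cost_ceiling": cost_ceiling,
    }
```

Keeping `projected_cost` and `cost_ceiling` on the downgraded result lets a caller see why the cheaper path was taken.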

## Configuration

All keys live under `hparam:` in `config.yaml` unless noted.

| Key | Default | Purpose |
|-----|---------|---------|
| `threshold` | `0.5` | Difficulty cutoff to escalate to the full Quality `fusion` tier. |
| `budget_threshold` | `0.3` | Lower boundary of the middle `budget_fusion` tier. `null` (or `>= threshold`) disables it. |
| `k` | `3` | Panel size — maps to the tool's `analysis_models`. |
| `judge` | `null` | Judge model slug — maps to the tool's `model`. `null` = use the outer model. |
| `panel_preset` | `Quality` | Fallback preset (`Quality` / `Budget`) when capability data is unavailable for a query. |
| `cost_ceiling` | `null` | Hard per-query **dollar** cap on the projected `Σ(panel)+judge` cost. `null` = off. See the cost-unit note. |
| `est_completion_tokens` | `512` | Per-completion output-token estimate feeding the dollar cost projection. |
| `provider` | `OpenRouter` | Informational; drives credential resolution. |
| `base_url` | `https://openrouter.ai/api/v1` | OpenRouter endpoint hosting the beta server tool. Overrides the top-level `api_endpoint`. |
| `log_sink_path` | `null` | JSONL sink for fusion-call logging. `null` = `fusion_log` default (`~/.llmrouter/openclaw_memory.jsonl`). |

Top-level `data_path` / `metric` keys mirror the other custom routers
(`randomrouter`, `thresholdrouter`); see `config.yaml` for the loaded candidate
and routing-data paths.

### Cost-unit note (important)

`cost_ceiling` is compared against `project_cost`, which estimates the **per-query
dollar cost** of the panel + judge. Each member contributes
`(input_price · prompt_tokens + output_price · completion_tokens) / 1e6` dollars, where
`input_price` / `output_price` are the per-million-token prices from `llm_data`,
`prompt_tokens ≈ len(query) // 4`, and `completion_tokens = est_completion_tokens`
(default `512`); the projection is the sum over all members. Set `cost_ceiling` in
**dollars per query** (e.g. `0.05` caps each query at five cents).
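
The projection can be sketched in a few lines. This is an illustration of the formula above under stated assumptions: `llm_data` is modelled as a plain dict keyed by model name, the model names and prices are made up, and the function is not claimed to match the plugin's `project_cost` signature.

```python
from typing import Dict, Iterable


def project_cost(
    query: str,
    members: Iterable[str],
    llm_data: Dict[str, Dict[str, float]],
    est_completion_tokens: int = 512,
) -> float:
    """Estimate the per-query dollar cost of the panel members plus the judge.

    Hypothetical sketch: prices are dollars per million tokens.
    """
    prompt_tokens = len(query) // 4  # rough characters-to-tokens heuristic
    total = 0.0
    for name in members:
        prices = llm_data[name]
        total += (
            prices["input_price"] * prompt_tokens
            + prices["output_price"] * est_completion_tokens
        ) / 1e6
    return total


# Made-up prices for illustration only.
llm_data = {
    "model-a": {"input_price": 1.0, "output_price": 2.0},
    "judge-x": {"input_price": 3.0, "output_price": 15.0},
}
query = "x" * 400  # 400 characters, so about 100 prompt tokens
cost = project_cost(query, ["model-a", "judge-x"], llm_data)
print(f"projected ${cost:.6f} per query; over a $0.05 cap: {cost > 0.05}")
```

Because the output-token term usually dominates for short queries, `est_completion_tokens` is the main knob for how conservative the projection is.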

## Panel selection

Panels are chosen by `CapabilityScorer`, which scores candidates per **query
category** (code / math / reasoning / general) from the LLMRouter routing-data
tables, lightly cost-penalized. When no usable capability data exists for a
query's category, selection falls back to a preset panel resolved by tier:
`budget_fusion` -> `Budget`, anything else -> the configured `panel_preset`
(`Quality` by default). The tier->preset mapping (`gate.resolve_preset`) is the
single source of truth shared with the eval harness.
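
A minimal sketch of the fallback logic described above, assuming capability scores arrive as a plain `{model: score}` dict for the query's category. The preset members are placeholders, and `select_panel` is a hypothetical helper; only the tier-to-preset rule itself is taken from the text.

```python
from typing import Dict, List, Optional

# Placeholder preset panels; the real members are defined by the plugin.
PRESETS: Dict[str, List[str]] = {
    "Quality": ["quality-model-1", "quality-model-2", "quality-model-3"],
    "Budget": ["budget-model-1", "budget-model-2", "budget-model-3"],
}


def resolve_preset(tier: str, panel_preset: str = "Quality") -> str:
    """budget_fusion always maps to Budget; any other tier uses the config."""
    return "Budget" if tier == "budget_fusion" else panel_preset


def select_panel(
    tier: str,
    k: int,
    category_scores: Optional[Dict[str, float]] = None,
    panel_preset: str = "Quality",
) -> List[str]:
    """Pick the top-k scored models, or fall back to the tier's preset."""
    if category_scores:
        ranked = sorted(category_scores, key=category_scores.get, reverse=True)
        return ranked[:k]
    return PRESETS[resolve_preset(tier, panel_preset)][:k]
```

Routing both the router and the eval harness through one `resolve_preset` keeps the two from drifting apart on which preset a tier implies.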

## OpenRouter `openrouter:fusion` — BETA caveat

The fusion path depends on OpenRouter's `openrouter:fusion` **server tool, which
is BETA**: its request/response shape may change without notice. To contain that
risk, **every OpenRouter HTTP specific lives in `executor.py` and nowhere else**
— request body construction, the `openrouter:fusion` tool type, key resolution,
transport, and payload parsing. An upstream beta change should touch that one
file only. The executor also tolerates judge failure (status `ok` with
`analysis` omitted): it synthesizes the answer from the panel responses rather
than crashing.

OpenRouter is the **one optional provider**. There is no local fan-out fallback
(deferred to a follow-up); without a key, only `--route-only` is exercisable.

## Logging

Every `fuse()` call is appended (best-effort, append-only) to the JSONL sink via
`fusion_log.log_fusion`. The sink is **secret-scrubbed**: API keys, auth
headers, cookies, and the untouched provider payload are never written; only an
enumerated set of fields (query, panel, judge, normalized responses, analysis,
token/cost) is emitted. These rows are the FusionFactory-style training signal
consumed by the offline retrain step.
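
One common way to get this guarantee is an allowlist: copy only enumerated fields into the row and drop everything else, so a new secret-bearing field can never leak by default. The sketch below illustrates that pattern; the exact key names (`responses`, `tokens`, `cost`) and both function names are assumptions, not the `fusion_log` API.

```python
import json
from typing import Any, Dict

# Assumed key names for the enumerated fields described above.
ALLOWED_FIELDS = ("query", "panel", "judge", "responses", "analysis", "tokens", "cost")


def scrub(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allowlisted fields; keys, headers, raw payloads are dropped."""
    return {key: record[key] for key in ALLOWED_FIELDS if key in record}


def append_row(path: str, record: Dict[str, Any]) -> bool:
    """Best-effort append of one scrubbed JSONL row; never raises on I/O errors."""
    try:
        line = json.dumps(scrub(record), ensure_ascii=False)
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")
        return True
    except (OSError, TypeError, ValueError):
        # Logging must not break the fusion call itself.
        return False
```

An allowlist fails closed: a field the provider adds later stays out of the sink until it is explicitly enumerated.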

## Offline evaluation (`--mock`, zero spend)

The three-arm harness compares `always_route`, `always_fuse`, and `fusion_gate`
over a bundled 16-query dataset (6 easy + 10 hard; GSM8K / MATH / GPQA / MBPP).
It is **offline by default**: a deterministic stub executor reads canned answers
from fixtures, no network call is made, and nothing is spent.

```bash
# Run the offline harness (mock is the default)
python -m custom_routers.fusion_gate.eval.eval_harness --mock \
--out custom_routers/fusion_gate/eval/out

# Include the mock retrain (M3 before/after) delta in results.md
python -m custom_routers.fusion_gate.eval.eval_harness --mock --with-retrain \
--out custom_routers/fusion_gate/eval/out
```

Tunable flags (harness defaults in parentheses): `--threshold` (0.5),
`--budget-threshold` (0.3), `--k` (2), `--judge`, `--panel-preset`, `--dataset`,
`--llm`, `--routing`, `--out`. The harness keeps `k=2` to stay cost-bounded for
the M2 target, whereas the plugin config uses `k=3`.
Outputs: `<out>/results.csv` and `<out>/results.md`. The `--out` dir defaults to
`eval/out/`, which is **gitignored** (runtime output, not source). The committed,
intentional report lives at [`eval/RESULTS.md`](eval/RESULTS.md), which also documents
the full-dataset vs hard-slice definitions used by the metrics.

`--live` is intentionally **not** wired into this harness, so a stray run cannot
spend; passing it errors out with a pointer to the keyed live-run path.

Run the unit tests (torch-free, fully offline, HTTP mocked):

```bash
pytest custom_routers/fusion_gate/tests/
```

## Live run (keyed, real spend)

The committed eval numbers are from MOCK fixtures. To produce real M1–M4 numbers,
run with an OpenRouter key against real models:

```bash
# Provide an OpenRouter key (never commit it):
export OPENROUTER_API_KEY=sk-... # or: export API_KEYS='{"OpenRouter": "sk-..."}'

# Then build the real FusionGateRouter from config.yaml and route+fuse a real
# benchmark slice; the executor makes the openrouter:fusion calls. The offline
# eval harness does NOT make live calls by design — see eval/RESULTS.md.
```

Keys are resolved (in order) from an explicit `api_keys={"OpenRouter": "..."}`
dict, `OPENROUTER_API_KEY`, or an `API_KEYS` JSON env var. Keys are never logged.
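
The resolution order can be sketched as below. This is an illustration of the documented precedence only; the function name `resolve_key` and the injectable `env` parameter (added here to make the logic testable) are assumptions rather than the executor's actual interface.

```python
import json
import os
from typing import Mapping, Optional


def resolve_key(
    api_keys: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
    provider: str = "OpenRouter",
) -> Optional[str]:
    """Return the first key found: explicit dict, then env var, then API_KEYS JSON."""
    env = os.environ if env is None else env
    # 1. Explicit api_keys={"OpenRouter": "..."} wins.
    if api_keys and api_keys.get(provider):
        return api_keys[provider]
    # 2. Dedicated environment variable.
    if env.get("OPENROUTER_API_KEY"):
        return env["OPENROUTER_API_KEY"]
    # 3. JSON-encoded API_KEYS environment variable.
    raw = env.get("API_KEYS")
    if raw:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return None
        if isinstance(parsed, dict):
            return parsed.get(provider) or None
    return None
```

Returning `None` instead of raising lets the caller decide whether a missing key is an error or simply means only `--route-only` is available.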

## Files

- `router.py` — `FusionGateRouter` entry point (MetaRouter contract).
- `gate.py` — `RouteGate`, `GateDecision`, the three-tier dial, `resolve_preset`.
- `capability.py` — `CapabilityScorer` panel selection.
- `executor.py` — **the only** OpenRouter `openrouter:fusion` blast point.
- `fusion_log.py` — secret-scrubbed JSONL logging + training-row decomposition.
- `eval/` — three-arm offline harness, fixtures, retrain, and `RESULTS.md` (the committed report; `eval/out/` is gitignored runtime output).
- `tests/` — torch-free offline unit tests.
34 changes: 34 additions & 0 deletions custom_routers/fusion_gate/__init__.py
@@ -0,0 +1,34 @@
"""fusion_gate — route-vs-fuse meta-router plugin for LLMRouter.

Auto-discovered from ./custom_routers/ . See router.py for the entry point.

``FusionGateRouter`` is imported LAZILY (PEP 562 ``__getattr__``) rather than
eagerly: ``router.py`` pulls in torch (MetaRouter subclasses ``nn.Module``), and
an eager import here would force torch to load whenever this package is merely
*resolved* — which pytest does for every test module under ``tests/`` while
walking the package hierarchy. That made the four torch-free test modules
uncollectable under the standard ``pytest custom_routers/fusion_gate/tests/``
invocation (ModuleNotFoundError: No module named 'torch'). Deferring the import
to first attribute access keeps package resolution torch-free while still
exposing ``FusionGateRouter`` as a top-level name when it is actually used.
"""

from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:  # import for type-checkers only; not executed at runtime
    from .router import FusionGateRouter

__all__ = ["FusionGateRouter"]


def __getattr__(name: str) -> Any:
    """Lazily import ``FusionGateRouter`` on first access (PEP 562).

    torch (a transitive dependency of ``router.py``) is loaded only when the
    router is actually requested, not at package-collection time.
    """
    if name == "FusionGateRouter":
        from .router import FusionGateRouter

        return FusionGateRouter
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")