A 627-line FastAPI proxy that implements Self-MoA (ICLR 2025) as an OpenAI-shape /v1/chat/completions endpoint. You point your chat frontend at moa-router; it fans every query to two upstream LLM endpoints (with different temperature/sampling), then a third pass through one of them synthesizes the two candidates verbatim into a single best answer.
Published Self-MoA lift: +3.8% – +6.6% on AlpacaEval / MATH / GSM8K vs the strongest single upstream, with the same model on both sides. The headline insight from the paper is that the ensemble doesn't need diverse models — just diverse samples from the same model.
                     POST /v1/chat/completions
                                 │
                                 ▼
                         ┌────────────────┐
                         │   moa-router   │ :8056
                         └─┬───────────┬──┘
                    T=0.5  │           │  T=0.9
             ┌─────────────┘           └────────────┐
             ▼                                      ▼
        upstream-A                             upstream-B
    (any OpenAI-shape)                     (any OpenAI-shape)
             │                                      │
             └────────┬─────────────────┬───────────┘
                      ▼                 ▼
                 candidate 1       candidate 2
                      │                 │
                      └────────┬────────┘
                               ▼
                        ┌───────────────┐
                        │  AGGREGATOR   │ ← one more call to upstream-A
                        │  (Self-MoA    │   with the togethercomputer/MoA
                        │   synthesis)  │   verbatim-synthesis prompt
                        └──────┬────────┘
                               ▼
                    single OpenAI response
                    streams back to client
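
The flow in the diagram can be sketched in a few lines of asyncio. This is a minimal illustration, not the code in app.py: `call_upstream` is a hypothetical injected coroutine standing in for the HTTP call, `build_aggregator_messages` is an illustrative helper, and `AGGREGATOR_PROMPT` is placeholder wording rather than the togethercomputer/MoA prompt the router ships.

```python
import asyncio
from typing import Awaitable, Callable

Message = dict[str, str]
# Hypothetical signature: (upstream name, messages, temperature) -> completion text.
CallUpstream = Callable[[str, list[Message], float], Awaitable[str]]

# Placeholder wording (assumption); the real router uses the togethercomputer/MoA prompt.
AGGREGATOR_PROMPT = "Synthesize the candidate responses below into one accurate answer.\n\n"


def build_aggregator_messages(messages: list[Message], candidates: list[str]) -> list[Message]:
    """Prepend a system message that lists the numbered candidates."""
    numbered = "\n\n".join(f"{i}. {text}" for i, text in enumerate(candidates, 1))
    return [{"role": "system", "content": AGGREGATOR_PROMPT + numbered}, *messages]


async def self_moa(
    messages: list[Message],
    call_upstream: CallUpstream,
    temp_a: float = 0.5,
    temp_b: float = 0.9,
    temp_agg: float = 0.3,
) -> str:
    # Fan out: both candidates are requested concurrently.
    cand_a, cand_b = await asyncio.gather(
        call_upstream("A", messages, temp_a),
        call_upstream("B", messages, temp_b),
    )
    # Aggregate: one more call to upstream A with both candidates in context.
    agg_messages = build_aggregator_messages(messages, [cand_a, cand_b])
    return await call_upstream("A", agg_messages, temp_agg)
```

Because the two candidate calls run concurrently, wall-clock time in this sketch is roughly the slower candidate plus the aggregator pass, not the sum of all three calls.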
Self-MoA papers exist. The togethercomputer/MoA reference exists. But there's no stateless, OpenAI-compatible proxy that drops into any frontend. Most implementations are notebooks or framework-coupled (LangGraph, AutoGen, etc.).
moa-router is the boring proxy: stand it up between your chat and your LLM endpoints, and the lift lands in production without your frontend knowing anything changed.
docker run --rm -d \
--name moa-router \
-p 8056:8056 \
-e UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
-e UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
ghcr.io/karany97/moa-router:latest

git clone https://github.com/karany97/moa-router.git
cd moa-router
pip install -r requirements.txt
UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
uvicorn app:app --host 0.0.0.0 --port 8056

# Health
curl http://localhost:8056/health
# → {"ok":true,"upstream_a":"http://...","upstream_b":"http://..."}
# Chat (Self-MoA mode — default)
curl http://localhost:8056/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "moa-router",
"messages": [{"role":"user","content":"What is 17 * 24 + sqrt(144)?"}]
}'
# Bypass mode — single upstream only (for A/B comparison)
curl 'http://localhost:8056/v1/chat/completions?mode=single' -H ... -d '{...}'

| Env var | Default | What it does |
|---|---|---|
| UPSTREAM_A_URL | http://upstream-a:8000/v1/chat/completions | Primary LLM endpoint (also the aggregator) |
| UPSTREAM_B_URL | http://upstream-b:8000/v1/chat/completions | Secondary LLM endpoint (provides candidate B) |
| UPSTREAM_API_KEY | (empty) | Bearer token forwarded to both upstreams |
| MOA_TEMP_A | 0.5 | Temperature for candidate A |
| MOA_TEMP_B | 0.9 | Temperature for candidate B |
| MOA_AGGREGATOR_TEMP | 0.3 | Temperature for the synthesis pass |
| MOA_TIMEOUT | 30 | Per-upstream timeout (seconds) |
| MOA_DEFAULT_MODEL | moa-router | The model name reported in responses |
| PORT | 8056 | moa-router's listen port |
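
For reference, here is one way the variables above could be read into a typed settings object. It is a sketch under assumptions: the `Settings` dataclass and `load_settings` helper are illustrative names, not necessarily what app.py uses. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    upstream_a_url: str
    upstream_b_url: str
    upstream_api_key: str
    temp_a: float
    temp_b: float
    aggregator_temp: float
    timeout: float
    default_model: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable from `env`, falling back to the documented default."""
    return Settings(
        upstream_a_url=env.get("UPSTREAM_A_URL", "http://upstream-a:8000/v1/chat/completions"),
        upstream_b_url=env.get("UPSTREAM_B_URL", "http://upstream-b:8000/v1/chat/completions"),
        upstream_api_key=env.get("UPSTREAM_API_KEY", ""),
        temp_a=float(env.get("MOA_TEMP_A", "0.5")),
        temp_b=float(env.get("MOA_TEMP_B", "0.9")),
        aggregator_temp=float(env.get("MOA_AGGREGATOR_TEMP", "0.3")),
        timeout=float(env.get("MOA_TIMEOUT", "30")),
        default_model=env.get("MOA_DEFAULT_MODEL", "moa-router"),
        port=int(env.get("PORT", "8056")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"MOA_TIMEOUT": "5"})`.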
| Mode | Trigger | What happens |
|---|---|---|
| moa | Default | Fan to A + B, aggregate via A |
| single | ?mode=single query param | Direct passthrough to upstream A |
| best-of-2 (planned) | ?mode=bo2 | Same as moa, but the aggregator picks the better candidate instead of synthesizing |
| race (planned) | ?mode=race | First-to-respond wins, other is cancelled |
The single mode is critical for A/B benchmarking — you can hit the same endpoint with and without ?mode=single and compare quality + latency on identical inputs.
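
A small stdlib-only harness for that comparison might look like the following. It is a sketch: `build_request`, `send`, and `compare` are illustrative helper names, the base URL assumes the local quick-start setup, and the 120-second client timeout is an arbitrary choice.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Assumes the router from the quick start is listening locally.
BASE_URL = "http://localhost:8056/v1/chat/completions"


def build_request(prompt: str, mode: Optional[str] = None) -> urllib.request.Request:
    """Build the POST request; mode=None means the default moa mode."""
    url = BASE_URL if mode is None else f"{BASE_URL}?mode={mode}"
    body = json.dumps(
        {"model": "moa-router", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.load(resp)


def compare(prompt: str, sender: Callable[[urllib.request.Request], dict] = send) -> dict:
    """Run the same prompt in moa and single mode, recording latency and answer."""
    results = {}
    for label, mode in (("moa", None), ("single", "single")):
        started = time.perf_counter()
        reply = sender(build_request(prompt, mode))
        results[label] = {
            "seconds": time.perf_counter() - started,
            "answer": reply["choices"][0]["message"]["content"],
        }
    return results
```

Calling `compare("What is 17 * 24 + sqrt(144)?")` against a running router returns both answers with their wall-clock times. The `sender` parameter exists so the harness can be driven by recorded fixtures instead of live upstreams.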
| Surface | Status |
|---|---|
| POST /v1/chat/completions (non-stream) | ✅ |
| POST /v1/chat/completions (stream: the aggregator streams; candidates are collected first) | ✅ |
| GET /health | ✅ |
| Multi-turn conversations | ✅ |
| Tool calls (passthrough) | ✅ (no aggregation logic specific to tools yet — both upstreams must return identical tool calls or one wins) |
| Vision inputs | ✅ if both upstreams support them |
| Streaming with usage.cost | ✅ |
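
To make the streaming row concrete: once both candidates are in hand, only the aggregator's output needs to be relayed as OpenAI-style server-sent events. The generator below is a rough sketch of that framing, not the router's actual code. `sse_chunks` is a hypothetical helper, and real chunks also carry fields such as `id` and `created` that are omitted here.

```python
import json
from typing import Iterable, Iterator


def sse_chunks(pieces: Iterable[str], model: str = "moa-router") -> Iterator[str]:
    """Frame text pieces as OpenAI-style chat.completion.chunk SSE events."""
    for piece in pieces:
        chunk = {
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [{"index": 0, "delta": {"content": piece}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Final chunk: empty delta with a finish reason, then the stream terminator.
    final = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"
```

In a FastAPI app, a generator like this would typically be wrapped in a `StreamingResponse` with media type `text/event-stream`.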
- Auth. Put a reverse-proxy in front. Same pattern as our tooltalk repo.
- Rate-limiting. Same.
- Per-user model routing. moa-router knows two upstreams, period. For per-user logic, point it at LiteLLM.
- Aggregator strategies beyond verbatim synthesis. The paper's verbatim-synthesis prompt is the default and the only first-class strategy. A best-of-N variant is on the roadmap.
| Mode | TTFT | Total latency | Quality (vs single A) |
|---|---|---|---|
| mode=single (passthrough) | upstream-A TTFT | upstream-A latency | baseline |
| mode=moa (default) | aggregator TTFT (≈ 2× single) | ≈ 2.5× single | +3.8% – +6.6% on reasoning (per the Self-MoA paper) |
The latency tax is real: you're making 3 LLM calls per user turn instead of 1. Use moa-router on the hard queries (math, multi-step reasoning, code), not on every conversational message. Most chat frontends can route based on a temperature: 0.7+ threshold or an explicit "think harder" toggle.
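
On the frontend side, that routing decision can be a one-liner. The sketch below is only one interpretation: it assumes requests at or above the 0.7 temperature threshold, or with the toggle on, go through the default moa mode, and everything else uses ?mode=single. `pick_url`, `ROUTER_URL`, and the `think_harder` flag are illustrative names.

```python
from typing import Any, Mapping

# Assumed in-network address of the router; adjust to your deployment.
ROUTER_URL = "http://moa-router:8056/v1/chat/completions"
TEMPERATURE_THRESHOLD = 0.7


def pick_url(body: Mapping[str, Any], think_harder: bool = False) -> str:
    """Return the moa URL for hard queries and the single-mode URL otherwise."""
    temperature = body.get("temperature")
    wants_moa = think_harder or (
        isinstance(temperature, (int, float)) and temperature >= TEMPERATURE_THRESHOLD
    )
    return ROUTER_URL if wants_moa else f"{ROUTER_URL}?mode=single"
```

With this, `pick_url({"temperature": 0.2})` stays on the single-call path, while `pick_url({"temperature": 0.2}, think_harder=True)` pays the three-call latency tax.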
✅ Good fit:
- You have two LLM endpoints (could be the same model behind two ports)
- You want measurable quality lift on reasoning queries
- You can tolerate 2-3× latency on the queries you route through it
- You want the lift without rewriting your chat client
❌ Bad fit:
- You only have one LLM endpoint
- Your queries are all conversational (no reasoning/math/code burden — the lift mostly disappears)
- Your latency budget is < 2× the single-call baseline
pip install -r requirements-dev.txt
pytest tests/ # 16 tests, ~0.3 s using recorded LLM fixtures

| You're using | Plug moa-router in like |
|---|---|
| tooltalk + Gemma 4 | tooltalk → moa-router → 2× tooltalks → 2× Gemma 4 ports |
| Destiny Atelier | Settings → Connection → set baseUrl to http://moa-router:8056/v1 |
| Open WebUI / LibreChat / Lobe Chat | OpenAI endpoint = http://moa-router:8056/v1 |
| LiteLLM | Add a model entry pointing api_base: http://moa-router:8056/v1 |
MIT. Fork it, sell it, ship it.
- The Self-MoA paper authors: Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin (ICLR 2025)
- togethercomputer/MoA — reference implementation whose verbatim-synthesis aggregator prompt we ship
- The Destiny Atelier sprint that ironed out the prod-grade streaming / cancellation / timeout handling