moa-router

Self-MoA voting as a drop-in OpenAI-compatible proxy.
One endpoint in, two upstreams polled, one synthesized answer out.

What it is

A 627-line FastAPI proxy that implements Self-MoA (ICLR 2025) as an OpenAI-shape /v1/chat/completions endpoint. You point your chat frontend at moa-router; it fans every query out to two upstream LLM endpoints at different sampling temperatures, then makes a third call to upstream A, which uses the verbatim-synthesis prompt to merge the two candidates into a single best answer.

Published Self-MoA lift: +3.8% – +6.6% on AlpacaEval / MATH / GSM8K vs the strongest single upstream, with the same model on both sides. The headline insight from the paper is that the ensemble doesn't need diverse models — just diverse samples from the same model.

              POST /v1/chat/completions
                          │
                          ▼
                  ┌────────────────┐
                  │   moa-router   │  :8056
                  └─┬───────────┬──┘
        T=0.5      │           │      T=0.9
    ┌─────────────┘           └────────────┐
    ▼                                      ▼
 upstream-A                            upstream-B
 (any OpenAI-shape)                    (any OpenAI-shape)
    │                                      │
    └────────┬─────────────────┬───────────┘
             ▼                 ▼
       candidate 1        candidate 2
             │                 │
             └────────┬────────┘
                      ▼
              ┌───────────────┐
              │  AGGREGATOR   │  ← one more call to upstream-A
              │  (Self-MoA    │     with the togethercomputer/MoA
              │   synthesis)  │     verbatim-synthesis prompt
              └──────┬────────┘
                     ▼
            single OpenAI response
            streams back to client

Why this exists

The Self-MoA paper exists. The togethercomputer/MoA reference implementation exists. But there's no stateless, OpenAI-compatible proxy that drops into any frontend. Most implementations are notebooks or are coupled to a framework (LangGraph, AutoGen, etc.).

moa-router is the boring proxy: stand it up between your chat and your LLM endpoints, and the lift lands in production without your frontend knowing anything changed.

Install

Docker

docker run --rm -d \
  --name moa-router \
  -p 8056:8056 \
  -e UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
  -e UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
  ghcr.io/karany97/moa-router:latest

From source

git clone https://github.com/karany97/moa-router.git
cd moa-router
pip install -r requirements.txt
UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
  uvicorn app:app --host 0.0.0.0 --port 8056

Verify

# Health
curl http://localhost:8056/health
# → {"ok":true,"upstream_a":"http://...","upstream_b":"http://..."}

# Chat (Self-MoA mode — default)
curl http://localhost:8056/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "moa-router",
  "messages": [{"role":"user","content":"What is 17 * 24 + sqrt(144)?"}]
}'

# Bypass mode — single upstream only (for A/B comparison)
curl 'http://localhost:8056/v1/chat/completions?mode=single' -H ... -d '{...}'
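
The same checks can be made from Python with only the standard library. This is a sketch under the assumption that moa-router returns the standard OpenAI chat-completions response shape; the helper names `build_chat_request` and `ask` are hypothetical and not part of the project.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, mode: str = "moa") -> urllib.request.Request:
    # mode="single" appends the ?mode=single bypass shown in the curl example.
    url = f"{base_url}/v1/chat/completions"
    if mode != "moa":
        url += f"?mode={mode}"
    body = {"model": "moa-router", "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, prompt: str, mode: str = "moa", timeout: float = 60.0) -> str:
    # Sends the request and extracts the assistant text from the response.
    req = build_chat_request(base_url, prompt, mode)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Only builds the request, so this runs without a live server.
    req = build_chat_request("http://localhost:8056", "What is 17 * 24 + sqrt(144)?", mode="single")
    print(req.get_method(), req.full_url)
```

With a router running locally, `ask("http://localhost:8056", "...")` and `ask("http://localhost:8056", "...", mode="single")` give the two answers to compare.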

Configure

Env var	Default	What it does
`UPSTREAM_A_URL`	`http://upstream-a:8000/v1/chat/completions`	Primary LLM endpoint (also the aggregator)
`UPSTREAM_B_URL`	`http://upstream-b:8000/v1/chat/completions`	Secondary LLM endpoint (provides candidate B)
`UPSTREAM_API_KEY`	`(empty)`	Bearer token forwarded to both upstreams
`MOA_TEMP_A`	`0.5`	Temperature for candidate A
`MOA_TEMP_B`	`0.9`	Temperature for candidate B
`MOA_AGGREGATOR_TEMP`	`0.3`	Temperature for the synthesis pass
`MOA_TIMEOUT`	`30`	Per-upstream timeout (seconds)
`MOA_DEFAULT_MODEL`	`moa-router`	The model name reported in responses
`PORT`	`8056`	moa-router's listen port
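
As a rough picture of how these variables might be consumed, here is a sketch that reads them with the defaults from the table. The `Settings` dataclass, `load_settings`, and `upstream_headers` are hypothetical names for illustration; the real app.py may structure its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    upstream_a_url: str
    upstream_b_url: str
    api_key: str
    temp_a: float
    temp_b: float
    aggregator_temp: float
    timeout: float
    default_model: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented env vars, falling back to the documented defaults."""
    return Settings(
        upstream_a_url=env.get("UPSTREAM_A_URL", "http://upstream-a:8000/v1/chat/completions"),
        upstream_b_url=env.get("UPSTREAM_B_URL", "http://upstream-b:8000/v1/chat/completions"),
        api_key=env.get("UPSTREAM_API_KEY", ""),
        temp_a=float(env.get("MOA_TEMP_A", "0.5")),
        temp_b=float(env.get("MOA_TEMP_B", "0.9")),
        aggregator_temp=float(env.get("MOA_AGGREGATOR_TEMP", "0.3")),
        timeout=float(env.get("MOA_TIMEOUT", "30")),
        default_model=env.get("MOA_DEFAULT_MODEL", "moa-router"),
        port=int(env.get("PORT", "8056")),
    )


def upstream_headers(settings: Settings) -> dict:
    # Assumption: the Authorization header is sent only when a key is configured.
    headers = {"Content-Type": "application/json"}
    if settings.api_key:
        headers["Authorization"] = f"Bearer {settings.api_key}"
    return headers


if __name__ == "__main__":
    print(load_settings({}))
```

Taking the environment as a parameter makes the defaults easy to check in a test without touching the real process environment.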

Modes

Mode	Trigger	What happens
moa	Default	Fan to A + B, aggregate via A
single	`?mode=single` query param	Direct passthrough to upstream A
best-of-2 (planned)	`?mode=bo2`	Same as moa, but the aggregator picks the better candidate instead of synthesizing
race (planned)	`?mode=race`	First to respond wins; the other request is cancelled

The single mode is critical for A/B benchmarking — you can hit the same endpoint with and without ?mode=single and compare quality + latency on identical inputs.

What's supported

Surface	Status
`POST /v1/chat/completions` (non-stream)	✅
`POST /v1/chat/completions` (stream — aggregator streams; candidates collected first)	✅
`GET /health`	✅
Multi-turn conversations	✅
Tool calls (passthrough)	✅ (no tool-specific aggregation logic yet: both upstreams must return identical tool calls, otherwise one upstream's calls win)
Vision inputs	✅ if both upstreams support them
Streaming with `usage.cost`	✅
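
The streaming row says the candidates are collected first and only the aggregator is streamed. A minimal sketch of that ordering, using the usual OpenAI server-sent-event framing, is shown below. The function names and the fake candidate and aggregator sources are assumptions for illustration, not the project's internals.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


def sse_chunk(text: str, model: str = "moa-router") -> str:
    # One OpenAI-style streaming chunk, framed as a server-sent event.
    payload = {
        "object": "chat.completion.chunk",
        "model": model,
        "choices": [{"index": 0, "delta": {"content": text}, "finish_reason": None}],
    }
    return f"data: {json.dumps(payload)}\n\n"


async def stream_moa(
    get_candidates: Callable[[], Awaitable[list[str]]],
    aggregate_stream: Callable[[list[str]], AsyncIterator[str]],
) -> AsyncIterator[str]:
    # Phase 1: wait for both candidates; the client receives nothing yet.
    candidates = await get_candidates()
    # Phase 2: relay aggregator tokens to the client as they arrive.
    async for token in aggregate_stream(candidates):
        yield sse_chunk(token)
    yield "data: [DONE]\n\n"


async def _fake_candidates() -> list[str]:
    return ["420", "17 * 24 = 408, plus 12 is 420"]


async def _fake_aggregator(candidates: list[str]) -> AsyncIterator[str]:
    for token in ("4", "2", "0"):
        yield token


async def collect_events() -> list[str]:
    return [event async for event in stream_moa(_fake_candidates, _fake_aggregator)]


if __name__ == "__main__":
    for event in asyncio.run(collect_events()):
        print(event, end="")
```

In a FastAPI app, an async generator like `stream_moa` is the kind of object typically handed to a `StreamingResponse` with media type `text/event-stream`.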

What's NOT supported (deliberate)

Auth. Put a reverse-proxy in front. Same pattern as our tooltalk repo.
Rate-limiting. Same.
Per-user model routing. moa-router knows two upstreams, period. For per-user logic, point it at LiteLLM.
Aggregator strategies beyond verbatim synthesis. The paper's verbatim-synthesis prompt is the default and the only first-class strategy. A best-of-N variant is on the roadmap.
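
For readers unfamiliar with the synthesis step, here is a sketch of how the aggregator request could be assembled. It is loosely modeled on the reference implementation's approach of injecting the candidates through a system message; `SYNTH_PROMPT` is a placeholder string, not the prompt the project ships, and `build_aggregator_messages` is a hypothetical helper.

```python
import copy

# Placeholder text only. The project ships the togethercomputer/MoA prompt,
# which is not reproduced here.
SYNTH_PROMPT = "Synthesize the responses below into a single high-quality answer."


def build_aggregator_messages(messages: list[dict], candidates: list[str]) -> list[dict]:
    # Number the candidates so the aggregator can tell them apart.
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(candidates, start=1))
    system = f"{SYNTH_PROMPT}\n\nResponses from models:\n{numbered}"
    out = copy.deepcopy(messages)  # never mutate the caller's conversation
    if out and out[0]["role"] == "system":
        # Keep the caller's system prompt and append the synthesis instructions.
        out[0]["content"] += "\n\n" + system
    else:
        out.insert(0, {"role": "system", "content": system})
    return out


if __name__ == "__main__":
    history = [{"role": "user", "content": "What is 17 * 24 + sqrt(144)?"}]
    for message in build_aggregator_messages(history, ["420", "408 + 12 = 420"]):
        print(message["role"], "->", message["content"])
```

Keeping the full conversation after the system message is what lets multi-turn chats pass through the aggregator unchanged.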

Performance

	TTFT	Total latency	Quality (vs single A)
`mode=single` (passthrough)	upstream-A TTFT	upstream-A latency	baseline
`mode=moa` (default)	aggregator TTFT (≈ 2× single)	≈ 2.5× single	+3.8% – +6.6% on reasoning (per the Self-MoA paper)
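
One way to see where a figure like ≈ 2.5× can come from: if the two candidate calls run concurrently (an assumption), the total is the slower candidate plus the aggregator pass, and the aggregator pass reads a longer prompt than a single call does. The numbers below are invented purely to illustrate the arithmetic; they are not measurements.

```python
def moa_total_ms(t_a_ms: int, t_b_ms: int, t_agg_ms: int) -> int:
    # Candidates are assumed to run concurrently, so the slower one gates
    # the aggregator; the aggregator call then runs on its own.
    return max(t_a_ms, t_b_ms) + t_agg_ms


if __name__ == "__main__":
    single_ms = 2000  # hypothetical latency of one direct call to upstream A
    total_ms = moa_total_ms(t_a_ms=2000, t_b_ms=2400, t_agg_ms=2600)
    print(total_ms, total_ms / single_ms)
```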

The latency tax is real: you're making 3 LLM calls per user turn instead of 1. Use moa-router on the hard queries (math, multi-step reasoning, code), not on every conversational message. Most chat frontends can route on a simple rule, such as a request temperature of 0.7 or higher, or on an explicit "think harder" toggle.
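
A frontend-side routing rule of that kind could be as small as the sketch below. `pick_mode` and `router_url` are hypothetical helpers, and the 0.7 threshold is just the example value from the paragraph above, not a tuned setting.

```python
def pick_mode(request_body: dict, think_harder: bool = False, temp_threshold: float = 0.7) -> str:
    # An explicit toggle always wins.
    if think_harder:
        return "moa"
    # Otherwise use the request temperature as a cheap proxy for "hard query".
    temperature = request_body.get("temperature")
    if temperature is not None and temperature >= temp_threshold:
        return "moa"
    return "single"


def router_url(base_url: str, mode: str) -> str:
    # moa is the default mode, so only the bypass needs a query parameter.
    url = f"{base_url}/v1/chat/completions"
    return url if mode == "moa" else f"{url}?mode={mode}"


if __name__ == "__main__":
    body = {"messages": [{"role": "user", "content": "hi"}], "temperature": 0.2}
    print(router_url("http://localhost:8056", pick_mode(body)))
```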

When to use moa-router

✅ Good fit:

You have two LLM endpoints (could be the same model behind two ports)
You want measurable quality lift on reasoning queries
You can tolerate 2-3× latency on the queries you route through it
You want the lift without rewriting your chat client

❌ Bad fit:

You only have one LLM endpoint
Your queries are all conversational (no reasoning/math/code burden — the lift mostly disappears)
Your latency budget is < 2× the single-call baseline

Tests

pip install -r requirements-dev.txt
pytest tests/        # 16 tests, ~0.3 s using recorded LLM fixtures

Compose with our other tools

You're using	Plug moa-router in like
tooltalk + Gemma 4	`tooltalk → moa-router → 2× tooltalks → 2× Gemma 4 ports`
Destiny Atelier	Settings → Connection → set baseUrl to `http://moa-router:8056/v1`
Open WebUI / LibreChat / Lobe Chat	OpenAI endpoint = `http://moa-router:8056/v1`
LiteLLM	Add a model entry pointing `api_base: http://moa-router:8056/v1`

License

MIT. Fork it, sell it, ship it.

Acknowledgements

The Self-MoA paper authors: Wenzhe Li, Yong Lin, Mengzhou Xia, Chi Jin (ICLR 2025)
togethercomputer/MoA — reference implementation whose verbatim-synthesis aggregator prompt we ship
The Destiny Atelier sprint that ironed out the prod-grade streaming / cancellation / timeout handling
