karany97/moa-router

moa-router

Self-MoA voting as a drop-in OpenAI-compatible proxy.
One endpoint in, two upstreams polled, one synthesized answer out.

License: MIT | Paper: Self-MoA (ICLR 2025) | Surface: FastAPI


What it is

A 627-line FastAPI proxy that implements Self-MoA (ICLR 2025) as an OpenAI-shape /v1/chat/completions endpoint. You point your chat frontend at moa-router; it fans every query out to two upstream LLM endpoints with different sampling temperatures (0.5 and 0.9 by default), then makes a third call to upstream A, which synthesizes the two candidates into a single answer using the verbatim-synthesis prompt from togethercomputer/MoA.

Published Self-MoA lift: +3.8% – +6.6% on AlpacaEval / MATH / GSM8K vs the strongest single upstream, with the same model on both sides. The headline insight from the paper is that the ensemble doesn't need diverse models — just diverse samples from the same model.

              POST /v1/chat/completions
                          │
                          ▼
                  ┌────────────────┐
                  │   moa-router   │  :8056
                  └─┬───────────┬──┘
        T=0.5      │           │      T=0.9
    ┌─────────────┘           └────────────┐
    ▼                                      ▼
 upstream-A                            upstream-B
 (any OpenAI-shape)                    (any OpenAI-shape)
    │                                      │
    └────────┬─────────────────┬───────────┘
             ▼                 ▼
       candidate 1        candidate 2
             │                 │
             └────────┬────────┘
                      ▼
              ┌───────────────┐
              │  AGGREGATOR   │  ← one more call to upstream-A
              │  (Self-MoA    │     with the togethercomputer/MoA
              │   synthesis)  │     verbatim-synthesis prompt
              └──────┬────────┘
                     ▼
            single OpenAI response
            streams back to client
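
The flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the code in app.py: the `call` function, the `fake_call` stub, and the `AGGREGATOR_INSTRUCTION` placeholder are all assumptions made for the example, and the real proxy uses the togethercomputer/MoA prompt rather than the one-line instruction shown here.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature: (upstream_name, messages, temperature) -> completion text.
CallFn = Callable[[str, list, float], Awaitable[str]]

# Placeholder instruction; NOT the actual togethercomputer/MoA prompt.
AGGREGATOR_INSTRUCTION = "Synthesize the candidate responses below into one answer."


def build_aggregator_messages(messages: list, candidates: list) -> list:
    """Prepend a system message that lists the numbered candidates."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(candidates, start=1))
    system = {"role": "system", "content": f"{AGGREGATOR_INSTRUCTION}\n\n{numbered}"}
    return [system, *messages]


async def self_moa(messages: list, call: CallFn,
                   temp_a: float = 0.5, temp_b: float = 0.9,
                   temp_agg: float = 0.3) -> str:
    # Fan out: request both candidates concurrently.
    cand_a, cand_b = await asyncio.gather(
        call("A", messages, temp_a),
        call("B", messages, temp_b),
    )
    # Aggregate: one more call to upstream A with both candidates in context.
    agg_messages = build_aggregator_messages(messages, [cand_a, cand_b])
    return await call("A", agg_messages, temp_agg)


async def fake_call(upstream: str, messages: list, temperature: float) -> str:
    """Stand-in for an HTTP call; echoes the upstream, temperature and message count."""
    return f"{upstream}@{temperature}:{len(messages)}"


result = asyncio.run(self_moa([{"role": "user", "content": "hi"}], fake_call))
print(result)
```

Swapping `fake_call` for a real HTTP client call is the only change needed to try the same shape against live upstreams.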

Why this exists

Self-MoA papers exist. The togethercomputer/MoA reference exists. But there's no stateless, OpenAI-compatible proxy that drops into any frontend. Most implementations are notebooks or framework-coupled (LangGraph, AutoGen, etc).

moa-router is the boring proxy: stand it up between your chat and your LLM endpoints, and the lift lands in production without your frontend knowing anything changed.

Install

Docker

docker run --rm -d \
  --name moa-router \
  -p 8056:8056 \
  -e UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
  -e UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
  ghcr.io/karany97/moa-router:latest

From source

git clone https://github.com/karany97/moa-router.git
cd moa-router
pip install -r requirements.txt
UPSTREAM_A_URL=http://your-llm-a:8000/v1/chat/completions \
UPSTREAM_B_URL=http://your-llm-b:8000/v1/chat/completions \
  uvicorn app:app --host 0.0.0.0 --port 8056

Verify

# Health
curl http://localhost:8056/health
# → {"ok":true,"upstream_a":"http://...","upstream_b":"http://..."}

# Chat (Self-MoA mode — default)
curl http://localhost:8056/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "moa-router",
  "messages": [{"role":"user","content":"What is 17 * 24 + sqrt(144)?"}]
}'

# Bypass mode — single upstream only (for A/B comparison)
curl 'http://localhost:8056/v1/chat/completions?mode=single' -H ... -d '{...}'
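
The same two requests can be issued from Python with only the standard library. This is a sketch under assumptions: `BASE_URL` assumes the local deployment from the Install section, `build_chat_request` and `ask` are names invented for this example, and the response is assumed to follow the usual OpenAI shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://localhost:8056"  # assumed local deployment


def build_chat_request(prompt: str, single: bool = False) -> urllib.request.Request:
    """Build the POST request; single=True adds the ?mode=single bypass."""
    url = f"{BASE_URL}/v1/chat/completions"
    if single:
        url += "?mode=single"
    body = {
        "model": "moa-router",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, single: bool = False, timeout: float = 60.0) -> str:
    """Send the request and return the assistant text (needs a running proxy)."""
    request = build_chat_request(prompt, single)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumes an OpenAI-shaped response body.
    return payload["choices"][0]["message"]["content"]
```

Calling `ask(question)` and `ask(question, single=True)` on the same question gives a quick side-by-side of the two modes.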

Configure

| Env var | Default | What it does |
| --- | --- | --- |
| UPSTREAM_A_URL | http://upstream-a:8000/v1/chat/completions | Primary LLM endpoint (also the aggregator) |
| UPSTREAM_B_URL | http://upstream-b:8000/v1/chat/completions | Secondary LLM endpoint (provides candidate B) |
| UPSTREAM_API_KEY | (empty) | Bearer token forwarded to both upstreams |
| MOA_TEMP_A | 0.5 | Temperature for candidate A |
| MOA_TEMP_B | 0.9 | Temperature for candidate B |
| MOA_AGGREGATOR_TEMP | 0.3 | Temperature for the synthesis pass |
| MOA_TIMEOUT | 30 | Per-upstream timeout (seconds) |
| MOA_DEFAULT_MODEL | moa-router | The model name reported in responses |
| PORT | 8056 | moa-router's listen port |
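
One way such settings might be read at startup is sketched below. The variable names and defaults come from the table above; the `Settings` dataclass and `load_settings` function are hypothetical and are not taken from app.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    upstream_a_url: str
    upstream_b_url: str
    upstream_api_key: str
    temp_a: float
    temp_b: float
    aggregator_temp: float
    timeout: float
    default_model: str
    port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented env vars, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        upstream_a_url=env.get("UPSTREAM_A_URL", "http://upstream-a:8000/v1/chat/completions"),
        upstream_b_url=env.get("UPSTREAM_B_URL", "http://upstream-b:8000/v1/chat/completions"),
        upstream_api_key=env.get("UPSTREAM_API_KEY", ""),
        temp_a=float(env.get("MOA_TEMP_A", "0.5")),
        temp_b=float(env.get("MOA_TEMP_B", "0.9")),
        aggregator_temp=float(env.get("MOA_AGGREGATOR_TEMP", "0.3")),
        timeout=float(env.get("MOA_TIMEOUT", "30")),
        default_model=env.get("MOA_DEFAULT_MODEL", "moa-router"),
        port=int(env.get("PORT", "8056")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.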

Modes

| Mode | Trigger | What happens |
| --- | --- | --- |
| moa | Default | Fan to A + B, aggregate via A |
| single | ?mode=single query param | Direct passthrough to upstream A |
| best-of-2 (planned) | ?mode=bo2 | Same as moa, but the aggregator picks the better candidate instead of synthesizing |
| race (planned) | ?mode=race | First-to-respond wins, other is cancelled |

The single mode is critical for A/B benchmarking — you can hit the same endpoint with and without ?mode=single and compare quality + latency on identical inputs.
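
Mode selection from the query parameter could look roughly like the sketch below. The mode names match the table above, but `resolve_mode` is a hypothetical helper, and how the real proxy treats planned or unknown modes is an assumption here (this sketch rejects both).

```python
from typing import Optional

IMPLEMENTED_MODES = {"moa", "single"}
PLANNED_MODES = {"bo2", "race"}  # listed as planned, so rejected in this sketch


def resolve_mode(query_mode: Optional[str]) -> str:
    """Map the ?mode= query parameter to an implemented mode name."""
    if not query_mode:
        return "moa"  # no parameter means the default Self-MoA path
    mode = query_mode.strip().lower()
    if mode in IMPLEMENTED_MODES:
        return mode
    if mode in PLANNED_MODES:
        raise NotImplementedError(f"mode {mode!r} is planned but not available yet")
    raise ValueError(f"unknown mode {mode!r}")
```

Failing loudly on an unrecognized mode, rather than silently falling back to moa, keeps A/B comparisons from measuring the wrong path by accident.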

What's supported

| Surface | Status |
| --- | --- |
| POST /v1/chat/completions (non-stream) | |
| POST /v1/chat/completions (stream: aggregator streams; candidates collected first) | |
| GET /health | |
| Multi-turn conversations | |
| Tool calls (passthrough) | ✅ (no aggregation logic specific to tools yet; both upstreams must return identical tool calls or one wins) |
| Vision inputs | ✅ if both upstreams support them |
| Streaming with usage.cost | |

What's NOT supported (deliberate)

  • Auth. Put a reverse-proxy in front. Same pattern as our tooltalk repo.
  • Rate-limiting. Same.
  • Per-user model routing. moa-router knows two upstreams, period. For per-user logic, point it at LiteLLM.
  • Aggregator strategies beyond verbatim synthesis. The paper's verbatim-synthesis prompt is the default and the only first-class strategy. A best-of-N variant is on the roadmap.

Performance

| Mode | TTFT | Total latency | Quality (vs single A) |
| --- | --- | --- | --- |
| mode=single (passthrough) | upstream-A TTFT | upstream-A latency | baseline |
| mode=moa (default) | aggregator TTFT (≈ 2× single) | ≈ 2.5× single | +3.8% – +6.6% on reasoning (per the Self-MoA paper) |

The latency tax is real: you're making 3 LLM calls per user turn instead of 1. Use moa-router on the hard queries (math, multi-step reasoning, code), not on every conversational message. Most chat frontends can route on a simple signal, such as a request temperature of 0.7 or higher or an explicit "think harder" toggle.
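
A rough way to reason about the latency figures: if the two candidate calls run concurrently, as the diagram suggests, the total is the slower candidate plus the aggregator pass. The sketch below encodes that assumption. The function name and the sample timings are illustrative, not measurements.

```python
def estimate_moa_latency(t_a: float, t_b: float, t_agg: float) -> float:
    """Slower of the two concurrent candidate calls, plus the aggregator call."""
    return max(t_a, t_b) + t_agg


# Illustrative timings in seconds (assumed, not measured):
# upstream A alone takes 2.0 s, upstream B takes 2.5 s, and the aggregator
# takes 2.5 s because its prompt also carries both candidates.
single_latency = 2.0
moa_latency = estimate_moa_latency(t_a=2.0, t_b=2.5, t_agg=2.5)
slowdown = moa_latency / single_latency
print(moa_latency, slowdown)
```

With these numbers the moa path takes 5.0 s against 2.0 s for a single call, a 2.5× slowdown. Time to first token follows the same logic, since the aggregator cannot start until both candidates have been collected.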

When to use moa-router

✅ Good fit:

  • You have two LLM endpoints (could be the same model behind two ports)
  • You want measurable quality lift on reasoning queries
  • You can tolerate 2-3× latency on the queries you route through it
  • You want the lift without rewriting your chat client

❌ Bad fit:

  • You only have one LLM endpoint
  • Your queries are all conversational (no reasoning/math/code burden — the lift mostly disappears)
  • Your latency budget is < 2× the single-call baseline

Tests

pip install -r requirements-dev.txt
pytest tests/        # 16 tests, ~0.3 s using recorded LLM fixtures

Compose with our other tools

| You're using | Plug moa-router in like |
| --- | --- |
| tooltalk + Gemma 4 | tooltalk → moa-router → 2× tooltalks → 2× Gemma 4 ports |
| Destiny Atelier | Settings → Connection → set baseUrl to http://moa-router:8056/v1 |
| Open WebUI / LibreChat / Lobe Chat | OpenAI endpoint = http://moa-router:8056/v1 |
| LiteLLM | Add a model entry pointing api_base: http://moa-router:8056/v1 |

License

MIT. Fork it, sell it, ship it.
