Olla is a high-performance proxy and load balancer for LLM infrastructure, written in Go. It intelligently routes requests across local and remote inference nodes (Ollama, LM Studio, LiteLLM, vLLM, SGLang, llama.cpp, Lemonade, Anthropic, and OpenAI-compatible endpoints).
The project provides two proxy engines: Sherpa (simple, maintainable) and Olla (high-performance with advanced features).
Full documentation available at: https://thushan.github.io/olla/
make ready # Run before commit (test-short + test-race + fmt + vet + lint + align)
make ready-tools # Check code with tools only (fmt + vet + lint + align)
make test # Run all tests
make test-race # Run tests with race detection
make test-stress # Run comprehensive stress tests
make bench # Run all benchmarks
make bench-balancer # Run balancer benchmarks
make build # Build optimised binary with version info
make build-local # Build binary to ./build/ (fast, for testing)
make run # Run with version info
make run-debug # Run with debug logging
make docker-build # Build Docker image with goreleaser (requires goreleaser)
make docker-build-local # Build Docker image locally for amd64 (no goreleaser required)
make docker-build-local-arm64 # Build Docker image locally for ARM64 (no goreleaser required)
make docker-run # Run Docker image with local config (amd64)
make ci # Run full CI pipeline locally
make help # Show all available targets
olla/
├── main.go # Entry point, initialises services
├── config.yaml # Default configuration
├── config/
│ ├── profiles/ # Provider-specific profiles
│ │ ├── ollama.yaml # Ollama configuration
│ │ ├── llamacpp.yaml # llama.cpp configuration
│ │ ├── lmstudio.yaml # LM Studio configuration
│ │ ├── lemonade.yaml # Lemonade SDK configuration
│ │ ├── litellm.yaml # LiteLLM gateway configuration
│ │ ├── vllm.yaml # vLLM configuration
│ │ ├── sglang.yaml # SGLang configuration
│ │ ├── openai-compatible.yaml # OpenAI-compatible generic profile
│ │ └── openai.yaml # OpenAI profile
│ ├── models.yaml # Model configurations
│ └── config.local.yaml # Local configuration overrides (user-specific, not committed to git)
├── internal/
│ ├── core/ # Domain layer (business logic)
│ │ ├── domain/ # Core entities
│ │ ├── ports/ # Interface definitions
│ │ └── constants/ # Application constants
│ ├── adapter/ # Infrastructure layer
│ │ ├── balancer/ # Load balancing strategies
│ │ ├── converter/ # Model format converters
│ │ ├── discovery/ # Service discovery
│ │ ├── factory/ # Factory patterns
│ │ ├── filter/ # Request/response filtering
│ │ ├── health/ # Health checking & circuit breakers
│ │ ├── inspector/ # Request inspection
│ │ ├── metrics/ # Metrics collection
│ │ ├── proxy/ # Proxy implementations
│ │ │ ├── sherpa/ # Simple, maintainable proxy
│ │ │ ├── olla/ # High-performance proxy
│ │ │ └── core/ # Shared proxy components
│ │ ├── registry/ # Model & profile registries
│ │ ├── security/ # Security features (rate/size limits)
│ │ ├── stats/ # Statistics collection
│ │ ├── translator/ # API translation layer (OpenAI ↔ Provider)
│ │ └── unifier/ # Model unification
│ ├── app/ # Application layer
│ │ ├── handlers/ # HTTP handlers
│ │ │ ├── server.go # HTTP server setup
│ │ │ ├── server_routes.go # Route registration
│ │ │ ├── handler_proxy.go # Main proxy handler
│ │ │ ├── handler_provider_*.go # Provider-specific handlers
│ │ │ ├── handler_translation.go # Translation handler
│ │ │ ├── handler_status*.go # Status endpoints
│ │ │ ├── handler_health.go # Health endpoints
│ │ │ └── handler_version.go # Version information
│ │ ├── middleware/ # HTTP middleware
│ │ └── services/ # Application services
│ ├── config/ # Configuration management
│ ├── env/ # Environment handling
│ ├── integration/ # Integration tests
│ ├── logger/ # Logging framework
│ ├── router/ # Routing logic
│ ├── util/ # Utilities
│ └── version/ # Version management
├── pkg/ # Reusable packages
│ ├── container/ # Dependency injection
│ ├── eventbus/ # Event bus (pub/sub)
│ ├── format/ # Formatting utilities
│ ├── nerdstats/ # Process statistics
│ ├── pool/ # Object pooling
│ └── profiler/ # Profiling support
└── test/
└── scripts/ # Test scripts
├── auth/ # Authentication tests
├── cases/ # Test cases
├── load/ # Load testing
├── logic/ # Logic & routing tests
├── platform/ # Platform-specific tests
├── security/ # Security tests
└── streaming/ # Streaming tests
- `main.go` - Application entry point
- `config.yaml` - Main configuration
- `internal/app/handlers/server_routes.go` - Route registration & API setup
- `internal/app/handlers/handler_proxy.go` - Request routing logic
- `internal/app/handlers/handler_translation.go` - Translation handler with passthrough logic
- `internal/adapter/proxy/sherpa/service.go` - Sherpa proxy implementation
- `internal/adapter/proxy/olla/service.go` - Olla proxy implementation
- `internal/adapter/translator/` - API translation layer (OpenAI ↔ Provider formats)
- `internal/adapter/translator/types.go` - PassthroughCapable interface and translator types
- `internal/adapter/translator/anthropic/` - Anthropic translator implementation
- `internal/adapter/stats/translator_collector.go` - Translator metrics collector
- `internal/adapter/balancer/sticky.go` - Sticky session wrapper
- `internal/app/handlers/handler_stats_sticky.go` - Sticky session stats endpoint
- `internal/core/constants/translator.go` - TranslatorMode and FallbackReason constants
- `internal/core/ports/stats.go` - StatsCollector interface with translator tracking
- `internal/core/domain/profile_config.go` - AnthropicSupportConfig for backend profiles
- `config/profiles/*.yaml` - Backend profiles with `anthropic_support` sections
- `internal/version/version.go` - Version information embedded at build time
- `/test/scripts/logic/test-model-routing.sh` - Test routing & headers
- `/internal/health` - Health check endpoint
- `/internal/status` - Endpoint status
- `/internal/status/endpoints` - Endpoints status details
- `/internal/status/models` - Models status details
- `/internal/stats/models` - Model statistics
- `/internal/stats/translators` - Translator statistics
- `/internal/stats/sticky` - Sticky session statistics (returns `{"enabled":false}` when disabled)
- `/internal/process` - Process statistics
- `/version` - Version information
- `/olla/models` - Unified models listing with filtering
- `/olla/models/{id}` - Get unified model by ID or alias
- `/olla/proxy/` - Olla API proxy endpoint (POST)
- `/olla/proxy/v1/models` - OpenAI-compatible models listing (GET)
Dynamically registered based on configured translators (e.g., Anthropic Messages API)
- `/olla/anthropic/v1/messages` - Anthropic Messages API (POST); supports passthrough and translation modes
- `/olla/anthropic/v1/models` - List models in Anthropic format (GET)
- `/olla/anthropic/v1/messages/count_tokens` - Token count estimation (POST)
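The repository's translator types live in `internal/adapter/translator/types.go`; the sketch below illustrates how a passthrough check like the one described above could work. The interface shape, method name, and backend-type strings beyond those listed in this document are assumptions, not the actual Olla API.

```go
package main

import "fmt"

// Translator converts between the Anthropic Messages API and a
// backend's native format. Illustrative shape only; the real
// definitions are in internal/adapter/translator/types.go.
type Translator interface {
	Name() string
}

// PassthroughCapable marks translators whose backends accept the
// Anthropic format natively, so the proxy can skip translation.
// (Method name is an assumption for this sketch.)
type PassthroughCapable interface {
	CanPassthrough(backendType string) bool
}

type anthropicTranslator struct{}

func (anthropicTranslator) Name() string { return "anthropic" }

// Backends this document lists as natively supporting the
// Anthropic Messages API.
func (anthropicTranslator) CanPassthrough(backendType string) bool {
	switch backendType {
	case "vllm", "llamacpp", "lm-studio", "ollama":
		return true
	}
	return false
}

func main() {
	var t Translator = anthropicTranslator{}
	// A type assertion decides per-request whether to bypass translation.
	if pc, ok := t.(PassthroughCapable); ok && pc.CanPassthrough("vllm") {
		fmt.Println("passthrough")
	} else {
		fmt.Println("translate")
	}
}
```

The type-assertion pattern keeps passthrough optional: translators that never passthrough simply don't implement the second interface.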
- `X-Olla-Endpoint`: Backend name
- `X-Olla-Model`: Model used
- `X-Olla-Backend-Type`: ollama/openai/openai-compatible/lm-studio/vllm/sglang/llamacpp/lemonade
- `X-Olla-Request-ID`: Request ID
- `X-Olla-Response-Time`: Total processing time
- `X-Olla-Mode`: Translator mode used (`passthrough`, or absent for translation); set on Anthropic translator requests
- `X-Olla-Routing-Strategy`: Routing strategy used (when model routing is active)
- `X-Olla-Routing-Decision`: Routing decision made (routed/fallback/rejected)
- `X-Olla-Routing-Reason`: Human-readable reason for routing decision
- `X-Olla-Sticky-Session`: Sticky session status (hit/miss/repin/disabled)
- `X-Olla-Sticky-Key-Source`: Key source used (session_header/prefix_hash/auth_header/ip/none)
- `X-Olla-Session-ID`: Echoed session ID when client supplies one
- Unit Tests: Components in isolation
- Integration Tests: Full request flow through proxy engines
- Benchmark Tests: Performance comparison (balancers, proxy engines, repositories)
- Security Tests: Rate limiting and size restrictions (see /test/scripts/security/)
- Stress Tests: Comprehensive testing under load
- Script Tests: End-to-end scenarios in /test/scripts/
# Core test commands
make test # Run all tests
make test-race # Run with race detection
make test-stress # Run stress tests
make test-cover-html # Generate coverage HTML report
# Benchmark commands
make bench # Run all benchmarks
make bench-balancer # Run balancer benchmarks
make bench-repo # Run repository benchmarks
# Specific test patterns
go test -v ./internal/adapter/proxy -run TestAllProxies
go test -v ./internal/adapter/proxy -run TestSherpa
go test -v ./internal/adapter/proxy -run TestOlla

Always run `make ready` before committing changes.
- Domain Layer (`internal/core`): Business logic, entities, and interfaces
- Infrastructure Layer (`internal/adapter`): Implementations (proxies, balancers, registries)
- Application Layer (`internal/app`): HTTP handlers, middleware, and services
- Translator Layer: Enables API format translation (e.g., OpenAI ↔ Anthropic) with passthrough optimisation for backends with native support
- Passthrough Mode: When a backend natively supports the Anthropic Messages API (vLLM, llama.cpp, LM Studio, Ollama), requests bypass translation entirely
- Translator Metrics: Thread-safe per-translator statistics tracking passthrough/translation rates, fallback reasons, latency, and streaming breakdown (`internal/adapter/stats/translator_collector.go`)
- Sticky Sessions: Optional decorator on the endpoint selector that pins multi-turn LLM conversations to the backend that handled the first turn, maximising KV-cache reuse. FNV-64a hashed keys, TTL + LRU bounded, purged on routable→non-routable health transitions (`internal/adapter/balancer/sticky.go`)
- Proxy Engines: Choose Sherpa (simple) or Olla (high-performance)
- Load Balancing: Priority-based recommended for production
- Version Management: Build-time version injection via `internal/version`
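The sticky-session notes above mention FNV-64a hashed keys and several key sources. As a rough sketch of that idea (the real derivation lives in `internal/adapter/balancer/sticky.go`; the function below and its fallback order are assumptions), a session key might be derived like this:

```go
package main

import (
	"fmt"
	"hash/fnv" // stdlib FNV-64a, matching the hash named in the notes
)

// stickyKey derives the hash a sticky-session decorator could pin a
// conversation on. Key-source names mirror the documented
// X-Olla-Sticky-Key-Source values: a client-supplied session header
// when present, otherwise the client IP. (Illustrative sketch only.)
func stickyKey(sessionID, clientIP string) (key uint64, source string) {
	h := fnv.New64a()
	if sessionID != "" {
		h.Write([]byte(sessionID))
		return h.Sum64(), "session_header"
	}
	h.Write([]byte(clientIP))
	return h.Sum64(), "ip"
}

func main() {
	k1, src := stickyKey("sess-42", "10.0.0.5")
	k2, _ := stickyKey("sess-42", "10.0.0.9") // same session, different IP
	// The session header takes precedence, so both turns map to the
	// same key and hence the same pinned backend.
	fmt.Println(src, k1 == k2)
}
```

Pinning on a stable hash rather than the raw key keeps the TTL + LRU table compact regardless of session-ID length.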
- Go 1.24+
- Australian English for comments and documentation
- Comment on why rather than what
- Always run `make ready` before committing
- Use `make help` to see all available commands
"github.com/docker/go-units" // Human-readable sizes
"github.com/json-iterator/go" // High-performance JSON encoding/decoding
"github.com/puzpuzpuz/xsync/v4" // Concurrent maps/counters
"github.com/tidwall/gjson" // Fast JSON parsing
"github.com/jellydator/ttlcache" // Time-to-live cache
"golang.org/x/sync" // errgroup
"golang.org/x/time" // rate limitingDo not add additional dependencies unless explicitly asked.
CRITICAL: Always delegate tasks to the appropriate subagent. Do NOT perform work directly in the main context.
- Code Review → Use the appropriate language subagent (e.g. Go Architect) or a reviewer subagent
- Code changes → Use the appropriate language subagent (e.g. Go Architect) or an implementer subagent
- Research/exploration → Use the explore subagent
- Testing → Use the test subagent
Only use the main context for orchestration and task decomposition.