A high-performance Rust inference engine for transformer models, built on Candle. Features optimized attention kernels, prefix caching, continuous batching with paged KV memory, and flexible quantization support.
- Fast Inference: Optimized CPU flash attention and GPU flash attention kernels; BF16 on Metal/CUDA, F32 on CPU
- Prefix Caching: Automatic KV cache reuse for common prompt prefixes, with per-request paged cache support in batch mode
- Continuous Batching: Process multiple concurrent requests together via batched prefill and decode; up to max_batch_size requests in flight at once
- Chunked Prefill: Split long prompts into fixed-size chunks processed across scheduler iterations, bounding peak attention-matrix memory and letting decode requests interleave with in-progress prefills
- Paged KV Cache: Block-based memory allocator (BlockAllocator) isolates KV state per request, enabling true multi-request concurrency without interference
- Scheduling Policies: FCFS (default), Priority, and Shortest-Job-First to control request ordering and reduce average latency
- Quantization: Q4_K_M, Q5_K_M, Q8_0, and other GGUF quantization formats
- Guided Generation: Grammar-constrained decoding via llguidance; JSON schema, Lark, and regex grammars with ~50 µs per-token overhead
- Thinking Mode: Enable chain-of-thought reasoning
- Detailed Statistics: Per-request prefill/decode timing, cache hit rate, and tokens-per-second reporting
- Tool Calling: Define tools that the model can call
| Family | HuggingFace ID examples | Quantization | Thinking | Tools |
|---|---|---|---|---|
| Qwen3 | Qwen/Qwen3-0.6B, Qwen/Qwen3-4B, Qwen/Qwen3-8B | SafeTensors (BF16) or GGUF (Q4_K_M, Q8_0, …) | ✅ --thinking flag injects /think tag | ✅ <tool_call> blocks |
| Granite 4.1 | ibm-granite/granite-4.1-3b | SafeTensors (BF16) or GGUF (Q4_K_M, Q8_0, …) | --thinking flag ignored | ✅ <tool_call> blocks via chat template |
Auto-detection reads model_type from config.json ("qwen3", "granite"), so passing the HuggingFace repo ID is enough — no extra flags required.
cargo run -- --model-id 'Qwen3-0.6B' --sample-len 200

cargo run -- --model-id 'Qwen3-0.6B' --sample-len 200 --thinking

cargo run -- --model-id 'Qwen3-0.6B' --sample-len 200 \
--quantization Q4_K_M \
--temperature 0.6 \
--top-p 0.95 \
--top-k 20

Prefix caching skips redundant computation for requests that share a common prompt prefix (e.g., a system prompt) by restoring previously computed KV states instead of re-running the forward pass over the shared tokens.
cargo run -- --model-id 'Qwen3-0.6B' --prefix-cache \
--cache-max-entries 100 \
--cache-max-tokens 2048

For prompts with ~50% prefix overlap:
- Prefill time: ~50% reduction
- Time to first token: ~50% faster
- Memory overhead: ~1–2% for cache metadata
In batch mode the prefix cache integrates with paged KV caches so each request independently restores its cached prefix without interfering with other in-flight requests.
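The lookup described above can be pictured as a longest-prefix match over token IDs. The sketch below is a toy model under stated assumptions, not the engine's implementation: `PrefixCache`, `insert`, and `longest_match` are hypothetical names, and the KV state is stood in for by an opaque string handle.

```python
from typing import Dict, List, Optional, Tuple


class PrefixCache:
    """Toy prefix cache (hypothetical): maps a token-ID prefix to an opaque KV handle."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[int, ...], str] = {}

    def insert(self, tokens: List[int], kv_handle: str) -> None:
        self._entries[tuple(tokens)] = kv_handle

    def longest_match(self, prompt: List[int]) -> Tuple[int, Optional[str]]:
        """Return (matched_len, kv_handle) for the longest cached prefix of prompt."""
        for end in range(len(prompt), 0, -1):
            handle = self._entries.get(tuple(prompt[:end]))
            if handle is not None:
                return end, handle
        return 0, None


cache = PrefixCache()
system_prompt = [101, 7, 8, 9]             # tokens shared by many requests
cache.insert(system_prompt, "kv-system")

prompt = system_prompt + [42, 43, 44, 45]  # half of this prompt is already cached
matched, handle = cache.longest_match(prompt)
tokens_to_prefill = prompt[matched:]       # only the unique suffix is recomputed
print(matched, handle, tokens_to_prefill)
```

With half of the prompt cached, only the remaining half goes through prefill, which is where the roughly 50% figures above come from.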
Continuous batching processes multiple concurrent requests in a single forward pass rather than one at a time, significantly increasing throughput on multi-request workloads.
cargo run -- --model-id 'Qwen3-0.6B' --enable-batching \
--max-batch-size 16 \
--max-prefill-batch 8 \
--max-decode-batch 16 \
--scheduling-policy fcfs

| Policy | Flag | Behavior |
|---|---|---|
| FCFS | fcfs | Admit requests in arrival order (default) |
| Priority | priority | Use caller-supplied priority field |
| SJF | shortest_job_first | Shorter prompts run first; reduces average latency |
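As a rough illustration of how the three policies differ, the sketch below orders the same queue under each one. It is not the scheduler's code: the `Request` fields and `order_requests` are hypothetical, and it assumes that a higher priority value runs first and that ties fall back to arrival order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    arrival: int      # arrival order (lower = earlier)
    priority: int     # caller-supplied; assumed here that higher runs first
    prompt_len: int   # prompt length in tokens


def order_requests(queue: List[Request], policy: str) -> List[Request]:
    """Return the queue in the order the given policy would admit it."""
    if policy == "fcfs":
        return sorted(queue, key=lambda r: r.arrival)
    if policy == "priority":
        return sorted(queue, key=lambda r: (-r.priority, r.arrival))
    if policy == "shortest_job_first":
        return sorted(queue, key=lambda r: (r.prompt_len, r.arrival))
    raise ValueError(f"unknown policy: {policy}")


queue = [
    Request("a", arrival=0, priority=1, prompt_len=900),
    Request("b", arrival=1, priority=5, prompt_len=300),
    Request("c", arrival=2, priority=3, prompt_len=20),
]
for policy in ("fcfs", "priority", "shortest_job_first"):
    print(policy, [r.request_id for r in order_requests(queue, policy)])
```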
Each request in a batch receives its own isolated KV cache backed by a shared BlockAllocator. Blocks are allocated on demand and freed when the request completes, so long-running requests don't starve shorter ones of memory.
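The on-demand allocation described above can be sketched with a fixed pool of block IDs. This is a simplified model, not the real BlockAllocator: the class name `ToyBlockAllocator`, its methods, and the block size of 16 tokens are all assumptions made for illustration.

```python
from typing import Dict, List


class ToyBlockAllocator:
    """Fixed pool of KV blocks handed out on demand, tracked per request (hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                 # tokens per block
        self._free: List[int] = list(range(num_blocks))
        self._owned: Dict[str, List[int]] = {}       # request id -> block ids

    def free_blocks(self) -> int:
        return len(self._free)

    def ensure_capacity(self, request_id: str, num_tokens: int) -> bool:
        """Grow the request's block table to hold num_tokens; False if the pool is exhausted."""
        owned = self._owned.setdefault(request_id, [])
        needed = -(-num_tokens // self.block_size) - len(owned)  # ceiling division
        if needed > len(self._free):
            return False
        for _ in range(max(needed, 0)):
            owned.append(self._free.pop())
        return True

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self._free.extend(self._owned.pop(request_id, []))


alloc = ToyBlockAllocator(num_blocks=8, block_size=16)
alloc.ensure_capacity("req-1", 40)   # 40 tokens -> 3 blocks
alloc.ensure_capacity("req-2", 16)   # 16 tokens -> 1 block
after_alloc = alloc.free_blocks()
alloc.ensure_capacity("req-1", 49)   # decoding past 48 tokens needs a 4th block
after_growth = alloc.free_blocks()
alloc.release("req-1")               # request finished: its blocks return to the pool
after_release = alloc.free_blocks()
print(after_alloc, after_growth, after_release)
```

Because each request only ever touches the blocks listed in its own table, one request growing or finishing does not disturb the KV state of another.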
Batch processing amortizes the fixed cost of a forward pass across multiple requests:
- Throughput: Near-linear scaling with batch size up to hardware memory limits
- Prefill + prefix cache: Cached prefix tokens are skipped per-request even inside a batch — requests sharing a system prompt pay the prefill cost only for their unique suffix
- Decode: All in-flight requests advance by one token per forward pass
Without chunked prefill, a request with a long prompt monopolises the GPU for the entire duration of its prefill forward pass. Every decode-phase request in the same batch stalls and waits, increasing their time-to-next-token proportionally to the long prompt's length.
Chunked prefill breaks the prompt into slices of at most chunk_size tokens. Each scheduler iteration processes one chunk, then immediately runs the pending decode batch. The stall seen by decode requests is therefore bounded by the cost of one chunk rather than the full prompt, and the peak attention-score matrix per iteration shrinks from O(prompt²) to O(chunk × prompt), since each chunk's queries still attend to all previously cached keys.
Chunked prefill is a batching-mode feature — paged KV caches are required to hold intermediate KV state between chunks.
--prefill-chunk-size=512 # process at most 512 prompt tokens per scheduler iteration

Setting prefill-chunk-size larger than any prompt in the workload degrades gracefully to single-shot prefill with no behavioural change.
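To make the slicing concrete, the sketch below computes the chunk boundaries for a prompt. The helper name `prefill_chunks` is hypothetical; the only thing it borrows from the text is the idea that every chunk starts at a KV offset and covers at most the configured chunk size.

```python
from typing import List, Tuple


def prefill_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (kv_start_pos, num_tokens) for each prefill chunk of a prompt."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(chunk_size, prompt_len - start))
        for start in range(0, prompt_len, chunk_size)
    ]


long_prompt = prefill_chunks(1200, 512)   # three scheduler iterations
short_prompt = prefill_chunks(300, 512)   # fits in one chunk: single-shot prefill
print(long_prompt)
print(short_prompt)
```

A 1200-token prompt with a chunk size of 512 is processed as chunks of 512, 512, and 176 tokens, while a 300-token prompt yields a single chunk, which is the graceful fallback to single-shot prefill mentioned above.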
| Chunk size | Effect |
|---|---|
| Small (≤ 64) | Very low per-iteration prefill cost; more scheduler iterations needed per prompt; best for latency-sensitive workloads with many concurrent decode requests |
| Medium (128–512) | Balanced trade-off; recommended starting point |
| Large (≥ 1024) | Similar to single-shot; useful mainly to cap memory spikes on extremely long prompts |
- Prefix cache: The first chunk (kv_start_pos = 0) checks the prefix cache normally. Subsequent chunks skip the cache lookup and write new KV entries at their respective offsets.
- Scheduling policy: All three policies (FCFS, Priority, SJF) work unchanged; chunked prefill only affects how many tokens of each selected request are processed per iteration.
- Token budget: max_prefill_tokens is enforced against the chunk size, not the full prompt length, so oversized prompts no longer bypass the token budget guard.
Guided generation constrains the output to tokens that are valid under a grammar, guaranteeing syntactically correct structured output without relying on prompt engineering.
| Use case | Grammar type |
|---|---|
| JSON output parsed by the caller | JSON schema |
| Enum / classification | Regex or Lark |
| Tool calls / agent protocols | JSON schema |
| Domain-specific syntax (SQL, config) | Lark |
Free-text generation, open-ended responses, and any case where the model reliably follows format instructions without enforcement are better served without guidance.
Build a ParserFactory once (expensive; tied to tokenizer vocabulary) and share it across requests. Requests without a grammar run unconstrained regardless of whether a factory is configured.
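The core idea behind grammar-constrained decoding is a per-step token mask: tokens that would make the output invalid have their logits set to negative infinity before sampling. The toy below illustrates that idea for an enum-style constraint with a six-token vocabulary. It does not use llguidance or the ParserFactory API; every name in it is hypothetical.

```python
import math
from typing import List, Sequence


def allowed_token_ids(generated: str, vocab: Sequence[str], choices: Sequence[str]) -> List[int]:
    """Token IDs whose text keeps the output a prefix of at least one allowed choice."""
    return [
        i for i, tok in enumerate(vocab)
        if tok and any(c.startswith(generated + tok) for c in choices)
    ]


def constrained_greedy(logits_per_step, vocab, choices) -> str:
    """Greedy decoding where disallowed tokens are masked to -inf at every step."""
    out = ""
    for logits in logits_per_step:
        if out in choices:
            break                          # a complete choice has been produced
        allowed = set(allowed_token_ids(out, vocab, choices))
        if not allowed:
            break                          # no token can extend the output validly
        masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
        best = max(range(len(vocab)), key=lambda i: masked[i])
        out += vocab[best]
    return out


vocab = ["pos", "neg", "itive", "ative", "maybe", "neutral"]
choices = ["positive", "negative", "neutral"]
logits_per_step = [
    [0.1, 0.3, 0.0, 0.0, 2.0, 0.2],   # unconstrained, the model would pick "maybe"
    [0.0, 0.0, 1.5, 1.0, 0.5, 0.0],   # unconstrained, the model would pick "itive"
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # never used: the output is already complete
]
result = constrained_greedy(logits_per_step, vocab, choices)
print(result)
```

The mask only removes options; among the tokens that remain valid, the model's own preferences still decide the outcome.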
Tool calling lets the model invoke external functions by emitting structured <tool_call> blocks in its output. The pipeline injects the tool schema into the prompt automatically and parses any calls from the generated text, returning them in a response.
- You define tools with a name, description, and JSON Schema parameters.
- Attach them to a GenerationRequest via .with_tools(…).
- The pipeline formats the raw user message using the model's chat template with tools injected in <tools> XML tags (Qwen3 format).
- After generation, the pipeline scans the output for <tool_call>…</tool_call> blocks and deserialises each one into a ToolCall.
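The final parsing step can be approximated in a few lines. This sketch is not the pipeline's parser: `parse_tool_calls` is a hypothetical helper, and it assumes each block contains a JSON object with `name` and `arguments` keys, as in the Qwen3 tool-call format.

```python
import json
import re
from typing import Any, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract and JSON-decode every <tool_call> block; malformed blocks are skipped."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls


output = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"location": "Tokyo"}}\n</tool_call>'
)
calls = parse_tool_calls(output)
print(calls)
```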
{
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City and country, e.g. Paris, France"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}

When tools are present, the pipeline formats the prompt internally, so you do not need to pre-format it yourself. Pass the raw user message as the prompt field.
Set --enable-thinking on the call to enable chain-of-thought reasoning alongside tool calling (supported on Qwen3 via the /think tag).
Tool calling and grammar-constrained generation can be used together. Attach both .with_tools(…) and .with_grammar(…) to the same request to restrict which tokens the model can emit while still parsing tool calls from the output.
Tur ships a web server (src/bin/server.rs) that exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI Chat Completions protocol — curl, the openai Python SDK, LangChain, LlamaIndex, etc. — works without modification.
# Minimal — load a model from HuggingFace and listen on 127.0.0.1:8080
cargo run --bin server -- --model-id Qwen/Qwen3-0.6B
# Quantized model, custom host/port, and larger batch size
cargo run --bin server -- \
--model-id Qwen/Qwen3-0.6B \
--quantization Q4_K_M \
--host 0.0.0.0 \
--port 8080 \
--max-batch-size 16

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
]
}'

Response:
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1748000000,
"model": "Qwen/Qwen3-0.6B",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "The capital of France is Paris."},
"finish_reason": "stop"
}],
"usage": {"prompt_tokens": 0, "completion_tokens": 9, "total_tokens": 9}
}

Add "stream": true to receive tokens as Server-Sent Events as they are generated:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
--no-buffer \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Count to five."}],
"stream": true
}'

Each SSE event carries a chat.completion.chunk object. The stream ends with data: [DONE].
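On the client side, consuming the stream amounts to reading `data:` lines until the `[DONE]` sentinel. The sketch below parses a hand-written sample rather than a live response. It assumes the chunks follow the OpenAI layout, with text under `choices[].delta.content`; `iter_stream_content` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                       # blank separators, comments, other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece


# Hand-written sample in the assumed chunk format (not captured from a real server).
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "One, "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "two"}}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
text = "".join(iter_stream_content(sample))
print(text)
```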
Define tools in your request; the server injects the schema into the prompt and parses <tool_call> blocks from the model output:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "City name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
},
"required": ["location"]
}
}
}]
}'

When the model decides to call a tool, the response looks like:
{
"choices": [{
"message": {
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"Tokyo\", \"unit\": \"celsius\"}"
}
}]
},
"finish_reason": "tool_calls"
}]
}

Pass "thinking": true to enable chain-of-thought reasoning before the final answer:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Solve: if 2x + 3 = 11, what is x?"}],
"thinking": true
}'

The server uses continuous batching: instead of processing requests one at a time, every incoming request is submitted to a shared ContinuousBatchScheduler. Each scheduler iteration:
- Admits any newly queued requests (memory-permitting).
- Prefills a batch of admitted requests in a single GPU forward pass.
- Decodes all in-progress requests together, advancing every sequence by one token per iteration.
- Streams each new token to its originating HTTP response immediately.
- Completes requests that hit EOS or max_tokens, freeing their paged KV cache blocks.
This means N concurrent clients each get tokens interleaved in real time rather than waiting for the previous client to finish. Throughput scales near-linearly with batch size up to the VRAM limit.
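The iteration steps above can be simulated with a small loop. This is a toy model, not ContinuousBatchScheduler: it admits by free batch slots instead of memory, treats prefill as a flag, and ends a sequence only at its token limit. `Seq` and `run_scheduler` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class Seq:
    request_id: str
    max_tokens: int
    generated: int = 0
    prefilled: bool = False


def run_scheduler(pending: List[Seq], max_batch_size: int) -> Dict[str, int]:
    """Simulate scheduler iterations; return the iteration at which each request finished."""
    queue: Deque[Seq] = deque(pending)
    running: List[Seq] = []
    finished: Dict[str, int] = {}
    iteration = 0
    while queue or running:
        iteration += 1
        # Admit queued requests while there is room in the batch.
        while queue and len(running) < max_batch_size:
            running.append(queue.popleft())
        # Prefill newly admitted requests (one batched pass in a real engine).
        for seq in running:
            seq.prefilled = True
        # Decode: every in-flight sequence advances by one token.
        for seq in running:
            seq.generated += 1
        # Complete sequences that hit their limit, freeing their slot.
        for seq in [s for s in running if s.generated >= s.max_tokens]:
            finished[seq.request_id] = iteration
            running.remove(seq)
    return finished


finished = run_scheduler(
    [Seq("a", max_tokens=3), Seq("b", max_tokens=1), Seq("c", max_tokens=2)],
    max_batch_size=2,
)
print(finished)
```

In this run, request c takes over the slot freed by b after the first iteration and finishes alongside a at iteration 3. Processing the same three requests strictly one after another would take 3 + 1 + 2 = 6 iterations.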
Firing parallel requests with curl:
# Launch four simultaneous requests in the background
for i in 1 2 3 4; do
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"Qwen/Qwen3-0.6B\",
\"messages\": [{\"role\": \"user\", \"content\": \"Tell me a fact number $i\"}]
}" &
done
wait

Streaming four responses in parallel:
for i in 1 2 3 4; do
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
--no-buffer \
-d "{
\"model\": \"Qwen/Qwen3-0.6B\",
\"messages\": [{\"role\": \"user\", \"content\": \"Fact $i\"}],
\"stream\": true
}" > /tmp/stream_$i.txt &
done
wait
grep -h "content" /tmp/stream_*.txt | head -20

Tuning --max-batch-size:
| Workload | Recommendation |
|---|---|
| Single user / interactive | 4 — low latency, minimal overhead |
| Multi-user API server | 8–16 — default; good throughput/latency balance |
| Batch inference / offline | 32+ — maximise GPU utilisation; needs more VRAM |
The scheduler's memory-based admission control prevents OOM: requests that would exceed available KV cache memory are queued until earlier requests complete and free their blocks.
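One way to picture memory-based admission is a check of each queued request's block requirement against the free pool. The sketch below is an assumption-laden simplification: it reserves a worst-case token count per request up front, whereas the engine allocates blocks on demand, and `admit` and `blocks_needed` are hypothetical helpers.

```python
from collections import deque
from typing import Deque, List, Tuple


def blocks_needed(num_tokens: int, block_size: int) -> int:
    return -(-num_tokens // block_size)    # ceiling division


def admit(queue: Deque[Tuple[str, int]], free_blocks: int, block_size: int) -> Tuple[List[str], int]:
    """Admit requests from the front of the queue while their KV blocks fit.

    Each queue item is (request_id, worst_case_tokens). Returns the admitted
    request IDs and the number of blocks still free.
    """
    admitted: List[str] = []
    while queue:
        request_id, tokens = queue[0]
        need = blocks_needed(tokens, block_size)
        if need > free_blocks:
            break                          # head of the queue waits for blocks to be freed
        queue.popleft()
        free_blocks -= need
        admitted.append(request_id)
    return admitted, free_blocks


queue = deque([("a", 100), ("b", 300), ("c", 50)])
first_round, free_after_first = admit(queue, free_blocks=20, block_size=16)
# Request "a" completes and returns its 7 blocks to the pool.
second_round, free_after_second = admit(queue, free_blocks=free_after_first + 7, block_size=16)
print(first_round, free_after_first, second_round, free_after_second)
```

Request b needs 19 blocks and only 13 are free after a is admitted, so it waits in the queue instead of triggering an out-of-memory failure, and is admitted once a's blocks are released.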
--temperature 0.7 # Randomness (0.0 = greedy, 1.0 = creative)
--top-p 0.9 # Nucleus sampling threshold
--top-k 50 # Top-k sampling limit
--repeat-penalty 1.1 # Penalize token repetition
--seed 42 # Random seed for reproducibility

--model-id Qwen3-0.6B # Download from HuggingFace
--weight-path /local/path/to/model # Load from local directory

Tur achieves competitive performance through:
- Optimized Attention: Custom CPU flash attention and GPU flash attention (Metal/CUDA)
- Prefix Caching: KV state reuse for shared prefixes, integrated with paged memory
- Continuous Batching: Amortized forward-pass cost across concurrent requests
- Paged Memory: Block allocator eliminates memory fragmentation in multi-request workloads
- Quantization: 4-bit and 8-bit GGUF models reduce memory footprint and improve cache utilization
cargo bench

Available benchmark groups:

| Group | What it measures |
|---|---|
| prefill | Prompt encoding + first forward pass latency |
| decode | Steady-state token generation throughput |
| full_pipeline | Cold-start model load + prefill + decode |
| prefix_cache | Prefill with vs. without prefix cache on repeated prefixes |
| prefix_cache_lengths | Cache hit speedup across varying prefix lengths |
| batch_prefill | Batched prefill throughput (batch sizes 1/2/4), no-cache vs. with paged prefix cache |
| chunked_prefill | Single-shot vs. chunked prefill (chunk sizes 16/32/64) for each prompt length; shows total overhead and per-chunk latency |
See LICENSE for details.
Contributions are welcome! Please feel free to submit issues and pull requests.
