- Initialize single Cargo crate with library + binary:
  - Library: `llmsim` (`src/lib.rs`)
  - Binary: `llmsim` with `serve` subcommand (`src/main.rs`)
- Configure `Cargo.toml` with metadata (name, version, license = "MIT", authors)
- Add initial dependencies:
  - `tokio` (async runtime)
  - `axum` (HTTP framework)
  - `serde`/`serde_json` (serialization)
  - `tiktoken-rs` (token counting)
  - `rand` (latency randomization)
  - `tracing` (logging)
  - `clap` (CLI argument parsing)
- Create basic CI workflow (`.github/workflows/ci.yml`): format, lint, test
- Define OpenAI API types in `src/openai/types.rs`:
  - `ChatCompletionRequest`
  - `ChatCompletionResponse`
  - `ChatCompletionChunk` (for streaming)
  - `Message`, `Role`, `Usage`
  - `ToolCall`, `Function`
- Add serde derive macros with proper field naming (`#[serde(rename_all = "snake_case")]`)
- Write unit tests for serialization/deserialization against real API examples
- Create `src/tokens.rs`
- Implement `count_tokens(text: &str, model: &str) -> usize`
- Support model-to-encoding mapping (gpt-4, gpt-5, claude, etc.)
- Add fallback for unknown models
- Write tests with known token counts
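A rough, dependency-free sketch of the mapping-plus-fallback shape. The specific model prefixes and the ~4-characters-per-token heuristic are placeholder assumptions; the real implementation would delegate to `tiktoken-rs` for exact counts per encoding:

```rust
/// Map a model name to an encoding family. The prefix rules here are
/// illustrative assumptions, not the project's actual table.
fn encoding_for_model(model: &str) -> &'static str {
    match model {
        m if m.starts_with("gpt-5") || m.starts_with("o3") || m.starts_with("o4") => "o200k_base",
        m if m.starts_with("gpt-4") => "cl100k_base",
        // Claude uses its own tokenizer; approximate with a default encoding.
        _ => "cl100k_base", // fallback for unknown models
    }
}

/// Placeholder count: ~4 characters per token is a common rule of thumb.
/// The real crate would encode `text` with the chosen tiktoken encoding.
pub fn count_tokens(text: &str, model: &str) -> usize {
    let _encoding = encoding_for_model(model);
    (text.chars().count() + 3) / 4
}

fn main() {
    assert_eq!(encoding_for_model("gpt-4-turbo"), "cl100k_base");
    assert_eq!(encoding_for_model("gpt-5-mini"), "o200k_base");
    assert_eq!(encoding_for_model("totally-unknown"), "cl100k_base");
    assert!(count_tokens("hello world, this is a test", "unknown-model") > 0);
    println!("ok");
}
```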
- Create `src/latency.rs`
- Define `LatencyProfile` struct:

  ```rust
  pub struct LatencyProfile {
      pub ttft_mean_ms: u64,   // Time to first token
      pub ttft_stddev_ms: u64,
      pub tbt_mean_ms: u64,    // Time between tokens
      pub tbt_stddev_ms: u64,
  }
  ```

- Implement preset profiles:
  - `LatencyProfile::gpt5()` - flagship model
  - `LatencyProfile::gpt5_mini()` - faster
  - `LatencyProfile::o_series()` - reasoning models (o3, o4)
  - `LatencyProfile::gpt4()` - GPT-4 family
  - `LatencyProfile::claude_opus()` - Anthropic flagship
  - `LatencyProfile::claude_sonnet()` - balanced
  - `LatencyProfile::instant()` - no delay (for fast tests)
- Implement `LatencyProfile::sample_ttft()` and `sample_tbt()` using a normal distribution
- Write tests for distribution sanity
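A dependency-free sketch of the sampling. The hand-rolled LCG and Box-Muller transform are stand-ins so the example runs on its own; the real implementation would presumably use `rand_distr::Normal` with the `rand` dependency listed above:

```rust
pub struct LatencyProfile {
    pub ttft_mean_ms: u64,
    pub ttft_stddev_ms: u64,
    pub tbt_mean_ms: u64,
    pub tbt_stddev_ms: u64,
}

/// Tiny LCG standing in for `rand` so the sketch is self-contained.
struct Lcg(u64);
impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        // Top 53 bits give a uniform value in (0, 1].
        ((self.0 >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

impl LatencyProfile {
    /// Sample a normal value via Box-Muller, clamped at zero so a
    /// negative draw never becomes a negative delay.
    fn sample_normal(rng: &mut Lcg, mean: u64, stddev: u64) -> u64 {
        let (u1, u2) = (rng.next_f64(), rng.next_f64());
        let z = (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
        (mean as f64 + z * stddev as f64).max(0.0) as u64
    }

    pub fn sample_ttft(&self, rng: &mut Lcg) -> u64 {
        Self::sample_normal(rng, self.ttft_mean_ms, self.ttft_stddev_ms)
    }
}

fn main() {
    let profile = LatencyProfile { ttft_mean_ms: 500, ttft_stddev_ms: 100, tbt_mean_ms: 30, tbt_stddev_ms: 10 };
    let mut rng = Lcg(42);
    // Distribution sanity check: empirical mean should land near 500 ms.
    let mean: f64 = (0..10_000).map(|_| profile.sample_ttft(&mut rng) as f64).sum::<f64>() / 10_000.0;
    assert!((mean - 500.0).abs() < 25.0);
    println!("empirical ttft mean: {mean:.1} ms");
}
```

The clamp-at-zero detail matters regardless of which RNG is used: a normal distribution has unbounded tails, and a negative sleep duration would panic or wrap.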
- Create `src/generator.rs`
- Implement `ResponseGenerator` trait:

  ```rust
  pub trait ResponseGenerator: Send + Sync {
      fn generate(&self, request: &ChatCompletionRequest) -> String;
  }
  ```

- Implement `LoremGenerator` - generates lorem ipsum text
- Implement `EchoGenerator` - echoes back the user message
- Implement `FixedGenerator` - returns a configured fixed response
- Implement `RandomWordGenerator` - random words to a target token count
- Add configurable response length (target token count)
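A sketch of one concrete implementor. The request type is stubbed here (the real one lives in `src/openai/types.rs`), and one-token-per-word is an approximation of the target-token-count logic:

```rust
/// Stand-in for the real request type from src/openai/types.rs.
pub struct ChatCompletionRequest {
    pub model: String,
}

pub trait ResponseGenerator: Send + Sync {
    fn generate(&self, request: &ChatCompletionRequest) -> String;
}

/// Emits lorem ipsum, cycling a fixed word list until the target token
/// count is reached (approximated as one token per word).
pub struct LoremGenerator {
    pub target_tokens: usize,
}

impl ResponseGenerator for LoremGenerator {
    fn generate(&self, _request: &ChatCompletionRequest) -> String {
        const WORDS: &[&str] = &["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing", "elit"];
        (0..self.target_tokens)
            .map(|i| WORDS[i % WORDS.len()])
            .collect::<Vec<_>>()
            .join(" ")
    }
}

fn main() {
    let lorem = LoremGenerator { target_tokens: 12 };
    let req = ChatCompletionRequest { model: "gpt-4".into() };
    let text = lorem.generate(&req);
    assert_eq!(text.split_whitespace().count(), 12);
    println!("{text}");
}
```

Keeping the trait object-safe (`&self`, no generics) means the server can hold a `Box<dyn ResponseGenerator>` chosen at startup from the `--generator` flag.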
- Create `src/stream.rs`
- Implement `TokenStream` that yields `ChatCompletionChunk`s with delays
- Support SSE format (`data: {...}\n\n`)
- Handle the `[DONE]` termination message
- Integrate with latency profiles for inter-token delays
- Write integration tests
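The SSE framing itself is small enough to sketch directly. The JSON payload is hand-built here for illustration; the real code would serialize a `ChatCompletionChunk` with `serde_json`:

```rust
/// Frame one chunk payload as a Server-Sent Event line pair.
fn sse_event(json_payload: &str) -> String {
    format!("data: {json_payload}\n\n")
}

/// Terminal sentinel the OpenAI streaming API sends after the last chunk.
fn sse_done() -> String {
    "data: [DONE]\n\n".to_string()
}

fn main() {
    let chunk = r#"{"choices":[{"delta":{"content":"Hello"}}]}"#;
    let framed = sse_event(chunk);
    assert!(framed.starts_with("data: "));
    assert!(framed.ends_with("\n\n"));
    assert_eq!(sse_done(), "data: [DONE]\n\n");
    print!("{framed}{}", sse_done());
}
```

The blank line (`\n\n`) is what delimits events in SSE, so a client buffering partial reads still reassembles chunks correctly; the inter-token delay from the latency profile is simply a sleep between successive `sse_event` writes.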
- Create `src/errors.rs`
- Define `ErrorConfig`:

  ```rust
  pub struct ErrorConfig {
      pub rate_limit_rate: f64,   // 0.0-1.0
      pub server_error_rate: f64,
      pub timeout_rate: f64,
      pub timeout_after_ms: u64,
  }
  ```

- Implement error decision logic
- Create proper OpenAI-format error responses
- Write tests for error rate distribution
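One way the decision logic could look: partition `[0, 1)` into cumulative bands, one per error kind. Taking the uniform roll as a parameter (rather than calling `rand` internally) is a choice made here so the logic is deterministic and directly testable; the caller would supply `rand::random::<f64>()`:

```rust
pub struct ErrorConfig {
    pub rate_limit_rate: f64, // 0.0-1.0
    pub server_error_rate: f64,
    pub timeout_rate: f64,
    pub timeout_after_ms: u64,
}

#[derive(Debug, PartialEq)]
pub enum InjectedError {
    RateLimit,   // respond 429
    ServerError, // respond 500
    Timeout,     // hang for timeout_after_ms
}

impl ErrorConfig {
    /// Decide whether to inject an error, given a uniform roll in [0, 1).
    /// Bands are cumulative: [0, rl) -> 429, [rl, rl+se) -> 500, etc.
    pub fn decide(&self, roll: f64) -> Option<InjectedError> {
        let mut threshold = self.rate_limit_rate;
        if roll < threshold { return Some(InjectedError::RateLimit); }
        threshold += self.server_error_rate;
        if roll < threshold { return Some(InjectedError::ServerError); }
        threshold += self.timeout_rate;
        if roll < threshold { return Some(InjectedError::Timeout); }
        None
    }
}

fn main() {
    let cfg = ErrorConfig { rate_limit_rate: 0.01, server_error_rate: 0.005, timeout_rate: 0.005, timeout_after_ms: 30_000 };
    assert_eq!(cfg.decide(0.005), Some(InjectedError::RateLimit));
    assert_eq!(cfg.decide(0.012), Some(InjectedError::ServerError));
    assert_eq!(cfg.decide(0.018), Some(InjectedError::Timeout));
    assert_eq!(cfg.decide(0.5), None);
    println!("ok");
}
```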
- Create `src/rate_limit.rs`
- Implement token bucket algorithm
- Support requests-per-minute and tokens-per-minute limits
- Return proper 429 responses with a `Retry-After` header
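A minimal token bucket sketch. The explicit `elapsed_secs` parameter is a choice made here for determinism; the server would derive it from `std::time::Instant`. The same structure serves both limits: one bucket counts requests, another counts tokens:

```rust
/// Minimal token bucket: up to `capacity` units, refilled at `rate_per_min`.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate_per_min: f64,
}

impl TokenBucket {
    pub fn new(capacity: f64, rate_per_min: f64) -> Self {
        Self { capacity, tokens: capacity, rate_per_min }
    }

    /// Try to take `cost` units; on refusal, returns Err(retry_after_secs),
    /// which maps onto a 429 response with a Retry-After header.
    pub fn try_acquire(&mut self, cost: f64, elapsed_secs: f64) -> Result<(), u64> {
        // Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = (self.tokens + elapsed_secs * self.rate_per_min / 60.0).min(self.capacity);
        if self.tokens >= cost {
            self.tokens -= cost;
            Ok(())
        } else {
            let deficit = cost - self.tokens;
            Err((deficit * 60.0 / self.rate_per_min).ceil() as u64)
        }
    }
}

fn main() {
    // 60 requests/minute, burst capacity of 2.
    let mut bucket = TokenBucket::new(2.0, 60.0);
    assert!(bucket.try_acquire(1.0, 0.0).is_ok());
    assert!(bucket.try_acquire(1.0, 0.0).is_ok());
    // Bucket empty: refused, with a Retry-After hint of ~1 second.
    assert_eq!(bucket.try_acquire(1.0, 0.0), Err(1));
    // After one second, one unit has refilled.
    assert!(bucket.try_acquire(1.0, 1.0).is_ok());
    println!("ok");
}
```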
- Create `src/main.rs` with clap subcommand structure
- Implement the `llmsim serve` subcommand with CLI options:
  - `--port` (default: 8080)
  - `--host` (default: 0.0.0.0)
  - `--config` (optional config file path)
  - `--generator` (lorem, echo, random, fixed:text)
  - `--target-tokens` (default: 100)
  - Note: Latency is auto-derived from the model in each request
- Create `src/cli/` module for server functionality
- Set up Axum router with graceful shutdown
- Add health check endpoint (`GET /health`)
- Add tracing/logging setup
- Implement `POST /v1/chat/completions`
- Parse `ChatCompletionRequest`
- Handle `stream: true` vs `stream: false`
- Return proper `ChatCompletionResponse` with usage
- Implement SSE streaming response
- Add request validation
- Write integration tests with reqwest
- Implement `GET /v1/models`
- Return list of "available" models with metadata (GPT-5, o-series, Claude, etc.)
- Implement `GET /v1/models/{model_id}`
- Create `src/cli/config.rs`
- Support a YAML config file:

  ```yaml
  server:
    port: 8080
    host: "0.0.0.0"
  latency:
    profile: "gpt5"  # or custom values
  response:
    generator: "lorem"
    target_tokens: 100
  errors:
    rate_limit_rate: 0.01
    server_error_rate: 0.001
  ```

- CLI arguments override config file values
- Validate configuration on startup
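The override rule can be expressed as a small `Option`-merge. Field names here are hypothetical stand-ins for whatever clap and the YAML loader actually produce; the point is the precedence chain:

```rust
/// Values as loaded from the YAML file (any may be absent).
#[derive(Default)]
struct FileConfig {
    port: Option<u16>,
    generator: Option<String>,
}

/// Values the user passed on the command line (clap would produce these).
#[derive(Default)]
struct CliArgs {
    port: Option<u16>,
    generator: Option<String>,
}

struct Config {
    port: u16,
    generator: String,
}

/// Precedence: CLI flag > config file > built-in default.
fn resolve(cli: CliArgs, file: FileConfig) -> Config {
    Config {
        port: cli.port.or(file.port).unwrap_or(8080),
        generator: cli.generator.or(file.generator).unwrap_or_else(|| "lorem".to_string()),
    }
}

fn main() {
    let file = FileConfig { port: Some(9000), generator: Some("echo".into()) };
    let cli = CliArgs { port: Some(3000), generator: None };
    let cfg = resolve(cli, file);
    assert_eq!(cfg.port, 3000);        // CLI wins over file
    assert_eq!(cfg.generator, "echo"); // file wins over built-in default
    println!("port={} generator={}", cfg.port, cfg.generator);
}
```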
- Create `Dockerfile` (multi-stage build)
- Create `docker-compose.yml` for easy local testing
- Document Docker usage in README
- Extend types with `Tool`, `ToolChoice`, `FunctionCall`
- Parse tool definitions from the request
- Validate tool call format
- Implement `ToolCallGenerator`:
  - Random tool selection from available tools
  - Generate plausible arguments based on the parameter schema
- Support `tool_choice`: `"auto"`, `"none"`, `{"type": "function", "function": {"name": "..."}}`
- Return proper `tool_calls` array in the response
- Handle `role: "tool"` messages in the conversation
- Track tool call IDs
- Generate appropriate follow-up responses
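A sketch of tool selection with ID tracking. The field layout loosely mirrors OpenAI's `tool_calls` entries, but the counter-based ID scheme and the stubbed-out argument generation are illustration-only assumptions (real IDs are random, and arguments would be derived from the tool's JSON schema):

```rust
struct ToolDef {
    name: String,
}

struct ToolCall {
    id: String,
    name: String,
    arguments: String, // JSON-encoded arguments
}

struct ToolCallGenerator {
    next_id: u64,
}

impl ToolCallGenerator {
    /// Pick a tool (the `pick` index stands in for a random draw) and
    /// emit a call with a fresh, trackable id.
    fn generate(&mut self, tools: &[ToolDef], pick: usize) -> Option<ToolCall> {
        let tool = tools.get(pick % tools.len().max(1))?;
        self.next_id += 1;
        Some(ToolCall {
            // Real ids look like "call_<random>"; a counter stands in here.
            id: format!("call_{:08}", self.next_id),
            name: tool.name.clone(),
            arguments: "{}".to_string(), // TODO: derive from parameter schema
        })
    }
}

fn main() {
    let tools = vec![ToolDef { name: "get_weather".into() }];
    let mut generator = ToolCallGenerator { next_id: 0 };
    let call = generator.generate(&tools, 0).unwrap();
    assert_eq!(call.id, "call_00000001");
    assert_eq!(call.name, "get_weather");
    assert_eq!(call.arguments, "{}");
    // An empty tool list yields no call at all.
    assert!(generator.generate(&[], 0).is_none());
    println!("{} -> {}", call.id, call.name);
}
```

Retaining the issued IDs is what lets the simulator later accept `role: "tool"` messages that reference them and produce a coherent follow-up turn.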
- Create `llmsim/src/anthropic/types.rs`
- Implement the Anthropic message format
- Add `/v1/messages` endpoint
- Support the Anthropic streaming format (different from OpenAI's)
- Handle Anthropic-specific headers (`x-api-key`, `anthropic-version`)
- Create Responses API specification (`specs/responses-api.md`)
- Define Responses API types (`src/openai/responses.rs`):
  - `ResponsesRequest` with input, model, instructions, etc.
  - `ResponsesResponse` with output items, usage, status
  - `InputItem` and `OutputItem` types
  - Streaming chunk types for SSE
- Implement `POST /v1/responses` endpoint:
  - Parse string or array input
  - Generate simulated response with output items
  - Return token usage statistics
- Implement streaming for the Responses API:
  - SSE event types: response.created, response.output_text.delta, etc.
  - Proper sequence numbering for deltas
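The sequence-numbering requirement can be sketched as a small emitter that stamps each SSE event with a monotonically increasing counter. The event names follow the list above; the JSON body is hand-assembled for illustration rather than produced by the real chunk types:

```rust
/// Emits Responses-API-style SSE events with increasing sequence numbers.
struct EventStream {
    sequence_number: u64,
}

impl EventStream {
    fn emit(&mut self, event: &str, data_json: &str) -> String {
        let n = self.sequence_number;
        self.sequence_number += 1;
        format!(
            "event: {event}\ndata: {{\"type\":\"{event}\",\"sequence_number\":{n},{body}}}\n\n",
            body = data_json
        )
    }
}

fn main() {
    let mut stream = EventStream { sequence_number: 0 };
    let created = stream.emit("response.created", "\"response\":{}");
    let delta = stream.emit("response.output_text.delta", "\"delta\":\"Hi\"");
    assert!(created.contains("\"sequence_number\":0"));
    assert!(delta.contains("\"sequence_number\":1"));
    assert!(delta.starts_with("event: response.output_text.delta\n"));
    print!("{created}{delta}");
}
```

Clients use the sequence number to detect dropped or reordered deltas, so the counter must be per-response, not global.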
- Add examples for the Responses API:
  - Python example (`examples/responses_client.py`)
  - Rust example (`examples/responses_usage.rs`)
- Implement `/v1/threads` endpoints
- Implement `/v1/threads/{thread_id}/messages`
- Implement `/v1/threads/{thread_id}/runs`
- Support run streaming
- Create `llmsim/src/gemini/types.rs`
- Implement the Gemini message format
- Add `/v1beta/models/{model}:generateContent` endpoint
- Add `/v1beta/models/{model}:streamGenerateContent` endpoint
- Create mock configuration format:

  ```yaml
  mocks:
    - match:
        content_contains: "weather"
      response:
        content: "The weather is sunny and 72°F."
    - match:
        model: "gpt-4"
        system_contains: "json"
      response:
        content: '{"result": "mocked"}'
  ```

- Implement pattern matching engine
- Support regex patterns
- Add mock priority/ordering
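A sketch of the matching engine's core. Two semantics are assumed here and should be treated as design choices rather than settled facts: all conditions present on a rule must match (AND), and the first matching rule wins, so file order doubles as priority:

```rust
/// One mock rule; absent conditions match anything.
struct MockRule {
    content_contains: Option<String>,
    model: Option<String>,
    response: String,
}

/// First-match-wins lookup over the rules, in file order.
fn find_mock<'a>(rules: &'a [MockRule], model: &str, content: &str) -> Option<&'a str> {
    rules
        .iter()
        .find(|r| {
            r.content_contains.as_deref().map_or(true, |needle| content.contains(needle))
                && r.model.as_deref().map_or(true, |m| m == model)
        })
        .map(|r| r.response.as_str())
}

fn main() {
    let rules = vec![
        MockRule {
            content_contains: Some("weather".into()),
            model: None,
            response: "The weather is sunny and 72°F.".into(),
        },
        MockRule {
            content_contains: None,
            model: Some("gpt-4".into()),
            response: r#"{"result": "mocked"}"#.into(),
        },
    ];
    assert_eq!(find_mock(&rules, "gpt-5", "what's the weather?"), Some("The weather is sunny and 72°F."));
    assert_eq!(find_mock(&rules, "gpt-4", "hello"), Some(r#"{"result": "mocked"}"#));
    assert_eq!(find_mock(&rules, "gpt-5", "hello"), None);
    println!("ok");
}
```

Regex support would replace the `contains` checks with precompiled patterns; no match falls through to the configured generator.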
- Create `src/stats.rs` with thread-safe atomic counters:
  - Request metrics: total, active, streaming, non-streaming
  - Token metrics: prompt, completion, total
  - Error tracking by status code (429, 5xx, 504)
  - Latency: average, min, max
  - Per-model request counts
  - RPS: rolling 60-second window calculation
- Add stats endpoint (`GET /llmsim/stats`) returning a JSON snapshot
- Create `src/tui/` module with a Ratatui dashboard:
  - `app.rs`: event loop, state management, HTTP polling
  - `ui.rs`: widget layout and rendering
- TUI features:
  - Real-time updating (200ms refresh)
  - Request and token statistics panels
  - Latency and error metrics
  - RPS and token rate sparkline charts
  - Model distribution bar chart
  - Keyboard controls (q=quit, r=refresh)
- Add `--tui` flag to the `llmsim serve` command
- Add `on_complete` callback to `TokenStreamBuilder` for streaming stats
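The thread-safe counters described above, using `std::sync::atomic` with relaxed ordering as the decision log specifies. The counter subset shown is abbreviated; relaxed ordering is sound here because each counter is independent and only totals are ever read:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

#[derive(Default)]
pub struct Stats {
    pub requests_total: AtomicU64,
    pub requests_active: AtomicU64,
    pub tokens_completion: AtomicU64,
}

impl Stats {
    pub fn record_request_start(&self) {
        self.requests_total.fetch_add(1, Ordering::Relaxed);
        self.requests_active.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_request_end(&self, completion_tokens: u64) {
        self.requests_active.fetch_sub(1, Ordering::Relaxed);
        self.tokens_completion.fetch_add(completion_tokens, Ordering::Relaxed);
    }
}

fn main() {
    let stats = Arc::new(Stats::default());
    // Hammer the counters from 8 threads to show they stay consistent.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let s = stats.clone();
            thread::spawn(move || {
                for _ in 0..1000 {
                    s.record_request_start();
                    s.record_request_end(100);
                }
            })
        })
        .collect();
    for h in handles { h.join().unwrap(); }
    assert_eq!(stats.requests_total.load(Ordering::Relaxed), 8000);
    assert_eq!(stats.requests_active.load(Ordering::Relaxed), 0);
    assert_eq!(stats.tokens_completion.load(Ordering::Relaxed), 800_000);
    println!("ok");
}
```

The `GET /llmsim/stats` handler then just `load`s each counter into a serializable snapshot, with no locking on the request path.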
- Add Prometheus metrics endpoint (`/metrics`)
- Track:
  - Request count by endpoint and model
  - Response latency histograms
  - Token counts (input/output)
  - Error rates
  - Active connections
- Add structured logging with request IDs
- Implement proxy mode to real APIs
- Record requests/responses to file
- Replay recorded sessions
- Anonymize sensitive data in recordings
- Write comprehensive README.md:
  - Quick start
  - Installation (cargo, binary, Docker)
  - Configuration reference
  - API compatibility matrix
  - Examples
- Add `docs/` folder with detailed guides
- Generate API documentation with rustdoc
- Achieve 80%+ code coverage
- Add load tests using k6 or similar
- Test against real client libraries (openai-python, anthropic-sdk)
- Fuzz testing for parser robustness
- Set up GitHub releases with binaries (Linux, macOS, Windows)
- Publish to crates.io
- Create Homebrew formula
- Announce on relevant communities
| Milestone | Description | Target Deliverable |
|---|---|---|
| M1 | Foundation | Compiling workspace with types |
| M2 | Core Library | Token counting, latency, generators work |
| M3 | Basic Server | OpenAI chat completions endpoint works |
| M4 | Tool Calling | Function calling support |
| M5 | Multi-API | Anthropic + Gemini support |
| M6 | Advanced | Mocking, metrics, record/replay |
| M7 | Release | Published, documented, tested |
Document significant technical decisions here as implementation progresses:
- Axum over Actix-web: Axum is simpler, well-integrated with tokio, and has good streaming support
- tiktoken-rs: Direct port of OpenAI's tokenizer, ensures accurate token counts
- YAML for config: More readable than JSON, better for complex configurations
- Single crate with lib + bin: Simpler structure with `llmsim` as the library and `llmsim serve` as the CLI subcommand. Avoids workspace complexity while still exposing the library for programmatic use
- Clap subcommands: Using the `llmsim serve` pattern allows future expansion with additional commands (e.g., `llmsim mock`, `llmsim record`)
- Model list from models.dev: GPT-5 family, o-series reasoning models, and Claude models based on current production models
- Ratatui for TUI: Leading Rust TUI library (fork of tui-rs), used by Codex CLI. Provides sub-millisecond rendering, rich widgets (sparklines, bar charts), and good documentation
- Atomic counters for stats: Thread-safe statistics using `std::sync::atomic` with relaxed ordering for minimal contention
- Stats under /llmsim prefix: Keeps LLMSim-specific endpoints separate from the OpenAI-compatible /v1 routes
- TUI as --tui flag: Integrate the dashboard into the serve command rather than a separate subcommand for simpler UX