Overview
A set of strategies to reduce API costs and manage context efficiently. Combining 1 + 4 gives the best results.
Strategies
1. Prompt Caching (Anthropic API)
Repeated prefixes are charged at 10% the normal rate. Cache the stable system prompt + memory files.
- Impact: Often 70–90% cost reduction on long-context conversations.
- Priority: Biggest single win.
2. Sliding Window with Sticky Head
Keep the system prompt + most recent N messages, drop the middle. Simple and effective.
3. Compaction
When context exceeds a threshold, summarize older messages into a shorter representation.
- Loses some fidelity but caps the bill.
4. Memory Files + Retrieval
Extract important facts to disk, query on demand instead of keeping in context.
- This is what Claude Code does internally.
5. Tool Result Truncation
If tools return large outputs (web fetches, file reads), summarize before adding to context. Don't store raw HTML.
6. Two-Tier Model
Use Haiku for context-management / summarization tasks, Opus only for actual responses.
- ~10–20× cheaper for the summarization step.
Recommendation
Start with 1 (prompt caching) + 4 (memory files + retrieval) for the best cost/complexity trade-off.
Overview
A set of strategies to reduce API costs and manage context efficiently. Combining 1 + 4 gives the best results.
Strategies
1. Prompt Caching (Anthropic API)
Repeated prefixes are charged at 10% the normal rate. Cache the stable system prompt + memory files.
2. Sliding Window with Sticky Head
Keep the system prompt + most recent N messages, drop the middle. Simple and effective.
3. Compaction
When context exceeds a threshold, summarize older messages into a shorter representation.
4. Memory Files + Retrieval
Extract important facts to disk, query on demand instead of keeping in context.
5. Tool Result Truncation
If tools return large outputs (web fetches, file reads), summarize before adding to context. Don't store raw HTML.
6. Two-Tier Model
Use Haiku for context-management / summarization tasks, Opus only for actual responses.
Recommendation
Start with 1 (prompt caching) + 4 (memory files + retrieval) for the best cost/complexity trade-off.