Add opt-in disk persistence for the KV prefix cache#2159
Open
aidiffuser wants to merge 5 commits into
Persist the prefix cache's hot slot to disk and restore it across conversation switches and runner restarts, so returning to an earlier long conversation reuses its KV cache instead of re-prefilling. Enable with EXO_KV_DISK_PERSISTENCE=1 (off by default).

Mechanics:
- slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/ as slot_N_cache.safetensors + slot_N_tokens.safetensors + slot_N_meta.json, written atomically (tmp + rename; tokens last so a slot only becomes discoverable once complete)
- the hot slot is flushed on conversation switch, when the runner's generation queue drains, and on shutdown
- on a cache miss, slots are matched by longest token prefix (>= 1000 tokens) and the best one is swapped in; the reusable prefix is capped at the cache's stored length
- stale slots are evicted by TTL (EXO_KV_DISK_TTL_HOURS, default 24) and a total size cap (EXO_KV_DISK_MAX_SIZE_GB, default 500)
- any load problem logs a warning and falls back to recompute

Cache state is not always a pure array tree: the DSA indexer cache in DeepSeek-V3.2/GLM-derived models stores zero-width value arrays, and DeepseekV4Cache branch state holds Nones, ints and int-lists, none of which safetensors can represent. Such leaves are recorded as JSON placeholder specs in slot_N_meta.json and re-inserted on load, keeping the .safetensors itself compatible with stock mlx-lm save_prompt_cache for standard caches.

Loading reconstructs cache classes through an explicit registry, which also covers classes load_prompt_cache cannot build (DeepseekV4Cache needs its sliding window, recovered from meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4, SSM) are skipped since disk slots carry no snapshots.

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 and DeepSeek-V4-Flash (2-node).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
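The "tmp + rename, tokens last" write order from this commit can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `write_slot`, `discover_slots` and `_atomic_write` are hypothetical names, payloads are plain bytes rather than real safetensors data, and discovery is assumed to key off the tokens file.

```python
import json
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename it over `path`."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_slot(slot_dir: Path, slot_id: int, cache: bytes, meta: dict, tokens: bytes) -> None:
    """Hypothetical helper: persist one slot, writing the tokens file last."""
    slot_dir.mkdir(parents=True, exist_ok=True)
    _atomic_write(slot_dir / f"slot_{slot_id}_cache.safetensors", cache)
    _atomic_write(slot_dir / f"slot_{slot_id}_meta.json", json.dumps(meta).encode())
    # Tokens go last: a reader that keys discovery off this file never
    # sees a slot whose cache or meta is still missing.
    _atomic_write(slot_dir / f"slot_{slot_id}_tokens.safetensors", tokens)


def discover_slots(slot_dir: Path) -> list[int]:
    """Hypothetical helper: a slot counts as present only if its tokens file exists."""
    return sorted(
        int(p.name.split("_")[1]) for p in slot_dir.glob("slot_*_tokens.safetensors")
    )
```

A crash between the cache write and the tokens write leaves an orphaned cache file but no discoverable slot, which is the property the commit message describes.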
Per-flush eviction only sees its own model's dir and only runs while that model is loaded and producing new cache writes, so slots of models never loaded again lived forever. Sweep every model dir under the kv-cache root once at KVPrefixCache init, by each slot's meta.json timestamp; the own dir is left to the hot-slot-aware eviction. Also removes legacy *_tokens.npy files belonging to swept slots.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
EXO_KV_DISK_MAX_SIZE_GB was enforced per model dir, so N models could hold N x cap on disk while the docs read as a total. Enforce it over the whole kv-cache root instead: while the total exceeds the cap, evict the globally oldest slot regardless of owning model (cross-model LRU: a daily-driver model's fresh slots survive, abandoned models' slots go first). The flushing model's own hot slot stays excluded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
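A rough sketch of the global-cap policy described in this commit, assuming each slot is summarised by its owning model dir, size and a timestamp. `SlotInfo`, `pick_evictions` and the `last_used` field are hypothetical; the PR's actual data structures are not shown on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotInfo:
    model_dir: str     # hashed model directory name
    slot_id: int
    size_bytes: int
    last_used: float   # assumed to come from the slot's meta.json timestamp


def pick_evictions(
    slots: list[SlotInfo],
    cap_bytes: int,
    hot: Optional[tuple[str, int]] = None,
) -> list[SlotInfo]:
    """Return the slots to delete, oldest first across all models, until under the cap."""
    total = sum(s.size_bytes for s in slots)
    victims: list[SlotInfo] = []
    for slot in sorted(slots, key=lambda s: s.last_used):
        if total <= cap_bytes:
            break
        if hot is not None and (slot.model_dir, slot.slot_id) == hot:
            continue  # the flushing model's own hot slot is never evicted
        victims.append(slot)
        total -= slot.size_bytes
    return victims
```

Because the ordering is global rather than per model dir, a model that is never loaded again loses its slots before a frequently used model does, even when the latter occupies more space.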
… at init

A single model kept loaded for weeks never re-runs the init janitor, so other models' stale slots only fell to the global size cap. Run the cross-model TTL sweep from per-flush eviction as well.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sk slot

The update-vs-add decision after generation used prefix_hit_length / len(new_prompt) >= 0.5, which classifies a SIBLING conversation that merely shares a long prefix (two agent sessions with a common system prompt + bootstrap, observed: 31.9k shared tokens) as a continuation. The sibling then updated the matched entry in place, inherited its disk slot id, and the next flush overwrote the other conversation's slot: every session switch destroyed the other session's cache back to the shared prefix (observed live as two sessions ping-ponging slot_10).

Replace the ratio test with an extension test on KVPrefixCache (should_update_entry): update in place only when the hit covers the stored prompt up to the prefill-rollback slack; otherwise add a new entry, which flushes the previous hot slot and takes a fresh slot id. Applied to all three save sites (batch, sequential, disaggregated prefill server).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
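The difference between the two tests can be shown in a few lines. The function names and signatures here are hypothetical (the real should_update_entry is a KVPrefixCache method whose signature is not shown on this page), and every number except the 31.9k shared prefix is an assumed value chosen for illustration.

```python
def ratio_test(prefix_hit_length: int, new_prompt_len: int) -> bool:
    """Old rule: treat the request as a continuation if the hit covers half the new prompt."""
    return prefix_hit_length / new_prompt_len >= 0.5


def extension_test(prefix_hit_length: int, stored_prompt_len: int, rollback_slack: int) -> bool:
    """New rule (sketch): update in place only if the hit covers the stored prompt,
    allowing for the few tokens rolled back before prefill."""
    return prefix_hit_length >= stored_prompt_len - rollback_slack


# Sibling session: shares 31,900 tokens with a stored 50,000-token conversation.
sibling_hit, stored_len, sibling_prompt_len = 31_900, 50_000, 45_000
# True continuation: the new prompt extends the stored one (hit is nearly all of it).
cont_hit, cont_prompt_len = 49_995, 52_000
slack = 8  # assumed value for the prefill-rollback slack

print(ratio_test(sibling_hit, sibling_prompt_len))      # True  (misclassified as continuation)
print(extension_test(sibling_hit, stored_len, slack))   # False (gets a new entry and slot id)
print(extension_test(cont_hit, stored_len, slack))      # True  (updated in place)
```

The ratio depends only on the new prompt, so any two conversations with a long common prefix pass it; the extension test compares against the stored prompt, which a sibling does not cover.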
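Persists the prefix cache's hot slot to disk and restores it across conversation switches and runner restarts, so returning to an earlier long conversation reuses its KV cache instead of re-prefilling. Off by default; enable with EXO_KV_DISK_PERSISTENCE=1. Supersedes #1830.

How it works
- Slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/ (location overridable with EXO_KV_DISK_PATH) as slot_N_cache.safetensors + slot_N_tokens.safetensors + slot_N_meta.json, written atomically (tmp + rename, tokens last so a slot only becomes discoverable once complete).
- The hot slot is flushed on conversation switch, when the runner's generation queue drains, and on shutdown.
- On a cache miss, slots are matched by longest token prefix (>= 1000 tokens) and the best one is swapped in; the reusable prefix is capped at the cache's stored length.
- Stale slots are evicted by TTL (EXO_KV_DISK_TTL_HOURS, default 24) plus a global size cap across all models (EXO_KV_DISK_MAX_SIZE_GB, default 500); when over the cap, the globally oldest slot is evicted first (cross-model LRU).
- Any load problem logs a warning and falls back to recompute.

Architecture coverage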
Cache state isn't always a pure array tree: the DSA indexer cache (DeepSeek-V3.2 / GLM MoE DSA) stores zero-width value arrays, and DeepseekV4Cache branch state holds Nones, ints and int-lists, none of which safetensors can represent. Such leaves are recorded as JSON placeholder specs in slot_N_meta.json and re-inserted on load, keeping the .safetensors itself compatible with stock mlx_lm.save_prompt_cache / load_prompt_cache for standard caches.

Loading reconstructs cache classes through an explicit registry, which also covers classes load_prompt_cache cannot build (DeepseekV4Cache needs its sliding-window constructor argument, recovered from meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4, SSM) are skipped, since disk slots carry no snapshots to restore from.

Testing
Nine unit tests, no model weights needed:
- KVCache, CacheList with zero-width arrays, DeepseekV4Cache with mixed branch state, and ArraysCache with None entries
- … load_prompt_cache)

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 (2-node), and DeepSeek-V4-Flash (2-node).
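The placeholder mechanism described under Architecture coverage might look roughly like the following round trip. Everything here is an assumption made for illustration: `Arr` is a stdlib stand-in for an MLX array, the spec layout (`kind`, `shape`, `value`) is invented, and `split_leaves` / `merge_leaves` are hypothetical names rather than functions from the PR.

```python
import json
import math
from dataclasses import dataclass, field


@dataclass
class Arr:
    """Stand-in for an array; only shape and dtype matter in this sketch."""
    shape: tuple
    dtype: str = "float16"
    data: list = field(default_factory=list)


def split_leaves(leaves: dict) -> tuple[dict, dict]:
    """Separate leaves that can go into the tensor file from JSON placeholder specs."""
    tensors, specs = {}, {}
    for key, leaf in leaves.items():
        if isinstance(leaf, Arr) and math.prod(leaf.shape) > 0:
            tensors[key] = leaf
        elif isinstance(leaf, Arr):  # zero-width array: record shape and dtype only
            specs[key] = {"kind": "empty_array", "shape": list(leaf.shape), "dtype": leaf.dtype}
        elif leaf is None:
            specs[key] = {"kind": "none"}
        elif isinstance(leaf, int) and not isinstance(leaf, bool):
            specs[key] = {"kind": "int", "value": leaf}
        elif isinstance(leaf, list) and all(isinstance(v, int) for v in leaf):
            specs[key] = {"kind": "int_list", "value": leaf}
        else:
            raise TypeError(f"unsupported leaf at {key!r}: {type(leaf).__name__}")
    return tensors, specs


def merge_leaves(tensors: dict, specs: dict) -> dict:
    """Re-insert placeholder leaves next to the loaded tensors."""
    out = dict(tensors)
    for key, spec in specs.items():
        kind = spec["kind"]
        if kind == "empty_array":
            out[key] = Arr(tuple(spec["shape"]), spec["dtype"])
        elif kind == "none":
            out[key] = None
        elif kind in ("int", "int_list"):
            out[key] = spec["value"]
        else:
            raise ValueError(f"unknown placeholder kind: {kind}")
    return out
```

Because the specs survive a `json.dumps` / `json.loads` cycle, they can sit in a metadata file while the tensor file holds only ordinary arrays, which is what keeps it readable by tools that know nothing about placeholders.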
Known limitation
In multi-node instances each rank persists and matches its own shard's slots independently. Identical request streams produce identical slots, so ranks agree in practice; after asymmetric eviction or slot corruption a divergent decision is possible. Gathering the slot decision across ranks is a possible follow-up.
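To make the divergence concrete, here is a sketch of the longest-token-prefix match with the 1000-token minimum and the cap at the cache's stored length, run on two ranks whose slot sets differ. `best_slot` and `common_prefix_len` are hypothetical names, the slot contents are made up, and applying the minimum before the cap is an assumption.

```python
from typing import Optional, Sequence

MIN_PREFIX_TOKENS = 1000  # threshold quoted in the PR description


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_slot(
    prompt: Sequence[int],
    slots: dict[int, tuple[Sequence[int], int]],
    min_match: int = MIN_PREFIX_TOKENS,
) -> Optional[tuple[int, int]]:
    """Pick the slot with the longest matching prefix.

    `slots` maps slot id -> (stored tokens, stored cache length).
    Returns (slot id, reusable prefix length) or None if nothing qualifies.
    """
    best: Optional[tuple[int, int]] = None
    best_match = 0
    for slot_id, (tokens, cache_len) in slots.items():
        match = common_prefix_len(prompt, tokens)
        if match < min_match or match <= best_match:
            continue
        best_match = match
        best = (slot_id, min(match, cache_len))  # never reuse more than the cache holds
    return best


prompt = list(range(3000))
rank0 = {1: (list(range(1500)), 1500), 2: (list(range(2500)), 2400)}
rank1 = {1: (list(range(1500)), 1500)}  # slot 2 was evicted on this rank only

print(best_slot(prompt, rank0))  # (2, 2400)
print(best_slot(prompt, rank1))  # (1, 1500)
```

With identical slot sets the two calls agree; after the asymmetric eviction the ranks would resume from different prefix lengths, which is the case a cross-rank gather of the decision would rule out.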
The trim_cache bug this work uncovered (V4 entries not restoring from snapshots) is fixed separately in #2158; this PR is independent of it and mergeable in either order.