Add opt-in disk persistence for the KV prefix cache by aidiffuser · Pull Request #2159 · exo-explore/exo

aidiffuser · 2026-06-10T08:32:35Z

Persists the prefix cache's hot slot to disk and restores it across conversation switches and runner restarts, so returning to an earlier long conversation reuses its KV cache instead of re-prefilling. Off by default — enable with EXO_KV_DISK_PERSISTENCE=1. Supersedes #1830.

How it works

Slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/ (location overridable with EXO_KV_DISK_PATH) as slot_N_cache.safetensors + slot_N_tokens.safetensors + slot_N_meta.json, written atomically (tmp + rename, tokens last so a slot only becomes discoverable once complete).
The hot slot is flushed on conversation switch, when the runner's generation queue drains, and on shutdown.
On a cache miss, slots are matched by longest token prefix (≥ 1000 tokens); the reusable prefix is capped at the cache's stored length.
Eviction: a TTL (EXO_KV_DISK_TTL_HOURS, default 24) plus a global size cap across all models (EXO_KV_DISK_MAX_SIZE_GB, default 500) — when over the cap, the globally oldest slot is evicted first (cross-model LRU).
A janitor TTL-sweeps all model directories under the kv-cache root — at cache init and after every flush — so models that are never loaded again don't keep their slots forever.
Fail-safe by construction: any load problem logs a warning and falls back to recompute — a bad slot can never crash a runner.
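The write ordering described above (tmp + rename, tokens last) can be illustrated with a small stdlib-only sketch. It is not the PR's code: `atomic_write_bytes`, `write_slot` and `discover_slots` are hypothetical names, the payloads are opaque bytes rather than real safetensors data, and the relative order of the cache and meta files is an assumption (the description only says the tokens file is written last).

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_slot(slot_dir: Path, slot_id: int, cache: bytes, meta: dict, tokens: bytes) -> None:
    """Hypothetical slot writer: the tokens file goes last, so it marks completion."""
    slot_dir.mkdir(parents=True, exist_ok=True)
    atomic_write_bytes(slot_dir / f"slot_{slot_id}_cache.safetensors", cache)
    atomic_write_bytes(slot_dir / f"slot_{slot_id}_meta.json", json.dumps(meta).encode())
    atomic_write_bytes(slot_dir / f"slot_{slot_id}_tokens.safetensors", tokens)


def discover_slots(slot_dir: Path) -> list[int]:
    """A slot counts as present only once its tokens file exists."""
    return sorted(int(p.name.split("_")[1]) for p in slot_dir.glob("slot_*_tokens.safetensors"))
```

Because discovery keys on the last file written, a crash between the first and last rename leaves orphaned files but never a slot that looks complete.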

Architecture coverage

Cache state isn't always a pure array tree: the DSA indexer cache (DeepSeek-V3.2 / GLM MoE DSA) stores zero-width value arrays, and DeepseekV4Cache branch state holds Nones, ints and int-lists — none of which safetensors can represent. Such leaves are recorded as JSON placeholder specs in slot_N_meta.json and re-inserted on load, keeping the .safetensors itself compatible with stock mlx_lm.save_prompt_cache / load_prompt_cache for standard caches.
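To make the placeholder idea concrete, here is a minimal sketch of splitting cache leaves into "real arrays" and JSON-serialisable placeholder specs, then merging them back. Everything in it is an assumption for illustration: `FakeArray` stands in for an MLX array, and the spec layout (`kind`, `value`, `shape`, `dtype`) is invented here, not taken from the PR's `slot_N_meta.json` format.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FakeArray:
    """Hypothetical stand-in for an array; only shape and dtype matter here."""
    shape: tuple
    dtype: str = "float16"


def split_leaves(leaves: list) -> tuple:
    """Return (arrays, placeholders), both keyed by the leaf's position."""
    arrays, placeholders = {}, {}
    for i, leaf in enumerate(leaves):
        key = str(i)
        if leaf is None:
            placeholders[key] = {"kind": "none"}
        elif isinstance(leaf, int):
            placeholders[key] = {"kind": "int", "value": leaf}
        elif isinstance(leaf, list) and all(isinstance(x, int) for x in leaf):
            placeholders[key] = {"kind": "int_list", "value": list(leaf)}
        elif isinstance(leaf, FakeArray) and 0 in leaf.shape:
            # Zero-width array: keep only what is needed to rebuild it.
            placeholders[key] = {"kind": "empty_array", "shape": list(leaf.shape), "dtype": leaf.dtype}
        elif isinstance(leaf, FakeArray):
            arrays[key] = leaf
        else:
            raise TypeError(f"unsupported leaf at {i}: {type(leaf).__name__}")
    return arrays, placeholders


def merge_leaves(arrays: dict, placeholders: dict) -> list:
    """Re-insert placeholders at their original positions."""
    out: list = []
    for i in range(len(arrays) + len(placeholders)):
        key = str(i)
        if key in arrays:
            out.append(arrays[key])
            continue
        spec = placeholders[key]
        if spec["kind"] == "none":
            out.append(None)
        elif spec["kind"] in ("int", "int_list"):
            out.append(spec["value"])
        else:  # "empty_array"
            out.append(FakeArray(tuple(spec["shape"]), spec["dtype"]))
    return out
```

Only the `arrays` half would go into the tensor file; the `placeholders` half is plain JSON, which is why the tensor file can stay in the stock format for caches that have no such leaves.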

Loading reconstructs cache classes through an explicit registry, which also covers classes load_prompt_cache cannot build (DeepseekV4Cache needs its sliding-window constructor argument, recovered from meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4, SSM) are skipped, since disk slots carry no snapshots to restore from.
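The two ideas in this paragraph, an explicit class registry and the partial-prefix guard, can be sketched as follows. The stub classes, the `trimmable` attribute, the `sliding_window` key and the function names are all hypothetical; the PR's real registry and `meta_state` layout are not shown in the description.

```python
from typing import Callable


class KVCacheStub:
    """Hypothetical trimmable cache that needs no constructor arguments."""
    trimmable = True


class WindowedCacheStub:
    """Hypothetical non-trimmable cache that needs a constructor argument."""
    trimmable = False

    def __init__(self, sliding_window: int) -> None:
        self.sliding_window = sliding_window


# Explicit registry: class name -> factory taking the saved meta state.
REGISTRY: dict[str, Callable[[dict], object]] = {
    "KVCacheStub": lambda meta: KVCacheStub(),
    "WindowedCacheStub": lambda meta: WindowedCacheStub(int(meta["sliding_window"])),
}


def build_cache(class_name: str, meta_state: dict) -> object:
    """Rebuild a cache object, refusing names that are not registered."""
    try:
        factory = REGISTRY[class_name]
    except KeyError:
        raise ValueError(f"unknown cache class {class_name!r}") from None
    return factory(meta_state)


def can_use_hit(caches: list, hit_len: int, stored_len: int) -> bool:
    """A partial hit needs trimming, which non-trimmable caches cannot do."""
    if hit_len >= stored_len:
        return True  # full hit: nothing to trim
    return all(getattr(c, "trimmable", False) for c in caches)
```

An unknown class name raises instead of guessing, which fits the fail-safe behaviour described earlier: the caller can log a warning and recompute.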

Testing

Nine unit tests, no model weights needed:

round-trips for KVCache, CacheList with zero-width arrays, DeepseekV4Cache with mixed branch state, and ArraysCache with None entries
mlx-lm format compatibility in both directions (stock-written slots load here; slots written here load with stock load_prompt_cache)
the non-trimmable partial-prefix guard, prefix capping, and the opt-in default

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 (2-node), and DeepSeek-V4-Flash (2-node).

Known limitation

In multi-node instances each rank persists and matches its own shard's slots independently. Identical request streams produce identical slots, so ranks agree in practice; after asymmetric eviction or slot corruption a divergent decision is possible. Gathering the slot decision across ranks is a possible follow-up.

The trim_cache bug this work uncovered (V4 entries not restoring from snapshots) is fixed separately in #2158; this PR is independent of it and mergeable in either order.

Persist the prefix cache's hot slot to disk and restore it across conversation switches and runner restarts, so returning to an earlier long conversation reuses its KV cache instead of re-prefilling. Enable with EXO_KV_DISK_PERSISTENCE=1 (off by default).

Mechanics:
- slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/ as slot_N_cache.safetensors + slot_N_tokens.safetensors + slot_N_meta.json, written atomically (tmp + rename; tokens last so a slot only becomes discoverable once complete)
- the hot slot is flushed on conversation switch, when the runner's generation queue drains, and on shutdown
- on a cache miss, slots are matched by longest token prefix (>= 1000 tokens) and the best one is swapped in; the reusable prefix is capped at the cache's stored length
- stale slots are evicted by TTL (EXO_KV_DISK_TTL_HOURS, default 24) and a total size cap (EXO_KV_DISK_MAX_SIZE_GB, default 500)
- any load problem logs a warning and falls back to recompute

Cache state is not always a pure array tree: the DSA indexer cache in DeepSeek-V3.2/GLM-derived models stores zero-width value arrays, and DeepseekV4Cache branch state holds Nones, ints and int-lists, none of which safetensors can represent. Such leaves are recorded as JSON placeholder specs in slot_N_meta.json and re-inserted on load, keeping the .safetensors itself compatible with stock mlx-lm save_prompt_cache for standard caches.

Loading reconstructs cache classes through an explicit registry, which also covers classes load_prompt_cache cannot build (DeepseekV4Cache needs its sliding window, recovered from meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4, SSM) are skipped since disk slots carry no snapshots.

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 and DeepSeek-V4-Flash (2-node).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
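The slot-matching step from this commit message can be sketched in plain Python. The 1000-token threshold comes from the text; the function names, the in-memory `slots` mapping, and the choice to apply the threshold to the raw match length before capping are assumptions made for illustration.

```python
MIN_PREFIX_TOKENS = 1000  # threshold stated in the PR description


def common_prefix_len(a: list, b: list) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_slot(prompt: list, slots: dict, min_len: int = MIN_PREFIX_TOKENS):
    """Pick the slot with the longest shared prefix.

    `slots` maps a hypothetical slot id to (stored_tokens, cache_len).
    Returns (slot_id, reusable_len), or None if no slot reaches `min_len`.
    """
    best = None
    best_match = 0
    for slot_id, (tokens, cache_len) in slots.items():
        match = common_prefix_len(prompt, tokens)
        if match < min_len or match <= best_match:
            continue
        # Never reuse more than the cache actually holds.
        best = (slot_id, min(match, cache_len))
        best_match = match
    return best
```

For example, a 2000-token prompt checked against a slot whose tokens match for 1800 positions but whose cache holds only 1700 entries yields a reusable prefix of 1700.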

Per-flush eviction only sees its own model's dir and only runs while that model is loaded and producing new cache writes, so slots of models never loaded again lived forever. Sweep every model dir under the kv-cache root once at KVPrefixCache init, by each slot's meta.json timestamp; the own dir is left to the hot-slot-aware eviction. Also removes legacy *_tokens.npy files belonging to swept slots.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
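A janitor of this kind might look roughly like the sketch below. It assumes each slot_N_meta.json carries a numeric `timestamp` field in epoch seconds; that field name, the `sweep_expired` function and its parameters are hypothetical, since the commit message only says the sweep goes by the meta.json timestamp.

```python
import json
import time
from pathlib import Path
from typing import Optional


def sweep_expired(
    root: Path,
    ttl_hours: float,
    own_dir: Optional[Path] = None,
    now: Optional[float] = None,
) -> list:
    """Delete every file of each expired slot under root/<model_dir>/.

    `own_dir` is skipped, mirroring the text: the loaded model's directory
    is left to the hot-slot-aware eviction.
    """
    cutoff = (time.time() if now is None else now) - ttl_hours * 3600
    removed = []
    for model_dir in sorted(root.iterdir()):
        if not model_dir.is_dir() or model_dir == own_dir:
            continue
        for meta_path in sorted(model_dir.glob("slot_*_meta.json")):
            try:
                ts = float(json.loads(meta_path.read_text())["timestamp"])
            except (OSError, ValueError, KeyError, TypeError):
                continue  # unreadable meta: this sketch leaves the slot alone
            if ts >= cutoff:
                continue
            stem = meta_path.name[: -len("_meta.json")]  # e.g. "slot_3"
            # The trailing underscore keeps slot_1 from matching slot_10;
            # the glob also picks up a legacy slot_N_tokens.npy file.
            for p in sorted(model_dir.glob(f"{stem}_*")):
                p.unlink(missing_ok=True)
                removed.append(p)
    return removed
```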

EXO_KV_DISK_MAX_SIZE_GB was enforced per model dir, so N models could hold N x cap on disk while the docs read as a total. Enforce it over the whole kv-cache root instead: while the total exceeds the cap, evict the globally oldest slot regardless of owning model (cross-model LRU: a daily-driver model's fresh slots survive, abandoned models' slots go first). The flushing model's own hot slot stays excluded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
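The eviction policy in this commit can be expressed as a pure selection function over slot records. This is a sketch under assumptions: `SlotInfo`, `pick_evictions` and the `(model_dir, slot_id)` key for the hot slot are hypothetical names, and sizes and timestamps are taken as already collected from disk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotInfo:
    model_dir: str
    slot_id: int
    size_bytes: int
    timestamp: float  # last-written time; older means evicted sooner


def pick_evictions(slots: list, cap_bytes: int, hot: Optional[tuple] = None) -> list:
    """Choose slots to delete, oldest first across all models, until under the cap.

    `hot` is an optional (model_dir, slot_id) pair that is never evicted.
    """
    total = sum(s.size_bytes for s in slots)
    victims = []
    for s in sorted(slots, key=lambda s: s.timestamp):
        if total <= cap_bytes:
            break
        if hot is not None and (s.model_dir, s.slot_id) == hot:
            continue
        victims.append(s)
        total -= s.size_bytes
    return victims
```

Sorting the records from every model directory in one list is what makes the cap global: an abandoned model's old slots sort ahead of a frequently used model's fresh ones.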

… at init

A single model kept loaded for weeks never re-runs the init janitor, so other models' stale slots only fell to the global size cap. Run the cross-model TTL sweep from per-flush eviction as well.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…sk slot

The update-vs-add decision after generation used prefix_hit_length / len(new_prompt) >= 0.5, which classifies a SIBLING conversation that merely shares a long prefix (two agent sessions with a common system prompt + bootstrap, observed: 31.9k shared tokens) as a continuation. The sibling then updated the matched entry in place, inherited its disk slot id, and the next flush overwrote the other conversation's slot, so every session switch destroyed the other session's cache back to the shared prefix (observed live as two sessions ping-ponging slot_10).

Replace the ratio test with an extension test on KVPrefixCache (should_update_entry): update in place only when the hit covers the stored prompt up to the prefill-rollback slack; otherwise add a new entry, which flushes the previous hot slot and takes a fresh slot id. Applied to all three save sites (batch, sequential, disaggregated prefill server).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
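The difference between the old ratio test and the extension test is easy to see side by side. The functions below are a sketch, not the PR's `should_update_entry` (whose signature is not shown); the slack is a hypothetical parameter, and apart from the 31.9k shared-prefix figure the token counts in the usage note are made up for illustration.

```python
def ratio_test(hit_len: int, new_prompt_len: int) -> bool:
    """Old rule: treat as a continuation if the hit covers half the new prompt."""
    return hit_len / new_prompt_len >= 0.5


def extension_test(hit_len: int, stored_len: int, rollback_slack: int) -> bool:
    """New rule (sketch): the new prompt must extend the stored one.

    The hit has to cover the whole stored prompt, give or take the few tokens
    that prefill may roll back, before the entry is updated in place.
    """
    return hit_len >= stored_len - rollback_slack
```

Take a stored conversation of 40,000 tokens and a sibling with a 45,000-token prompt that shares only the 31,900-token prefix: `ratio_test(31900, 45000)` is true (about 0.71), so the old rule overwrote the stored entry, while `extension_test(31900, 40000, 2)` is false and the sibling gets its own entry. A genuine continuation that hits 39,998 of the 40,000 stored tokens still passes.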

aidiffuser and others added 5 commits June 10, 2026 10:24

aidiffuser wants to merge 5 commits into exo-explore:main from aidiffuser:kv-disk-persistence-v2
