
Add opt-in disk persistence for the KV prefix cache #2159

Open
aidiffuser wants to merge 5 commits into exo-explore:main from aidiffuser:kv-disk-persistence-v2


aidiffuser commented Jun 10, 2026


Persists the prefix cache's hot slot to disk and restores it across conversation switches and runner restarts, so returning to an earlier long conversation reuses its KV cache instead of re-prefilling. Off by default — enable with EXO_KV_DISK_PERSISTENCE=1. Supersedes #1830.

How it works

  • Slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/ (location overridable with EXO_KV_DISK_PATH) as slot_N_cache.safetensors + slot_N_tokens.safetensors + slot_N_meta.json, written atomically (tmp + rename, tokens last so a slot only becomes discoverable once complete).
  • The hot slot is flushed on conversation switch, when the runner's generation queue drains, and on shutdown.
  • On a cache miss, slots are matched by longest token prefix (≥ 1000 tokens); the reusable prefix is capped at the cache's stored length.
  • Eviction: a TTL (EXO_KV_DISK_TTL_HOURS, default 24) plus a global size cap across all models (EXO_KV_DISK_MAX_SIZE_GB, default 500) — when over the cap, the globally oldest slot is evicted first (cross-model LRU).
  • A janitor TTL-sweeps all model directories under the kv-cache root — at cache init and after every flush — so models that are never loaded again don't keep their slots forever.
  • Fail-safe by construction: any load problem logs a warning and falls back to recompute — a bad slot can never crash a runner.
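The write ordering described above (tmp + rename, tokens last) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `_atomic_write`, `write_slot`, and `discover_slots` are hypothetical names, payloads are plain bytes rather than real safetensors data, and discovery by the tokens file is an assumption drawn from the "tokens last" remark.

```python
import json
import os
import tempfile
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, then rename into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Best-effort cleanup so failed writes leave no stray temp files.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def write_slot(slot_dir: Path, slot_id: int, cache: bytes, meta: dict, tokens: bytes) -> None:
    """Hypothetical helper: persist one slot, tokens file last."""
    slot_dir.mkdir(parents=True, exist_ok=True)
    _atomic_write(slot_dir / f"slot_{slot_id}_cache.safetensors", cache)
    _atomic_write(slot_dir / f"slot_{slot_id}_meta.json", json.dumps(meta).encode())
    # The tokens file goes last: a reader that discovers slots by this file
    # never sees a slot whose cache or meta is still missing.
    _atomic_write(slot_dir / f"slot_{slot_id}_tokens.safetensors", tokens)


def discover_slots(slot_dir: Path) -> list[int]:
    """Assumed discovery rule: a slot exists once its tokens file exists."""
    ids = [int(p.name.split("_")[1]) for p in slot_dir.glob("slot_*_tokens.safetensors")]
    return sorted(ids)
```

A crash between any two of the three writes leaves at most an orphaned cache or meta file, which a later sweep can remove; it never leaves a discoverable but incomplete slot.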

Architecture coverage

Cache state isn't always a pure array tree: the DSA indexer cache (DeepSeek-V3.2 / GLM MoE DSA) stores zero-width value arrays, and DeepseekV4Cache branch state holds Nones, ints and int-lists — none of which safetensors can represent. Such leaves are recorded as JSON placeholder specs in slot_N_meta.json and re-inserted on load, keeping the .safetensors itself compatible with stock mlx_lm.save_prompt_cache / load_prompt_cache for standard caches.
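As a rough illustration of the placeholder idea, the sketch below splits a flat cache state into tensor leaves (destined for the .safetensors file) and JSON-serializable placeholder specs (destined for the meta file), then merges them back. Everything here is assumed for illustration: `Arr` is a stand-in for a real array type, the `kind` vocabulary and the flat key layout are invented, and the PR's actual spec format is not shown in this description.

```python
import json
from dataclasses import dataclass


@dataclass
class Arr:
    """Stand-in for an array; only shape and dtype matter in this sketch."""
    shape: tuple
    dtype: str = "float16"


def split_leaves(state: dict) -> tuple[dict, dict]:
    """Separate storable tensors from leaves that need a JSON placeholder."""
    tensors, placeholders = {}, {}
    for key, leaf in state.items():
        if leaf is None:
            placeholders[key] = {"kind": "none"}
        elif isinstance(leaf, int) and not isinstance(leaf, bool):
            placeholders[key] = {"kind": "int", "value": leaf}
        elif isinstance(leaf, list) and all(isinstance(v, int) for v in leaf):
            placeholders[key] = {"kind": "int_list", "value": leaf}
        elif isinstance(leaf, Arr) and 0 in leaf.shape:
            # Zero-width array: record shape and dtype instead of data.
            placeholders[key] = {
                "kind": "empty_array",
                "shape": list(leaf.shape),
                "dtype": leaf.dtype,
            }
        else:
            tensors[key] = leaf
    return tensors, placeholders


def merge_leaves(tensors: dict, placeholders: dict) -> dict:
    """Re-insert placeholder leaves next to the loaded tensors."""
    state = dict(tensors)
    for key, spec in placeholders.items():
        kind = spec["kind"]
        if kind == "none":
            state[key] = None
        elif kind in ("int", "int_list"):
            state[key] = spec["value"]
        elif kind == "empty_array":
            state[key] = Arr(tuple(spec["shape"]), spec["dtype"])
        else:
            raise ValueError(f"unknown placeholder kind: {kind}")
    return state
```

Because the placeholders live only in the JSON sidecar, a state with no such leaves produces an empty placeholder dict and a tensor file that a stock loader can read unchanged.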

Loading reconstructs cache classes through an explicit registry, which also covers classes load_prompt_cache cannot build (DeepseekV4Cache needs its sliding-window constructor argument, recovered from meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4, SSM) are skipped, since disk slots carry no snapshots to restore from.

Testing

Nine unit tests, no model weights needed:

  • round-trips for KVCache, CacheList with zero-width arrays, DeepseekV4Cache with mixed branch state, and ArraysCache with None entries
  • mlx-lm format compatibility in both directions (stock-written slots load here; slots written here load with stock load_prompt_cache)
  • the non-trimmable partial-prefix guard, prefix capping, and the opt-in default

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 (2-node), and DeepSeek-V4-Flash (2-node).

Known limitation

In multi-node instances each rank persists and matches its own shard's slots independently. Identical request streams produce identical slots, so ranks agree in practice; after asymmetric eviction or slot corruption a divergent decision is possible. Gathering the slot decision across ranks is a possible follow-up.


The trim_cache bug this work uncovered (V4 entries not restoring from snapshots) is fixed separately in #2158; this PR is independent of it and mergeable in either order.

aidiffuser and others added 5 commits June 10, 2026 10:24
Persist the prefix cache's hot slot to disk and restore it across
conversation switches and runner restarts, so returning to an earlier
long conversation reuses its KV cache instead of re-prefilling.
Enable with EXO_KV_DISK_PERSISTENCE=1 (off by default).

Mechanics:
- slots live under EXO_CACHE_HOME/kv-cache/<sha256(model_id)[:16]>/
  as slot_N_cache.safetensors + slot_N_tokens.safetensors +
  slot_N_meta.json, written atomically (tmp + rename; tokens last so a
  slot only becomes discoverable once complete)
- the hot slot is flushed on conversation switch, when the runner's
  generation queue drains, and on shutdown
- on a cache miss, slots are matched by longest token prefix
  (>= 1000 tokens) and the best one is swapped in; the reusable prefix
  is capped at the cache's stored length
- stale slots are evicted by TTL (EXO_KV_DISK_TTL_HOURS, default 24)
  and a total size cap (EXO_KV_DISK_MAX_SIZE_GB, default 500)
- any load problem logs a warning and falls back to recompute
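The slot-matching rule in the list above can be sketched as below. This is an assumption-laden illustration rather than the PR's implementation: `best_slot` and `common_prefix_len` are hypothetical names, slots are modelled as an in-memory dict, and applying the minimum-length threshold after capping at the stored cache length is a guess at the ordering.

```python
from typing import Optional, Sequence

MIN_PREFIX_TOKENS = 1000  # threshold quoted in the description


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_slot(
    prompt: Sequence[int],
    slots: dict[int, tuple[Sequence[int], int]],
    min_prefix: int = MIN_PREFIX_TOKENS,
) -> Optional[tuple[int, int]]:
    """Pick the slot with the longest reusable prefix.

    `slots` maps a slot id to (stored tokens, stored cache length).
    Returns (slot_id, reusable_length), or None when no slot qualifies.
    """
    best: Optional[tuple[int, int]] = None
    for slot_id, (tokens, cache_len) in slots.items():
        # The reusable prefix can never exceed what the cache actually holds.
        hit = min(common_prefix_len(prompt, tokens), cache_len)
        if hit >= min_prefix and (best is None or hit > best[1]):
            best = (slot_id, hit)
    return best
```

With `min_prefix=3`, a prompt `[1, 2, 3, 4, 5]` against slots `{0: ([1, 2, 3, 9], 4), 1: ([1, 2, 3, 4, 5, 6], 4)}` selects slot 1 with a reusable length of 4: five tokens match, but the stored cache only covers four.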

Cache state is not always a pure array tree: the DSA indexer cache in
DeepSeek-V3.2/GLM-derived models stores zero-width value arrays, and
DeepseekV4Cache branch state holds Nones, ints and int-lists — none of
which safetensors can represent. Such leaves are recorded as JSON
placeholder specs in slot_N_meta.json and re-inserted on load, keeping
the .safetensors itself compatible with stock mlx-lm save_prompt_cache
for standard caches. Loading reconstructs cache classes through an
explicit registry, which also covers classes load_prompt_cache cannot
build (DeepseekV4Cache needs its sliding window, recovered from
meta_state). Partial-prefix hits on non-trimmable caches (DeepseekV4,
SSM) are skipped since disk slots carry no snapshots.

Validated live on Llama 3.2 3B, Kimi K2.6 (2-node), GLM-5.1 and
DeepSeek-V4-Flash (2-node).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per-flush eviction only sees its own model's dir and only runs while
that model is loaded and producing new cache writes — slots of models
never loaded again lived forever. Sweep every model dir under the
kv-cache root once at KVPrefixCache init, by each slot's meta.json
timestamp; the loaded model's own dir is left to the hot-slot-aware
eviction. Also removes legacy *_tokens.npy files belonging to swept slots.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
EXO_KV_DISK_MAX_SIZE_GB was enforced per model dir, so N models could
hold N x cap on disk while the docs read as a total. Enforce it over
the whole kv-cache root instead: while the total exceeds the cap,
evict the globally oldest slot regardless of owning model (cross-model
LRU — a daily-driver model's fresh slots survive, abandoned models'
slots go first). The flushing model's own hot slot stays excluded.
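A minimal sketch of the eviction policy this commit describes, combined with the TTL rule: expired slots go first, then the globally oldest slots are evicted until the total fits the cap. `SlotInfo` and `plan_eviction` are hypothetical names, the function only plans victims rather than deleting files, and exempting the hot slot from the TTL pass as well as the size pass is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotInfo:
    model_dir: str    # hashed model directory name
    slot_id: int
    size_bytes: int
    timestamp: float  # assumed to come from the slot's meta.json


def plan_eviction(
    slots: list[SlotInfo],
    max_bytes: int,
    ttl_seconds: float,
    now: float,
    hot: Optional[tuple[str, int]] = None,
) -> list[SlotInfo]:
    """Return the slots to delete, in deletion order."""
    victims: list[SlotInfo] = []
    survivors: list[SlotInfo] = []
    for s in slots:
        is_hot = hot == (s.model_dir, s.slot_id)
        if not is_hot and now - s.timestamp > ttl_seconds:
            victims.append(s)  # TTL pass, across all model dirs
        else:
            survivors.append(s)

    total = sum(s.size_bytes for s in survivors)
    # Size-cap pass: oldest first, regardless of owning model.
    for s in sorted(survivors, key=lambda s: s.timestamp):
        if total <= max_bytes:
            break
        if hot == (s.model_dir, s.slot_id):
            continue  # the flushing model's hot slot is never evicted
        victims.append(s)
        total -= s.size_bytes
    return victims
```

Sorting the survivors of every model into one list is what makes the cap global: an abandoned model's old slots sort ahead of a frequently used model's fresh ones and are removed first.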

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… at init

A single model kept loaded for weeks never re-runs the init janitor,
so other models' stale slots only fell to the global size cap. Run the
cross-model TTL sweep from per-flush eviction as well.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sk slot

The update-vs-add decision after generation used
prefix_hit_length / len(new_prompt) >= 0.5, which classifies a SIBLING
conversation that merely shares a long prefix (two agent sessions with
a common system prompt + bootstrap, observed: 31.9k shared tokens) as
a continuation. The sibling then updated the matched entry in place,
inherited its disk slot id, and the next flush overwrote the other
conversation's slot — every session switch destroyed the other
session's cache back to the shared prefix (observed live as two
sessions ping-ponging slot_10).

Replace the ratio test with an extension test on KVPrefixCache
(should_update_entry): update in place only when the hit covers the
stored prompt up to the prefill-rollback slack; otherwise add a new
entry, which flushes the previous hot slot and takes a fresh slot id.
Applied to all three save sites (batch, sequential, disaggregated
prefill server).
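The difference between the old ratio test and the extension test can be sketched as follows. The name `should_update_entry` comes from the commit message, but it is written here as a free function with an assumed signature; the real method on KVPrefixCache, and the actual value of the prefill-rollback slack, are not shown in this PR text.

```python
def ratio_test(prefix_hit_length: int, new_prompt_length: int) -> bool:
    """Old behaviour: update in place when the hit covers half the new prompt."""
    return prefix_hit_length / new_prompt_length >= 0.5


def should_update_entry(
    prefix_hit_length: int,
    stored_prompt_length: int,
    rollback_slack: int = 0,  # hypothetical parameter for the rollback slack
) -> bool:
    """Extension test: the new prompt must extend the stored one.

    Update in place only when the hit covers the stored prompt, allowing
    for the few tokens that prefill rollback may discard.
    """
    return prefix_hit_length >= stored_prompt_length - rollback_slack


# Sibling session: shares a 31.9k-token prefix with a stored 40k-token prompt.
sibling_hit, stored_len, new_len = 31_900, 40_000, 45_000
old_decision = ratio_test(sibling_hit, new_len)              # treated as a continuation
new_decision = should_update_entry(sibling_hit, stored_len)  # treated as a new entry
```

The stored and new prompt lengths in the example are made up; only the 31.9k shared prefix is taken from the commit message. Under the ratio test the sibling overwrites the matched entry, while the extension test sends it down the add path, where it receives a fresh slot id.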

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>