fix(metal): back zero-element KV cache buffers with a shared placeholder #2206
Open
sergey-scherbina wants to merge 1 commit into
Hybrid models (GDN linear-attention + sparse full-attention) produce a 0-element KV cache tensor for no-KV layers. Metal rejects newBufferWithLength:0, so loading Qwen3.6-35B-A3B failed with 'Failed to create metal resource: Buffer'. Route the k/v allocations through a closure that hands all such layers a clone of one lazily-created 1-element placeholder (never read); the tensor shape stays 0-dim.
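The routing described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `Buffer`, `new_buffer`, and `allocate_layers` are hypothetical stand-ins for the real Metal and cache-engine types, and the allocator merely imitates Metal's refusal of zero-length requests.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a device buffer; not the real Metal or candle type.
#[derive(Debug)]
struct Buffer {
    len_bytes: usize,
}

/// Hypothetical allocator that, like Metal, refuses zero-length requests.
fn new_buffer(len_bytes: usize) -> Result<Arc<Buffer>, String> {
    if len_bytes == 0 {
        return Err("Failed to create metal resource: Buffer".to_string());
    }
    Ok(Arc::new(Buffer { len_bytes }))
}

/// Allocate one backing buffer per layer. Every layer with zero KV elements
/// receives a clone of a single 1-element placeholder, created on first use.
/// Assumes `elem_size` is non-zero.
fn allocate_layers(
    elem_counts: &[usize],
    elem_size: usize,
) -> Result<Vec<Arc<Buffer>>, String> {
    let mut placeholder: Option<Arc<Buffer>> = None;
    let mut alloc = |elem_count: usize| -> Result<Arc<Buffer>, String> {
        if elem_count > 0 {
            // Normal KV layer: allocate exactly what it needs.
            return new_buffer(elem_count * elem_size);
        }
        if let Some(buf) = &placeholder {
            // No-KV layer, placeholder already exists: share it.
            return Ok(Arc::clone(buf));
        }
        // First no-KV layer: create the 1-element placeholder lazily.
        let buf = new_buffer(elem_size)?;
        placeholder = Some(Arc::clone(&buf));
        Ok(buf)
    };
    elem_counts.iter().map(|&n| alloc(n)).collect()
}
```

In this sketch a model with no zero-element layers never creates the placeholder, and any number of no-KV layers cost one extra element in total.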
Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Language               Files     Lines      Code  Comments    Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
C Header                  23      4454      3116       790       548
CSS                        3       281       252         5        24
CUDA                     119     23575     19136      1696      2743
Dockerfile                 1        38        21         8         9
HTML                       2        27        27         0         0
JavaScript                 3       392       387         2         3
Jinja2                     7       694       656         5        33
JSON                      26      9360      9357         0         3
Makefile                   1         6         5         0         1
MDX                        1       149         0       133        16
Metal Shading Lan|        37     14287     11284      1136      1867
PowerShell                 1       357       276        33        48
Python                   131     10342      8515       460      1367
Shell                      2       549       379       101        69
Plain Text                 3      3723         0      2413      1310
TOML                      29      1388      1211        41       136
TypeScript                11      1607      1371        66       170
YAML                       3        25        23         2         0
────────────────────────────────────────────────────────────────────
Jupyter Notebooks          3       122        83        23        16
|- Markdown                1        60        30        22         8
|- Python                  1       122       113         1         8
(Total)                            304       226        46        32
────────────────────────────────────────────────────────────────────
Markdown                 129      9703         0      6648      3055
|- BASH                   61       600       520        47        33
|- Dockerfile              2         5         5         0         0
|- JSON                   18       700       700         0         0
|- PowerShell              3         5         5         0         0
|- Python                 25       830       722         5       103
|- Rust                   15       437       382         1        54
|- TOML                   10       124        98         3        23
|- YAML                    1        13        13         0         0
(Total)                          12417      2445      6704      3268
────────────────────────────────────────────────────────────────────
Rust                     625    270388    239956      5864     24568
|- Markdown              397      9504       452      7882      1170
(Total)                         279892    240408     13746     25738
────────────────────────────────────────────────────────────────────
Svelte                    18      1831      1696        50        85
|- CSS                     1         4         4         0         0
|- JavaScript             18       876       727        24       125
(Total)                           2711      2427        74       210
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total                   1178    366578    301522     27461     37595
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
What
PagedAttention allocates a per-layer KV cache block buffer sized to the layer's
KV. Hybrid models have layers with no KV (the linear-attention / GatedDeltaNet
layers carry a recurrent state instead), so that size is 0. On Metal,
new_private_buffer(0) produces a zero-length buffer that later indexing treats
as invalid. Back those zero-element buffers with a shared 1-element placeholder
(elem_count.max(1)) so the no-KV layers allocate something valid and are simply
never read as KV.
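As a rough illustration of the sizing rule above, the clamp only changes the zero case. The helper names `backing_elem_count` and `backing_len_bytes` are hypothetical and do not come from cache_engine.rs; only `elem_count.max(1)` is taken from the description.

```rust
/// Number of elements to request for a layer's KV block buffer.
/// A layer with no KV (elem_count == 0) is clamped up to one element;
/// any layer that already has KV is left unchanged.
fn backing_elem_count(elem_count: usize) -> usize {
    elem_count.max(1)
}

/// Byte length passed to the (hypothetical) allocator for that buffer.
fn backing_len_bytes(elem_count: usize, elem_size: usize) -> usize {
    backing_elem_count(elem_count) * elem_size
}
```

Only the backing allocation grows here; the logical element count of the no-KV layer is still 0, which is why the placeholder is never read as KV.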
Why
Without this, any hybrid model (e.g. Qwen3.6 qwen3_5_moe) crashes on Metal as
soon as the cache engine sets up the no-KV layers. The change is a harmless
general hardening for the dense path (a layer with KV is unaffected; max(1) is
a no-op there).
Scope
mistralrs-core/src/paged_attention/cache_engine.rs, +22/-5. Self-contained.
This is a prerequisite for #2201 (Qwen3.6 on Metal). It is split out as its own small
PR for reviewability; suggested merge order: this + the engine-reap fix, then #2201,
then the chunked-prefill PR.
Part of splitting the Qwen3.6 work into focused, reviewable PRs:
Suggested merge order: #2206 + #2207 -> #2201 -> #2208.