fix(metal): back zero-element KV cache buffers with a shared placeholder by sergey-scherbina · Pull Request #2206 · EricLBuehler/mistral.rs

sergey-scherbina · 2026-06-11T06:38:44Z

What

PagedAttention allocates a per-layer KV cache block buffer sized to the layer's
KV. Hybrid models have layers with no KV (the linear-attention / GatedDeltaNet
layers carry a recurrent state instead), so that size is 0. On Metal,
new_private_buffer(0) produces a zero-length buffer that later indexing treats as
invalid. Back those zero-element buffers with a shared 1-element placeholder
(elem_count.max(1)) so the no-KV layers allocate something valid and are simply
never read as KV.
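The sizing rule above can be sketched in isolation. This is a minimal illustration, not the code in cache_engine.rs: `backing_elem_count` is a hypothetical helper name, and only the `elem_count.max(1)` expression comes from the description.

```rust
/// Hypothetical helper (the name is an assumption, not from the PR):
/// number of elements to actually allocate for one layer's KV block buffer.
/// A no-KV layer reports 0 elements, so we round up to 1 to avoid asking
/// the device for a zero-length buffer. Any layer with real KV is unchanged.
fn backing_elem_count(elem_count: usize) -> usize {
    elem_count.max(1)
}
```

For a dense layer the call returns its input unchanged, which is why the change should not affect models without no-KV layers.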

Why

Without this, any hybrid model (e.g. Qwen3.6 qwen3_5_moe) crashes on Metal as soon
as the cache engine sets up the no-KV layers. The change is a harmless general
hardening for the dense path (a layer with KV is unaffected; max(1) is a no-op
there).

Scope

mistralrs-core/src/paged_attention/cache_engine.rs, +22/-5. Self-contained.

This is a prerequisite for #2201 (Qwen3.6 on Metal). It is split out as its own small
PR for reviewability; suggested merge order: this + the engine-reap fix, then #2201,
then the chunked-prefill PR.

Part of splitting the Qwen3.6 work into focused, reviewable PRs:

#2206 "fix(metal): back zero-element KV cache buffers with a shared placeholder" - zero-element KV cache buffer (Metal prerequisite for #2201)
#2207 "fix(engine): drop disconnected sequences before the prefill pass" - reap disconnected sequences before prefill (independent)
#2201 "Fix Qwen3.6 (qwen3_5 / qwen3_5_moe) on Metal: RMSNorm, AFQ, lm_head, hybrid KV cache" - Qwen3.6 (qwen3_5 / qwen3_5_moe) model support
#2208 "feat(metal): paged chunked prefill, env-tunable with a safe chunk-size floor" - Metal paged chunked prefill

Suggested merge order: #2206 + #2207 -> #2201 -> #2208.

Hybrid models (GDN linear-attention + sparse full-attention) produce a 0-element KV cache tensor for no-KV layers. Metal rejects newBufferWithLength:0, so loading Qwen3.6-35B-A3B failed with 'Failed to create metal resource: Buffer'. Route the k/v allocations through a closure that hands all such layers a clone of one lazily-created 1-element placeholder (never read); the tensor shape stays 0-dim.
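The closure-based routing described in the commit message might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual patch: `Buffer` and `LayerCache` are stand-in types (not the Metal or candle types), `allocate_layer_caches` is a hypothetical function name, and `Rc` is used here only to make the sharing observable.

```rust
use std::rc::Rc;

/// Stand-in for a device buffer (assumption: not the real Metal buffer type).
#[derive(Debug)]
struct Buffer {
    len: usize,
}

/// Hypothetical per-layer cache entry: the logical element count stays 0 for
/// no-KV layers even though the backing buffer holds one element.
#[derive(Debug)]
struct LayerCache {
    elem_count: usize,
    backing: Rc<Buffer>,
}

/// Allocates one cache entry per layer. All zero-element layers receive a
/// clone of the same placeholder, which is created on first use only.
fn allocate_layer_caches(elem_counts: &[usize]) -> Vec<LayerCache> {
    let mut placeholder: Option<Rc<Buffer>> = None;

    // Every allocation goes through this closure, so the zero-size case is
    // handled in exactly one place.
    let mut alloc = |elem_count: usize| -> Rc<Buffer> {
        if elem_count == 0 {
            let shared = placeholder.get_or_insert_with(|| Rc::new(Buffer { len: 1 }));
            Rc::clone(shared)
        } else {
            Rc::new(Buffer { len: elem_count })
        }
    };

    elem_counts
        .iter()
        .map(|&elem_count| LayerCache {
            elem_count,
            backing: alloc(elem_count),
        })
        .collect()
}
```

Sharing one placeholder keeps the extra memory constant no matter how many no-KV layers a hybrid model has, and a purely dense model never creates the placeholder at all.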

github-actions · 2026-06-11T06:39:56Z

Code Metrics Report

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                 23         4454         3116          790          548
 CSS                       3          281          252            5           24
 CUDA                    119        23575        19136         1696         2743
 Dockerfile                1           38           21            8            9
 HTML                      2           27           27            0            0
 JavaScript                3          392          387            2            3
 Jinja2                    7          694          656            5           33
 JSON                     26         9360         9357            0            3
 Makefile                  1            6            5            0            1
 MDX                       1          149            0          133           16
 Metal Shading Lan|       37        14287        11284         1136         1867
 PowerShell                1          357          276           33           48
 Python                  131        10342         8515          460         1367
 Shell                     2          549          379          101           69
 Plain Text                3         3723            0         2413         1310
 TOML                     29         1388         1211           41          136
 TypeScript               11         1607         1371           66          170
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                129         9703            0         6648         3055
 |- BASH                  61          600          520           47           33
 |- Dockerfile             2            5            5            0            0
 |- JSON                  18          700          700            0            0
 |- PowerShell             3            5            5            0            0
 |- Python                25          830          722            5          103
 |- Rust                  15          437          382            1           54
 |- TOML                  10          124           98            3           23
 |- YAML                   1           13           13            0            0
 (Total)                            12417         2445         6704         3268
─────────────────────────────────────────────────────────────────────────────────
 Rust                    625       270388       239956         5864        24568
 |- Markdown             397         9504          452         7882         1170
 (Total)                           279892       240408        13746        25738
─────────────────────────────────────────────────────────────────────────────────
 Svelte                   18         1831         1696           50           85
 |- CSS                    1            4            4            0            0
 |- JavaScript            18          876          727           24          125
 (Total)                             2711         2427           74          210
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                  1178       366578       301522        27461        37595
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

This was referenced Jun 11, 2026

fix(engine): drop disconnected sequences before the prefill pass #2207

Open

feat(metal): paged chunked prefill, env-tunable with a safe chunk-size floor #2208

Open

Fix Qwen3.6 (qwen3_5 / qwen3_5_moe) on Metal: RMSNorm, AFQ, lm_head, hybrid KV cache #2201

Open

sergey-scherbina wants to merge 1 commit into EricLBuehler:master from sergey-scherbina:metal-zero-kv-buffer