fix(metal): GDN bfloat16, PA scheduler, error handling, MLX SDPA fixes by emanueleDiVizio · Pull Request #2047 · EricLBuehler/mistral.rs

emanueleDiVizio · 2026-04-02T17:15:59Z

Summary

This PR fixes multiple correctness, performance, and stability issues encountered while running mistral.rs on Apple Silicon (M-series) with real multi-user inference workloads (Qwen3.5 MoE + Mixtral).

The changes focus on:

Metal backend correctness (GDN + KV cache)
Scheduler behaviour under load (PagedAttention)
Robustness in concurrent serving scenarios
MLX integration improvements for attention kernels

Several of these issues only surface under concurrent decode or long-running sessions.

Key changes

Scheduler (from upstream PRs #2031/#2034)

Fix O(N²) thrashing in PagedAttention scheduler under mixed waiting/active workloads
Introduce FCFS priority ordering to prevent starvation
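The scheduler code itself is not shown in this description, so the following is only a minimal sketch of what FCFS admission can look like. `Seq`, `admit_fcfs`, and the block-budget model are hypothetical names for illustration, not mistral.rs types. The queue is ordered once per scheduling step and admission stops at the first request that does not fit, so a later, smaller request cannot overtake an older one.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a queued request; not a mistral.rs type.
#[derive(Debug, Clone, PartialEq)]
struct Seq {
    id: u64,
    /// Monotonically increasing arrival tick (smaller means older).
    arrival: u64,
    /// KV blocks the sequence needs before it can run.
    blocks_needed: usize,
}

/// Admit waiting sequences in arrival order until the block budget runs out.
///
/// The queue is sorted once per call (O(N log N)) and then drained from the
/// front, so no sequence is re-examined within a single scheduling step.
fn admit_fcfs(waiting: &mut VecDeque<Seq>, mut free_blocks: usize) -> Vec<Seq> {
    waiting.make_contiguous().sort_by_key(|s| s.arrival);

    let mut admitted = Vec::new();
    while let Some(front) = waiting.front() {
        if front.blocks_needed > free_blocks {
            // Strict FCFS: do not skip ahead to a smaller, newer request.
            break;
        }
        free_blocks -= front.blocks_needed;
        if let Some(seq) = waiting.pop_front() {
            admitted.push(seq);
        }
    }
    admitted
}
```

Stopping at the head of the queue trades some utilisation for a starvation guarantee: the oldest request is always the next one to be admitted once enough blocks are free.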

GDN / Metal

Fix dtype mismatch (bfloat vs bfloat16_t) in Metal kernels
Add per-sequence fallback for concurrent decode when recurrent offsets diverge
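As an illustration of the per-sequence fallback, here is a sketch of the decision it implies: run one batched recurrent step when every sequence sits at the same offset, otherwise fall back to stepping each sequence on its own. `DecodePlan` and `plan_decode` are hypothetical names, and the real kernels and state layout are not modelled here.

```rust
/// How to run one recurrent decode step for a batch (hypothetical type).
#[derive(Debug, PartialEq)]
enum DecodePlan {
    /// All sequences share one recurrent offset: a single batched step is valid.
    Batched { offset: usize },
    /// Offsets differ: step each sequence separately at its own offset.
    PerSequence(Vec<usize>),
}

/// Choose a plan from the per-sequence recurrent offsets.
/// Returns `None` for an empty batch.
fn plan_decode(offsets: &[usize]) -> Option<DecodePlan> {
    let (&first, rest) = offsets.split_first()?;
    if rest.iter().all(|&o| o == first) {
        Some(DecodePlan::Batched { offset: first })
    } else {
        Some(DecodePlan::PerSequence(offsets.to_vec()))
    }
}
```

In concurrent serving, requests join the batch at different times, so diverging offsets are the common case rather than an edge case; the batched path is an optimisation for when they happen to line up.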

Stability

Replace panic on client disconnect with error handling
Return error instead of panic on block allocation failure (race condition)
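Both stability changes follow the same pattern: a condition that used to panic becomes an `Err` the caller can handle. The sketch below shows that pattern using only the standard library. `ServeError`, `send_response`, and `blocks_for` are hypothetical names, and the standard-library channel stands in for whatever channel type the server actually uses.

```rust
use std::collections::HashMap;
use std::fmt;
use std::sync::mpsc::Sender;

/// Hypothetical error type covering the two failure modes described above.
#[derive(Debug, PartialEq)]
enum ServeError {
    /// The receiving side of the response channel was dropped.
    ClientDisconnected,
    /// No blocks are recorded for this sequence (for example after a race).
    MissingBlock { seq_id: u64 },
}

impl fmt::Display for ServeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ServeError::ClientDisconnected => write!(f, "client disconnected"),
            ServeError::MissingBlock { seq_id } => {
                write!(f, "no block allocation for sequence {seq_id}")
            }
        }
    }
}

impl std::error::Error for ServeError {}

/// Send a response without unwrapping: a dropped receiver becomes an error.
fn send_response(tx: &Sender<String>, msg: String) -> Result<(), ServeError> {
    tx.send(msg).map_err(|_| ServeError::ClientDisconnected)
}

/// Look up a sequence's blocks without indexing or unwrapping.
fn blocks_for(table: &HashMap<u64, Vec<usize>>, seq_id: u64) -> Result<&[usize], ServeError> {
    table
        .get(&seq_id)
        .map(Vec::as_slice)
        .ok_or(ServeError::MissingBlock { seq_id })
}
```

With errors surfaced this way, one disconnected client or one lost allocation affects only its own request instead of taking down the engine thread that serves every other user.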

Performance / Features

Increase Metal KV cache default max_seq_len (4K → 16K)
Add optional MLX SDPA backend with Metal flash attention (head_dim=256 support)
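For a rough sense of what the larger default can cost in memory, the usual back-of-envelope formula for a dense KV cache is sketched below. The model dimensions in the usage note are made up for illustration and do not describe Qwen3.5 or Mixtral. Whether the Metal cache actually reserves memory for the full `max_seq_len` up front is also an assumption, not something this description states.

```rust
/// Bytes needed to hold keys and values for one sequence at full length.
/// The leading factor of 2 accounts for storing both K and V.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    max_seq_len: u64,
    dtype_bytes: u64,
) -> u64 {
    2 * layers * kv_heads * head_dim * max_seq_len * dtype_bytes
}
```

With hypothetical dimensions of 32 layers, 8 KV heads, a head dimension of 128, and a 2-byte dtype, `kv_cache_bytes(32, 8, 128, 4096, 2)` is 512 MiB and `kv_cache_bytes(32, 8, 128, 16384, 2)` is 2 GiB. The cost scales linearly with `max_seq_len`, so the 4K to 16K change quadruples the per-sequence upper bound.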

Test plan

Validated on Apple Silicon (M-series)
Tested with Qwen3.5 MoE (GDN) and Mixtral
Scheduler fixes verified under concurrent request load

The GDN (Gated Delta Net) kernels in mistralrs-core/src/metal/kernels/gdn.metal instantiate templates with `bfloat16_t`, the ggml/llama.cpp convention inherited from the CUDA port. Apple Metal's stdlib exposes the IEEE bfloat16 type as `bfloat` (Metal 3.1+, macOS 14+), with `bfloat16` also present as a forward-only struct (`__Reserved_Name__Do_not_use_bfloat16`) in the extended_vector private header. Neither name is `bfloat16_t`.

The precompiled metallib path works because it's built against build-time headers that may resolve `bfloat16_t` differently. But the runtime-compiled path (used when MISTRALRS_METAL_PRECOMPILE=0, the only working mode on Apple Silicon GPUs whose precompiled-metallib function variants don't match the current device) bails with:

    error: unknown type name 'bfloat16_t'; did you mean 'bfloat16'?
    instantiate_conv1d_update(bfloat16_t);

This blocks Qwen 3.5 / 3.6 inference on Metal under precompile=0, since those models hit the GDN path for their hybrid (FullAttention + LinearAttention) layers. Gemma 4 and dense-only models are unaffected.

Fix: typedef `bfloat16_t` to the public `bfloat`. Aliasing to `bfloat16` also resolves the original parse error but then explodes at template instantiation with "incomplete type" errors because the extended_vector declaration is forward-only.

Tested end-to-end with `Qwen/Qwen3.6-27B` + AFQ4 ISQ on Apple Silicon / macOS 26: GDN kernels compile cleanly and inference produces tokens.

Note: this fix overlaps with PR EricLBuehler#2047 (fix/metal: GDN bfloat16, PA scheduler, error handling, MLX SDPA fixes), which has been open since 2026-04-02 and contains the same typedef plus several unrelated improvements. Either this fix or that PR (whichever lands first) closes the precompile=0 GDN gap; happy to drop this commit if EricLBuehler#2047 is prioritized.

emanueleDiVizio added 9 commits April 2, 2026 19:08

fix(metal): use bfloat instead of bfloat16_t in GDN Metal kernels

642243d

fix(metal): add include guard to float8.metal for PagedAttention

8d03002

feat(metal): increase default max_seq_len from 4K to 16K for Metal KV cache

ad3cb23

fix(paged_attention): fix O(N^2) thrashing + FCFS priority in PA scheduler

5e7dad2

Reapply upstream fixes from PRs EricLBuehler#2031/EricLBuehler#2034: fix quadratic scheduling complexity when sequences are waiting, and add FCFS priority ordering to prevent starvation.

fix: don't panic when sending error response to disconnected client

27e79fa

fix: GDN concurrent decode per-sequence fallback when recurrent offsets diverge

a3c2db0

fix(paged_attention): return error instead of panic on missing block allocation

4e3e9e4

feat(metal): add MLX SDPA backend with steel flash attention for Metal prefill

f5098ed

Add an optional MLX SDPA backend using steel flash attention kernels for Metal prefill. Enable head_dim=256 support for models like Qwen3.5 that use larger attention head dimensions.

feat: derive Clone for Model (wraps Arc, cheap clone)

af3e7f0

emanueleDiVizio force-pushed the fix/metal-fixes branch from 3bbe0a9 to af3e7f0 Compare April 2, 2026 17:35

ljchang mentioned this pull request May 25, 2026

fix(metal): register PR #2166 kernels in runtime-compile path #2169

Merged

emanueleDiVizio wants to merge 9 commits into EricLBuehler:master from emanueleDiVizio:fix/metal-fixes
