fix(quant): avoid F8E4M3 compute dtype in FP8 dequantization #2096

Closed
glaziermag wants to merge 9 commits into EricLBuehler:master from glaziermag:fix-fp8-unquantlinear-mismatch

@glaziermag glaziermag commented Apr 10, 2026

Contributor

Fix: FP8 Unquant Linear Mismatch Crash & Dequantization Truncation

This PR fixes two fatal FP8 runtime bugs:

  1. UnquantLinear fallback crash: if an FP8 weight tensor reaches the unquantized path without a corresponding FP8 scale tensor, the old path attempted an unsupported cast and failed with "unexpected dtype, expected: BF16, got: F8E4M3". The patched path now rejects this malformed state with an explicit error instead of continuing into the dtype mismatch.

  2. Dequantization truncation: BlockwiseFP8Linear, VectorFP8Linear, and distributed MoE expert paths used vb.dtype() as the dequant target. For FP8 models, vb.dtype() can be F8E4M3 (the storage type), so expanded parameters could be coerced back to 8-bit floats. The patch centralizes FP8 dequant dtype selection and keeps dequantized tensors in a compute dtype.
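
To illustrate why re-truncating a dequantized weight is lossy, here is a minimal sketch that simulates rounding an f32 to the nearest F8E4M3-representable value. It is not code from this PR: `round_to_e4m3` is a hypothetical helper, it assumes the common E4M3 layout (3 mantissa bits, minimum normal exponent -6, saturation at 448), and it does not model NaN encodings or round-to-nearest-even on exact ties.

```rust
/// Simulate storing `x` in F8E4M3 and reading it back as f32.
/// Hypothetical helper for illustration only; simplified rounding rules.
pub fn round_to_e4m3(x: f32) -> f32 {
    if x == 0.0 || x.is_nan() {
        return x;
    }
    // Saturate the magnitude at the largest finite E4M3 value.
    let a = x.abs().min(448.0);
    // Exponent of the binade; values below 2^-6 share the subnormal spacing.
    let exp = (a.log2().floor() as i32).max(-6);
    // With 3 mantissa bits, each binade is split into 8 steps.
    let step = 2f32.powi(exp - 3);
    ((a / step).round() * step).copysign(x)
}
```

Under these assumptions, a stored value of 0.3125 with an illustrative scale of 0.013 dequantizes to 0.0040625, but coercing that result back to F8E4M3 yields 0.00390625, an error of roughly 3.8% on a single weight.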

Changes

Centralized helper (lib.rs)

  • fp8_dequant_dtype(vb_dtype, bias): determines the output dtype for FP8 dequantization. The invariant is: never dequantize FP8 back into F8E4M3.
  • Added durable unit coverage for the invariant on branch head 698933761.
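
The invariant above can be sketched as follows. This is an assumption-laden illustration, not the helper from the branch: the local `DType` enum stands in for the real tensor dtype type, the second parameter is taken to be the bias dtype (the actual helper may accept a tensor), and the fallback order (bias dtype first, then BF16) is a guess.

```rust
/// Stand-in for the tensor dtype enum used by the real code (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DType {
    F8E4M3,
    BF16,
    F16,
    F32,
}

/// Choose the dtype that dequantized FP8 weights are expanded into.
/// Invariant: the result is never `F8E4M3`.
pub fn fp8_dequant_dtype(vb_dtype: DType, bias_dtype: Option<DType>) -> DType {
    match (vb_dtype, bias_dtype) {
        // Loader reports the FP8 storage dtype: borrow a usable bias dtype.
        (DType::F8E4M3, Some(b)) if b != DType::F8E4M3 => b,
        // No usable bias dtype: fall back to BF16 (assumed default).
        (DType::F8E4M3, _) => DType::BF16,
        // Loader dtype is already a compute dtype: keep it.
        (other, _) => other,
    }
}
```

A unit test for the invariant would then only need to check that no input combination returns `DType::F8E4M3`.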

Scale name resolution (distributed/layers.rs)

  • fp8_scale_name(vb, base): resolves weight_scale_inv vs weight_scale dynamically and errors if neither exists.
  • Applied to distributed MoE paths: PackedExperts and FusedExperts, stacked and per-expert forms, covering gate_proj, up_proj, down_proj, and gate_up_proj scale lookups.
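
A minimal sketch of the lookup described above, under stated assumptions: the real helper takes a var builder, whereas this version takes a plain set of tensor names; the `base.suffix` naming and the preference for `weight_scale_inv` over `weight_scale` are guesses for illustration.

```rust
use std::collections::HashSet;

/// Resolve the FP8 scale tensor name for `base`, or report that none exists.
/// `available` is a hypothetical stand-in for querying the var builder.
pub fn fp8_scale_name(available: &HashSet<String>, base: &str) -> Result<String, String> {
    // Try both naming conventions, in an assumed order of preference.
    for suffix in ["weight_scale_inv", "weight_scale"] {
        let candidate = format!("{base}.{suffix}");
        if available.contains(&candidate) {
            return Ok(candidate);
        }
    }
    Err(format!(
        "no FP8 scale tensor for `{base}`: tried `weight_scale_inv` and `weight_scale`"
    ))
}
```

Returning an error when neither name exists keeps a missing scale from silently turning into an unscaled load.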

Dequant dtype fixes

  • blockwise_fp8/mod.rs: uses fp8_dequant_dtype instead of raw vb.dtype().
  • vector_fp8/mod.rs: uses fp8_dequant_dtype instead of raw vb.dtype().
  • distributed/layers.rs: blockwise_fp8_moe() calls use a compute dtype rather than FP8 storage dtype.

Missing-scale fallback (unquantized/mod.rs)

  • UnquantLinear now explicitly errors on FP8 weights without scale tensors. This is intentionally a clean rejection, not an attempted dequantization path, because the scale tensor needed to recover the values is absent.
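
The rejection can be pictured with a guard like the one below. This is a sketch only: `ensure_not_unscaled_fp8`, the local `DType` enum, and the error wording are hypothetical and do not reproduce the code in unquantized/mod.rs.

```rust
/// Stand-in for the tensor dtype enum used by the real code (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DType {
    F8E4M3,
    BF16,
    F16,
    F32,
}

/// Guard for the unquantized path: FP8 weights arriving here have no scale
/// tensor, so their values cannot be recovered and the load must fail.
pub fn ensure_not_unscaled_fp8(layer: &str, weight_dtype: DType) -> Result<(), String> {
    if weight_dtype == DType::F8E4M3 {
        return Err(format!(
            "layer `{layer}` has F8E4M3 weights but no FP8 scale tensor; refusing to load it as unquantized"
        ));
    }
    Ok(())
}
```

Failing at load time surfaces the malformed checkpoint with a descriptive message rather than a dtype mismatch during the first forward pass.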

Empirical Execution Baseline (GCP g2-standard-32 L4 GPU)

Validated on a representative dynamic FP8 model (CalamitousFelicitousness/Qwen2.5-1.5B-Instruct-fp8-dynamic).

Before (Master Branch)

echo "What is 2+2? Only answer with the number 4." | ./target/release/mistralrs run text -m CalamitousFelicitousness/Qwen2.5-1.5B-Instruct-fp8-dynamic -a qwen2 --max-seq-len 4096
2026-04-27T21:37:48.499309Z ERROR mistralrs_core::engine: step - Model failed with error: unexpected dtype, expected: BF16, got: F8E4M3
2026-04-27T21:37:48.509262Z ERROR mistralrs_core::engine: step - Model failed with error: unexpected dtype, expected: BF16, got: F8E4M3
2026-04-27T21:37:48.522146Z ERROR mistralrs_core::engine: step - Model failed with error: unexpected dtype, expected: BF16, got: F8E4M3

After (Patched Branch)

echo "What is 2+2? Only answer with the number 4." | ./target/release/mistralrs run text -m CalamitousFelicitousness/Qwen2.5-1.5B-Instruct-fp8-dynamic -a qwen2 --max-seq-len 1024 --paged-attn off
4

Stats:
Time to first token: 0.88s
Prompt: 44 tokens, 50.11 T/s
Decode: 2 tokens, 95.24 T/s
Prefix cache: 0 hits / 1 turns
Sampling: temp=0.7, top_k=20, top_p=0.8, min_p=off, rep_pen=1.1

Additional logic-vet validation (698933761)

cargo test -p mistralrs-quant fp8_dequant_dtype_never_uses_fp8_storage_dtype --lib
cargo fmt --all -- --check
git diff --check
cargo clippy --workspace --tests --examples -- -D warnings

All passed locally.

Runtime validation used Qwen2.5-1.5B-Instruct-fp8-dynamic as a representative dynamic FP8/Qwen path. The exact original 27B repro (Qwen/Qwen3.5-27B-FP8) still has not been run and likely requires an A100/H100-class GPU with sufficient VRAM.

@glaziermag glaziermag force-pushed the fix-fp8-unquantlinear-mismatch branch from fca8cc1 to e7f2e64 on April 17, 2026 19:05

github-actions Bot commented Apr 17, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

@glaziermag

Contributor Author

Housekeeping note: This branch currently bundles the CI fix from #2115 (.typos.toml, openapi_doc.rs, distributed/layers.rs). Once #2115 is merged, this branch will need a rebase onto updated master to drop the duplicate CI fix commit and resolve the resulting conflicts.

glaziermag and others added 2 commits April 17, 2026 19:30
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged
on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls
  across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs,
  pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
…inear

This fixes issue EricLBuehler#2072 where FP8-quantized weights could not be directly loaded into UnquantLinear fallback configurations due to unsupported native CUDA PTX casts in candle. UnquantLinear now safely routes F8E4M3 via scalar_fp8::ops::fp8_to_dtype.

Signed-off-by: Gabe <gabe@example.com>
@glaziermag glaziermag force-pushed the fix-fp8-unquantlinear-mismatch branch from e7f2e64 to c7ab2a2 on April 18, 2026 02:31
…mats

Mistral.rs expected `weight_scale_inv` for FP8 layers, which is missing in standard Hugging Face safetensors (like Qwen 2.5 and LLaMA 3). This caused FP8 layers to fall back to `UnquantLinear` unscaled, resulting in garbled output due to the naive PR 2096 `to_dtype` cast.
- Reverted the dangerous unscaled cast in `UnquantLinear` and replaced it with a clear error message.
- Patched the `pertensor_fp8`, `blockwise_fp8`, and `vector_fp8` loaders to locate and invert the `weight_scale` and `input_scale` tensors, so that native Hugging Face FP8 models are routed to the native FP8 dequantization kernels.
@glaziermag glaziermag force-pushed the fix-fp8-unquantlinear-mismatch branch from cad5973 to aee4ae8 on April 27, 2026 20:49
Add fp8_dequant_dtype() helper to prevent F8E4M3->F8E4M3 truncation
in all FP8 paths (blockwise, vector, distributed MoE).

Add fp8_scale_name() helper to resolve weight_scale_inv vs
weight_scale dynamically in all distributed MoE expert paths,
supporting both HuggingFace naming conventions.
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged
on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls
  across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs,
  pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
(cherry picked from commit 0a11bf2)
@glaziermag glaziermag changed the title from "fix(quant): resolve dtype mismatch casting F8E4M3 to BF16 in UnquantLinear (#2072)" to "fix(quant): avoid F8E4M3 compute dtype in FP8 dequantization" on Apr 28, 2026
@glaziermag

Contributor Author

Closing this rather than continuing to carry an A100/H100-only exact validation caveat. The branch has representative L4 FP8 validation and unit coverage, but the exact original 27B A100/H100 repro was not run.

@glaziermag glaziermag closed this Apr 28, 2026