fix(quant): avoid F8E4M3 compute dtype in FP8 dequantization #2096
Closed
glaziermag wants to merge 9 commits into
fca8cc1 to e7f2e64
Code Metrics Report

| Language | Files | Lines | Code | Comments | Blanks |
|---|---:|---:|---:|---:|---:|
| C Header | 5 | 305 | 210 | 52 | 43 |
| CSS | 2 | 1181 | 1036 | 34 | 111 |
| CUDA | 59 | 17706 | 13869 | 1637 | 2200 |
| Dockerfile | 1 | 39 | 22 | 8 | 9 |
| HTML | 2 | 235 | 197 | 14 | 24 |
| JavaScript | 16 | 3580 | 2702 | 486 | 392 |
| Jinja2 | 7 | 694 | 656 | 5 | 33 |
| JSON | 21 | 409 | 406 | 0 | 3 |
| Makefile | 1 | 6 | 5 | 0 | 1 |
| Metal Shading Lan\| | 31 | 11647 | 9007 | 1064 | 1576 |
| PowerShell | 1 | 300 | 227 | 30 | 43 |
| Python | 125 | 8316 | 6808 | 412 | 1096 |
| Shell | 2 | 485 | 329 | 95 | 61 |
| Plain Text | 3 | 3723 | 0 | 2413 | 1310 |
| TOML | 27 | 1290 | 1124 | 35 | 131 |
| YAML | 3 | 25 | 23 | 2 | 0 |
| Jupyter Notebooks | 3 | 122 | 83 | 23 | 16 |
| \|- Markdown | 1 | 60 | 30 | 22 | 8 |
| \|- Python | 1 | 122 | 113 | 1 | 8 |
| (Total) | | 304 | 226 | 46 | 32 |
| Markdown | 105 | 11197 | 0 | 8067 | 3130 |
| \|- BASH | 72 | 934 | 691 | 149 | 94 |
| \|- Dockerfile | 1 | 1 | 1 | 0 | 0 |
| \|- JSON | 20 | 719 | 719 | 0 | 0 |
| \|- PowerShell | 3 | 3 | 3 | 0 | 0 |
| \|- Python | 23 | 1038 | 862 | 60 | 116 |
| \|- Rust | 51 | 2048 | 1718 | 54 | 276 |
| \|- TOML | 6 | 207 | 164 | 0 | 43 |
| \|- YAML | 2 | 9 | 8 | 1 | 0 |
| (Total) | | 16156 | 4166 | 8331 | 3659 |
| Rust | 547 | 236072 | 207590 | 6565 | 21917 |
| \|- Markdown | 361 | 8962 | 452 | 7385 | 1125 |
| (Total) | | 245034 | 208042 | 13950 | 23042 |
| Total | 961 | 311435 | 249055 | 28614 | 33766 |
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs, pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
…inear

This fixes issue EricLBuehler#2072, where FP8-quantized weights could not be loaded directly into UnquantLinear fallback configurations due to unsupported native CUDA PTX casts in candle. UnquantLinear now safely routes F8E4M3 via scalar_fp8::ops::fp8_to_dtype.

Signed-off-by: Gabe <gabe@example.com>
e7f2e64 to c7ab2a2
…mats

Mistral.rs expected `weight_scale_inv` for FP8 layers, which is missing in standard Hugging Face safetensors (like Qwen 2.5 and LLaMA 3). This caused FP8 layers to fall back to `UnquantLinear` unscaled, resulting in garbled output due to the naive PR 2096 `to_dtype` cast.

- Reverted the dangerous unscaled cast in `UnquantLinear` and replaced it with a clear error message.
- Patched the `pertensor_fp8`, `blockwise_fp8`, and `vector_fp8` loaders to locate and invert the `weight_scale` and `input_scale` variables, so that native Hugging Face FP8 models are routed through to the native FP8 dequantization kernels.
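To illustrate why an unscaled cast produces garbled output, here is a minimal sketch of per-tensor scaled dequantization. It assumes the common convention that a dequantized weight equals the stored value multiplied by a scale. The function names are hypothetical, and plain `f32` values stand in for already-decoded FP8 values; none of this is the mistral.rs implementation.

```rust
/// Hypothetical stand-in for a naive cast: the stored FP8 values are widened
/// (modelled here as plain f32) but no scale is applied.
fn unscaled_cast(stored: &[f32]) -> Vec<f32> {
    stored.to_vec()
}

/// Per-tensor dequantization under the assumed convention
/// `weight = stored * scale`.
fn dequantize_per_tensor(stored: &[f32], scale: f32) -> Vec<f32> {
    stored.iter().map(|q| q * scale).collect()
}
```

With stored values `[64.0, -32.0, 8.0]` and a scale of `1.0 / 64.0`, the scaled path recovers `[1.0, -0.5, 0.125]`, while the unscaled cast leaves every weight 64 times too large.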
cad5973 to aee4ae8
Add fp8_dequant_dtype() helper to prevent F8E4M3->F8E4M3 truncation in all FP8 paths (blockwise, vector, distributed MoE). Add fp8_scale_name() helper to resolve weight_scale_inv vs weight_scale dynamically in all distributed MoE expert paths, supporting both HuggingFace naming conventions.
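As a rough illustration of the dequant dtype invariant, a minimal sketch follows. The `DType` enum is a hypothetical stand-in for the real tensor dtype type, and both the `Option<DType>` bias parameter and the BF16 fallback are assumptions; the actual helper may choose differently.

```rust
/// Hypothetical stand-in for the tensor dtype enum; only the variants
/// needed for this sketch are listed.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DType {
    F8E4M3,
    BF16,
    F16,
    F32,
}

/// Pick a dequantization target that is never the FP8 storage dtype.
/// Assumed order of preference: the loader dtype, then the bias dtype,
/// then BF16 as a last resort.
fn fp8_dequant_dtype(vb_dtype: DType, bias_dtype: Option<DType>) -> DType {
    if vb_dtype != DType::F8E4M3 {
        return vb_dtype;
    }
    match bias_dtype {
        Some(dt) if dt != DType::F8E4M3 => dt,
        _ => DType::BF16,
    }
}
```

Under these assumptions, `fp8_dequant_dtype(DType::F8E4M3, None)` yields `DType::BF16`, so a loader whose dtype is the FP8 storage type can no longer push dequantized values back into 8 bits.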
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs, pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>

(cherry picked from commit 0a11bf2)
Closing this rather than continuing to carry an A100/H100-only exact validation caveat. The branch has representative L4 FP8 validation and unit coverage, but the exact original 27B A100/H100 repro was not run.
Fix: FP8 Unquant Linear Mismatch Crash & Dequantization Truncation
This PR fixes two fatal FP8 runtime bugs:
- `UnquantLinear` fallback crash: if an FP8 `weight` tensor reaches the unquantized path without a corresponding FP8 scale tensor, the old path attempted an unsupported cast and failed with `unexpected dtype, expected: BF16, got: F8E4M3`. The patched path now rejects this malformed state with an explicit error instead of continuing into the dtype mismatch.
- Dequantization truncation: `BlockwiseFP8Linear`, `VectorFP8Linear`, and distributed MoE expert paths used `vb.dtype()` as the dequant target. For FP8 models, `vb.dtype()` can be `F8E4M3` (the storage type), so expanded parameters could be coerced back to 8-bit floats. The patch centralizes FP8 dequant dtype selection and keeps dequantized tensors in a compute dtype.

Changes

Centralized helper (`lib.rs`)

- `fp8_dequant_dtype(vb_dtype, bias)`: determines the output dtype for FP8 dequantization. The invariant is: never dequantize FP8 back into `F8E4M3` (698933761).

Scale name resolution (`distributed/layers.rs`)

- `fp8_scale_name(vb, base)`: resolves `weight_scale_inv` vs `weight_scale` dynamically and errors if neither exists.
- `PackedExperts` and `FusedExperts`, stacked and per-expert forms, covering `gate_proj`, `up_proj`, `down_proj`, and `gate_up_proj` scale lookups.

Dequant dtype fixes

- `blockwise_fp8/mod.rs`: uses `fp8_dequant_dtype` instead of raw `vb.dtype()`.
- `vector_fp8/mod.rs`: uses `fp8_dequant_dtype` instead of raw `vb.dtype()`.
- `distributed/layers.rs`: `blockwise_fp8_moe()` calls use a compute dtype rather than FP8 storage dtype.

Missing-scale fallback (`unquantized/mod.rs`)

- `UnquantLinear` now explicitly errors on FP8 weights without scale tensors. This is intentionally a clean rejection, not an attempted dequantization path, because the scale tensor needed to recover the values is absent.

Empirical Execution Baseline (GCP `g2-standard-32` L4 GPU)

Validated on a representative dynamic FP8 model (`CalamitousFelicitousness/Qwen2.5-1.5B-Instruct-fp8-dynamic`).

Before (Master Branch)

After (Patched Branch)

Additional logic-vet validation (698933761)

- `cargo test -p mistralrs-quant fp8_dequant_dtype_never_uses_fp8_storage_dtype --lib`
- `cargo fmt --all -- --check`
- `git diff --check`
- `cargo clippy --workspace --tests --examples -- -D warnings`

All passed locally.
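The scale-name resolution and the missing-scale rejection described under Changes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the branch: a `HashSet` of tensor names stands in for the real `vb` loader, the `{base}.{suffix}` naming and the preference for `weight_scale_inv` over `weight_scale` are guesses, and `reject_fp8_without_scale` is a hypothetical name.

```rust
use std::collections::HashSet;

/// Resolve the scale tensor name for `base`, trying `weight_scale_inv`
/// first and then `weight_scale` (the order is an assumption).
/// `names` is a hypothetical stand-in for the tensors a checkpoint exposes.
fn fp8_scale_name(names: &HashSet<String>, base: &str) -> Result<String, String> {
    for suffix in ["weight_scale_inv", "weight_scale"] {
        let candidate = format!("{}.{}", base, suffix);
        if names.contains(&candidate) {
            return Ok(candidate);
        }
    }
    Err(format!(
        "no FP8 scale tensor for `{}` (tried weight_scale_inv, weight_scale)",
        base
    ))
}

/// Hypothetical guard for the unquantized path: an FP8 weight with no scale
/// tensor is rejected outright instead of being cast without a scale.
fn reject_fp8_without_scale(
    names: &HashSet<String>,
    base: &str,
    weight_is_fp8: bool,
) -> Result<(), String> {
    if weight_is_fp8 && fp8_scale_name(names, base).is_err() {
        return Err(format!(
            "FP8 weight `{}` has no scale tensor; refusing to load it unquantized",
            base
        ));
    }
    Ok(())
}
```

Returning an error from both functions keeps the failure at load time, where the missing tensor can be named, rather than letting it surface later as a dtype mismatch.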