fix(core): wire text-only Gemma3 GGUF loading #1964
Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing validated: the PR compiled and the 'mistralrs-core' test suite passed on a VM.
Thanks @glaziermag! I think there is a small merge conflict.
Also noticed you updated to use dtolnay/rust-toolchain 👍.
Just resolved the small merge conflict in
This is something that would be great to have for gemma 4, too. I was going to make an issue and still can if it would be useful.
Closing: this PR has had CHANGES_REQUESTED from the maintainer (@EricLBuehler) with no response, CI is failing, and it overlaps with #1932 (Gemma 3 config-based routing in AutoVisionLoader). The two PRs touch the same architectural dispatch code, creating merge hazards. If Gemma3 GGUF support is still needed, it should be coordinated with #1932 in a fresh PR.
Reopening — Gemma3 GGUF support is real implementation work worth preserving. Will address the maintainer's review feedback, fix CI, and coordinate with #1932 to resolve the dispatch overlap.
Force-pushed from 8828353 to 3ff7513.
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs, pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
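For readers unfamiliar with the Clippy lints named in the commit message, the following is a minimal sketch of the kind of rewrite each one asks for. The function names (`pair_up`, `drain_all`, `per_rank`) are hypothetical and are not taken from the mistral.rs code; the "before" forms in the comments are illustrative guesses at the flagged patterns, not quotes from the diff.

```rust
use std::collections::HashMap;

/// `useless_conversion`: `zip` accepts any `IntoIterator`, so calling
/// `.into_iter()` on its argument is redundant.
fn pair_up(names: Vec<String>, scores: Vec<u32>) -> Vec<(String, u32)> {
    // Before (flagged): names.into_iter().zip(scores.into_iter()).collect()
    names.into_iter().zip(scores).collect()
}

/// `iter_kv_map`: when only the map's values are needed, take them directly
/// instead of iterating key/value pairs and discarding the key.
fn drain_all(queues: HashMap<usize, Vec<u32>>) -> Vec<u32> {
    // Before (flagged): queues.into_iter().map(|(_, v)| v).flatten().collect()
    // Note: HashMap iteration order is unspecified, so the result order is too.
    queues.into_values().flatten().collect()
}

/// `manual_checked_ops`: replace a hand-written zero check before a division
/// with `checked_div`, which returns `None` when the divisor is zero.
fn per_rank(total: usize, world_size: usize) -> Option<usize> {
    // Before (flagged): if world_size == 0 { None } else { Some(total / world_size) }
    total.checked_div(world_size)
}
```

All three rewrites are behavior-preserving, which is why they can be bundled into a CI-only commit.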
Force-pushed from 3ff7513 to 23604ff.
This PR needs a rebase against current master. There are 15 compilation errors from API drift:
Verified on a GCP g2-standard-32 instance with Rust 1.88.0:
Force-pushed from 23604ff to 0fedd58.
(Commit message identical to the one above; cherry picked from commit 0a11bf2)
Agent 6 follow-up on existing A100 validation: this remains valid as a targeted Gemma3 GGUF loading fix. Classification:
Marked ready for maintainer review under the narrowed claim in the PR body. Validation evidence and the narrowed claim wording are already attached in the PR discussion/body. This is a targeted/invariant fix and should not be read as full closure of the broader linked issue unless the PR body explicitly says so.
Note
Agent 4 A100 validation update (2026-05-13 UTC): classification `TARGETED`, feasibility `FEASIBLE_NOW`. On A100 base `2d4ba4f16f61e5e18be085d0dd137bc95cba038a`, `bartowski/google_gemma-3-1b-it-GGUF` with `google_gemma-3-1b-it-Q4_K_M.gguf` failed before readiness with panic `Unknown GGUF architecture gemma3`. On PR head `6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485`, the same server command built, loaded the model, completed the dummy run, `/v1/models` reported the Gemma3 GGUF loaded, and `/v1/chat/completions` returned HTTP success. This validates text-only Gemma3 GGUF startup/runtime smoke, not FunctionGemma specifically and not output quality/accuracy.

Gemma3 GGUF Architecture Support
This PR implements the architecture and sliding-window mask forwarding required for text-only Gemma3 GGUF models.
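To make "sliding-window mask" concrete, here is a minimal sketch of a causal sliding-window attention mask. It is an illustration only: the function name `sliding_window_mask` is hypothetical, the mask is a plain boolean matrix rather than a tensor, and the window convention (a query attends to itself and the `window - 1` positions before it) is an assumption, not necessarily the convention this PR uses.

```rust
/// Build a causal sliding-window mask as a boolean matrix.
/// `mask[i][j] == true` means query position `i` may attend to key position `j`.
/// Assumed convention: position `i` sees keys `j` with `j <= i` and `i - j < window`.
fn sliding_window_mask(seq_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                // `j <= i` is checked first, so `i - j` cannot underflow.
                .map(|j| j <= i && i - j < window)
                .collect()
        })
        .collect()
}
```

With `seq_len = 4` and `window = 2`, the last row allows only positions 2 and 3, while a plain causal mask would allow all four.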
Fixes in this update
- Apply `QRmsNorm`, then cast Q/K/V to the model dtype before RoPE/attention. This avoids the CUDA `dtype mismatch in binary op` seen when the norm weights dequantize to F32.
- When `rope.dimension_count` is absent, fall back to `attention.key_length` instead of `embedding_length / attention.head_count`. The validated 1B GGUF has `embedding_length = 1152`, `head_count = 4`, and `key_length = 256`; using `1152 / 4 = 288` built RoPE tables with last dim 144 and failed against 256-wide Q/K heads.

Validation
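The RoPE head-dimension fallback listed in the fixes above can be checked with a short worked sketch. The struct `AttnMeta` and the functions `rope_dim`, `legacy_rope_dim` and `rope_table_last_dim` are hypothetical names for illustration, not the loader's actual API. Keeping `embedding_length / head_count` as a last resort when `attention.key_length` is also missing is an assumption of this sketch, as is the usual convention that the cos/sin tables have a last dimension of half the rotary dimension.

```rust
/// Hypothetical subset of the GGUF metadata relevant to RoPE sizing.
struct AttnMeta {
    rope_dimension_count: Option<usize>,
    attention_key_length: Option<usize>,
    embedding_length: usize,
    attention_head_count: usize,
}

/// Resolution order sketched here: explicit rope dimension, then key length,
/// then (assumed last resort) embedding_length / head_count.
fn rope_dim(meta: &AttnMeta) -> Option<usize> {
    meta.rope_dimension_count
        .or(meta.attention_key_length)
        .or_else(|| meta.embedding_length.checked_div(meta.attention_head_count))
}

/// The earlier derivation described in the fix: ignores the key length.
fn legacy_rope_dim(meta: &AttnMeta) -> Option<usize> {
    meta.rope_dimension_count
        .or_else(|| meta.embedding_length.checked_div(meta.attention_head_count))
}

/// cos/sin tables hold one entry per rotated pair, so half the rotary dim.
fn rope_table_last_dim(rope_dim: usize) -> usize {
    rope_dim / 2
}
```

Plugging in the 1B GGUF values from the fix description (1152, 4, 256), the legacy path gives 288 and a table width of 144, while the key-length fallback gives 256 and a table width of 128, which matches the 256-wide Q/K heads.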
Local machine: macOS ARM checkout, branch `codex/gemma3-text-gguf`
Pushed branch head tested: `6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485`

    cargo test -p mistralrs-core quantized_gemma3 --lib
    cargo fmt --all -- --check
    git diff --check
    cargo clippy --workspace --tests --examples -- -D warnings

Result: all passed locally.

CUDA runtime machine: GCP `g2-standard-8`, 1x NVIDIA L4, driver `580.126.09`, CUDA 12.9 compiler `Build cuda_12.9.r12.9/compiler.35813241_0`, Rust `1.95.0`
Model: `bartowski/google_gemma-3-1b-it-GGUF` (`google_gemma-3-1b-it-Q4_K_M.gguf`)
Pushed branch head tested: `6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485`

    cargo test -p mistralrs-core quantized_gemma3 --lib
    cargo build --release --features cuda -p mistralrs-server

Result: module tests passed on the L4 VM; CUDA release server build completed successfully.
Server command tested:
Observed startup result:
Readiness request:
Returned loaded model status:

    {"id":"bartowski/google_gemma-3-1b-it-GGUF","object":"model","status":"loaded"}

Chat request:
Returned HTTP success with `completion_tokens: 11` and `finish_reason: "stop"`.

Before/after notes
Previous L4 validation on this PR reached the Gemma3 forward path but did not start the HTTP server because it logged:
During this fix, after resolving the dtype issue, the next runtime blocker was:
The final pushed commit above reaches readiness and accepts `/v1/chat/completions` for the exact 1B GGUF repro model.

Caveats
- Validated with `google_gemma-3-1b-it-Q4_K_M.gguf` on one NVIDIA L4 only.

Safe wording
Fixes text-only Gemma3 GGUF startup/runtime smoke; does not validate FunctionGemma or output quality.