fix(core): wire text-only Gemma3 GGUF loading by glaziermag · Pull Request #1964 · EricLBuehler/mistral.rs

glaziermag · 2026-03-03T02:25:34Z

Note

Agent 4 A100 validation update (2026-05-13 UTC): classification TARGETED, feasibility FEASIBLE_NOW. On A100 base 2d4ba4f16f61e5e18be085d0dd137bc95cba038a, bartowski/google_gemma-3-1b-it-GGUF with google_gemma-3-1b-it-Q4_K_M.gguf failed before readiness with the panic "Unknown GGUF architecture gemma3". On PR head 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485, the same server command built, loaded the model, and completed the dummy run; /v1/models reported the Gemma3 GGUF as loaded, and /v1/chat/completions returned HTTP success. This validates a text-only Gemma3 GGUF startup/runtime smoke test; it does not validate FunctionGemma specifically, nor output quality or accuracy.

Gemma3 GGUF Architecture Support

This PR implements the architecture and sliding-window mask forwarding required for text-only Gemma3 GGUF models.

Fixes in this update

Keep quantized Gemma3 Q/K/V projection outputs in their native dtype through per-head QRmsNorm, then cast Q/K/V to the model dtype before RoPE/attention. This avoids the CUDA "dtype mismatch in binary op" error seen when the norm weights dequantize to F32.
When rope.dimension_count is absent, fall back to attention.key_length instead of embedding_length / attention.head_count. The validated 1B GGUF has embedding_length = 1152, head_count = 4, and key_length = 256; using 1152 / 4 = 288 built RoPE tables with last dim 144 and failed against 256-wide Q/K heads.
Add a focused GGUF metadata regression test for that Gemma3 1B shape.
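
The fallback order described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the rope_dim helper, the gemma3_1b_meta fixture, and the plain HashMap standing in for GGUF metadata are all hypothetical, and the idea that the RoPE tables carry half of the rotary dimension in their last axis is inferred from the 288 -> 144 figure quoted above.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the PR's API): pick the rotary dimension from
/// GGUF-style metadata. Preference order: rope.dimension_count, then
/// attention.key_length, and only then embedding_length / attention.head_count.
fn rope_dim(meta: &HashMap<&str, usize>) -> Option<usize> {
    if let Some(&d) = meta.get("rope.dimension_count") {
        return Some(d);
    }
    if let Some(&k) = meta.get("attention.key_length") {
        return Some(k);
    }
    let embed = *meta.get("embedding_length")?;
    let heads = *meta.get("attention.head_count")?;
    // checked_div returns None instead of panicking when head_count is 0.
    embed.checked_div(heads)
}

/// Metadata values reported for the validated 1B GGUF
/// (rope.dimension_count is absent).
fn gemma3_1b_meta() -> HashMap<&'static str, usize> {
    HashMap::from([
        ("embedding_length", 1152),
        ("attention.head_count", 4),
        ("attention.key_length", 256),
    ])
}

fn main() {
    let meta = gemma3_1b_meta();
    // Old behaviour: derive the rotary dimension from the embedding width.
    let old = meta["embedding_length"] / meta["attention.head_count"];
    // New behaviour: prefer attention.key_length when it is present.
    let new = rope_dim(&meta).expect("metadata is complete");
    println!("old rope dim {old} -> table last dim {}", old / 2);
    println!("new rope dim {new} -> table last dim {}", new / 2);
}
```

With the 1B values, the old derivation yields 288 while the fallback yields 256, which matches the head width the Q/K tensors actually have.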

Validation

Local machine: macOS ARM checkout, branch codex/gemma3-text-gguf
Pushed branch head tested: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485

cargo test -p mistralrs-core quantized_gemma3 --lib
cargo fmt --all -- --check
git diff --check
cargo clippy --workspace --tests --examples -- -D warnings

Result: all passed locally.

CUDA runtime machine: GCP g2-standard-8, 1x NVIDIA L4, driver 580.126.09, CUDA 12.9 compiler Build cuda_12.9.r12.9/compiler.35813241_0, Rust 1.95.0
Model: bartowski/google_gemma-3-1b-it-GGUF (google_gemma-3-1b-it-Q4_K_M.gguf)
Pushed branch head tested: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485

cargo test -p mistralrs-core quantized_gemma3 --lib
cargo build --release --features cuda -p mistralrs-server

Result: module tests passed on the L4 VM; CUDA release server build completed successfully.

Server command tested:

CUDA_VISIBLE_DEVICES=0 RUST_LOG=info ./target/release/mistralrs-server --port 18081 gguf \
  -m bartowski/google_gemma-3-1b-it-GGUF \
  -f google_gemma-3-1b-it-Q4_K_M.gguf

Observed startup result:

Model loaded.
git revision: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485
Pipeline input modalities are [Text]
Pipeline output modalities are [Text]
Dummy run completed in 0.065664101s.
OpenAI-compatible server listening on http://0.0.0.0:18081.

Readiness request:

curl -fsS --max-time 2 http://127.0.0.1:18081/v1/models

Returned loaded model status:

{"id":"bartowski/google_gemma-3-1b-it-GGUF","object":"model","status":"loaded"}

Chat request:

curl -sS --fail-with-body --max-time 120 \
  -X POST http://127.0.0.1:18081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"default","messages":[{"role":"user","content":"What is 2 plus 2? Answer with only the number."}],"max_tokens":16,"temperature":0.7}'

Returned HTTP success with completion_tokens: 11 and finish_reason: "stop".

Before/after notes

Previous L4 validation on this PR reached the Gemma3 forward path but did not start the HTTP server because it logged:

Model failed with error: internal error 'dtype mismatch in binary op'

During this fix, after resolving the dtype issue, the next runtime blocker was:

Model failed with error: inconsistent last dim size in rope [1, 4, 1, 256] [1, 144] [1, 144]

The final pushed commit above reaches readiness and accepts /v1/chat/completions for the exact 1B GGUF repro model.

Caveats

Runtime validation covered google_gemma-3-1b-it-Q4_K_M.gguf on one NVIDIA L4 only.
This is a serving/runtime smoke test, not an output-quality or accuracy validation; the smoke response generated tokens but was not semantically evaluated.
Larger Gemma3 GGUF variants were not tested in this run.

Safe wording

Fixes text-only Gemma3 GGUF startup/runtime smoke; does not validate FunctionGemma or output quality.

github-actions · 2026-03-03T02:26:49Z

Code Metrics Report

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

glaziermag · 2026-03-10T05:21:26Z

Testing validated: the PR compiled and the mistralrs-core test suite passed on a VM.

EricLBuehler

Thanks @glaziermag! I think there is a small merge conflict.

Also noticed you updated to use dtolnay/rust-toolchain 👍.

glaziermag · 2026-03-22T20:28:22Z

Just resolved the small merge conflict in .typos.toml. Everything is good to go now!

rnett · 2026-04-09T06:18:07Z

This is something that would be great to have for gemma 4, too. I was going to make an issue and still can if it would be useful.

glaziermag · 2026-04-16T22:51:24Z

Closing: this PR has had CHANGES_REQUESTED from the maintainer (@EricLBuehler) with no response, CI is failing, and it overlaps with #1932 (Gemma 3 config-based routing in AutoVisionLoader). The two PRs touch the same architectural dispatch code, creating merge hazards. If Gemma3 GGUF support is still needed, it should be coordinated with #1932 in a fresh PR.

glaziermag · 2026-04-16T22:55:20Z

Reopening — Gemma3 GGUF support is real implementation work worth preserving. Will address the maintainer's review feedback, fix CI, and coordinate with #1932 to resolve the dispatch overlap.

glaziermag · 2026-04-17T22:19:55Z

Housekeeping note: This branch currently bundles the CI fix from #2115 (.typos.toml, openapi_doc.rs, distributed/layers.rs). Once #2115 is merged, this branch will need a rebase onto updated master to drop the duplicate CI fix commit and resolve the resulting conflicts.

Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs, pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>

glaziermag · 2026-04-19T20:18:18Z

This PR needs a rebase against current master. There are 15 compilation errors from API drift:

qmethod_matmul has been renamed to qmatmul
Sequence::new_waiting signature has changed (missing seq_preallocated_cache argument)
Several struct field mismatches in the GGUF loading path

Verified on a GCP g2-standard-32 instance with Rust 1.88.0:

$ RUSTFLAGS="-A warnings" cargo check --workspace
error[E0599]: no method named `qmethod_matmul` found for struct `MatMul`
error[E0061]: this function takes 29 arguments but 28 arguments were supplied
...
error: could not compile `mistralrs-core` (lib) due to 15 previous errors

Same commit message as above, with the trailer: (cherry picked from commit 0a11bf2)

glaziermag · 2026-05-14T23:20:23Z

Agent 6 follow-up on existing A100 validation: this remains valid as a targeted Gemma3 GGUF loading fix. Classification: TARGETED; feasibility: FEASIBLE_NOW. The A100 evidence covers text-only Gemma3 GGUF startup and chat smoke behavior; it does not validate FunctionGemma or output quality. Safe wording should stay limited to text-only Gemma3 GGUF loading. Recommendation: keep open/draft for review.

glaziermag · 2026-05-18T23:38:54Z

Marked ready for review. Validation evidence and narrowed claim wording are already attached in the PR discussion/body. This PR is ready under the scoped claim described in the PR.

Ready for maintainer review under the narrowed claim in the PR body. This is a targeted/invariant fix and should not be read as full closure of the broader linked issue unless the PR body explicitly says so.

glaziermag marked this pull request as ready for review March 5, 2026 21:06

glaziermag marked this pull request as draft March 6, 2026 21:36

glaziermag marked this pull request as ready for review March 10, 2026 06:00

EricLBuehler requested changes Mar 21, 2026

glaziermag closed this Apr 16, 2026

glaziermag reopened this Apr 16, 2026

glaziermag mentioned this pull request Apr 16, 2026

fix(core): route Gemma 3 variants by parsed config in AutoVisionLoader #1932

Closed

glaziermag force-pushed the codex/gemma3-text-gguf branch from 8828353 to 3ff7513 April 17, 2026 19:04

glaziermag and others added 10 commits April 17, 2026 19:31

feat(core): add text-only Gemma3 GGUF support

9f2eaa2

fix(core): align Gemma3 GGUF backend with Gemma3 semantics

3412a9f

fix(core): honor Gemma3 sliding window metadata

d6f72d7

fix(core): satisfy clippy is_multiple_of lint

03409db

fix(core): use per-layer Gemma3 GGUF cache types

2715a24

fix(core): apply Gemma3 GGUF final logit softcapping

a0a8c69

fix(core): scale Gemma3 GGUF token embeddings

bfaa519

fix(core): honor Gemma3 GGUF rope scaling metadata

f625616

fix(core): add Gemma3 GGUF back-compat fallbacks

05e476b

glaziermag force-pushed the codex/gemma3-text-gguf branch from 3ff7513 to 23604ff April 18, 2026 02:31

fix(core): align gemma3 gguf mlp activation and metadata interval parsing

0fedd58

glaziermag force-pushed the codex/gemma3-text-gguf branch from 23604ff to 0fedd58 April 27, 2026 17:02

fix: make rope.dimension_count optional for gemma3 gguf

916f59d

glaziermag added 4 commits April 27, 2026 12:02

chore: remove unrelated files from PR and fix formatting

07e0756

fix(gguf): use Gemma3 device map sizes

0d43e96

fix(gguf): restore Gemma3 non-mapped sizing

138c52b

glaziermag changed the title from "feat(core): add text-only Gemma3 GGUF support" to "wip(core): wire text-only Gemma3 GGUF loading" Apr 28, 2026

fix(gguf): finish Gemma3 text loading

6b9bb6e

glaziermag changed the title from "wip(core): wire text-only Gemma3 GGUF loading" to "fix(core): wire text-only Gemma3 GGUF loading" Apr 28, 2026

glaziermag marked this pull request as draft May 5, 2026 19:16

glaziermag marked this pull request as ready for review May 18, 2026 23:38

glaziermag wants to merge 17 commits into EricLBuehler:master from glaziermag:codex/gemma3-text-gguf

3 participants