fix(core): wire text-only Gemma3 GGUF loading #1964

Open
glaziermag wants to merge 17 commits into EricLBuehler:master from glaziermag:codex/gemma3-text-gguf

Conversation

@glaziermag

@glaziermag glaziermag commented Mar 3, 2026

Contributor

Note

Agent 4 A100 validation update (2026-05-13 UTC): classification TARGETED, feasibility FEASIBLE_NOW. On A100 base 2d4ba4f16f61e5e18be085d0dd137bc95cba038a, bartowski/google_gemma-3-1b-it-GGUF with google_gemma-3-1b-it-Q4_K_M.gguf failed before readiness with panic Unknown GGUF architecture gemma3. On PR head 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485, the same server command built, loaded the model, completed the dummy run, /v1/models reported the Gemma3 GGUF loaded, and /v1/chat/completions returned HTTP success. This validates text-only Gemma3 GGUF startup/runtime smoke, not FunctionGemma specifically and not output quality/accuracy.


Gemma3 GGUF Architecture Support

This PR implements the architecture and sliding-window mask forwarding required for text-only Gemma3 GGUF models.
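For readers unfamiliar with the term, a sliding-window mask can be sketched as below. This is a generic illustration, not the PR's implementation: the function name, the boolean-matrix representation, and the exact window convention (a query sees itself plus the previous `window - 1` positions) are assumptions.

```rust
/// Builds a `seq_len x seq_len` mask where `true` means "query position i
/// may attend to key position j". Hypothetical helper for illustration only.
fn sliding_window_mask(seq_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                // Causal (j <= i) and within the last `window` positions.
                // The `j <= i` check runs first, so `i - j` cannot underflow.
                .map(|j| j <= i && i - j < window)
                .collect()
        })
        .collect()
}
```

With `seq_len = 4` and `window = 2`, the last query position can see only key positions 2 and 3.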

Fixes in this update

  1. Keep quantized Gemma3 Q/K/V projection outputs in their native dtype through per-head QRmsNorm, then cast Q/K/V to the model dtype before RoPE/attention. This avoids the CUDA error "dtype mismatch in binary op", which occurred when the norm weights dequantized to F32 while Q/K/V had a different dtype.
  2. When rope.dimension_count is absent, fall back to attention.key_length instead of embedding_length / attention.head_count. The validated 1B GGUF has embedding_length = 1152, head_count = 4, and key_length = 256; using 1152 / 4 = 288 built RoPE tables with last dim 144 and failed against 256-wide Q/K heads.
  3. Add a focused GGUF metadata regression test for that Gemma3 1B shape.
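
The fallback order in fix 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's loader code: the `rope_dim` function and the plain `HashMap` stand in for the real GGUF metadata reader, the `gemma3.` key prefix follows the usual GGUF `<arch>.<field>` naming, and keeping the old `embedding_length / head_count` derivation as a last resort is an assumption.

```rust
use std::collections::HashMap;

/// Picks the rotary dimension from GGUF-style metadata.
/// Hypothetical helper; the real loader's types and names differ.
fn rope_dim(meta: &HashMap<&str, u32>) -> Option<u32> {
    // 1. Prefer the explicit rotary dimension when the file provides it.
    if let Some(&d) = meta.get("gemma3.rope.dimension_count") {
        return Some(d);
    }
    // 2. Otherwise use the key head width, which matches the Q/K head size.
    if let Some(&d) = meta.get("gemma3.attention.key_length") {
        return Some(d);
    }
    // 3. Assumed last resort: the old derivation, which gives 288 for the
    //    1B model and is what produced the mismatched RoPE tables.
    let embd = *meta.get("gemma3.embedding_length")?;
    let heads = *meta.get("gemma3.attention.head_count")?;
    embd.checked_div(heads)
}
```

With the validated 1B metadata (`embedding_length = 1152`, `head_count = 4`, `key_length = 256`, no `rope.dimension_count`), this sketch returns 256 rather than 288.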

Validation

Local machine: macOS ARM checkout, branch codex/gemma3-text-gguf
Pushed branch head tested: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485

cargo test -p mistralrs-core quantized_gemma3 --lib
cargo fmt --all -- --check
git diff --check
cargo clippy --workspace --tests --examples -- -D warnings

Result: all passed locally.

CUDA runtime machine: GCP g2-standard-8, 1x NVIDIA L4, driver 580.126.09, CUDA 12.9 compiler Build cuda_12.9.r12.9/compiler.35813241_0, Rust 1.95.0
Model: bartowski/google_gemma-3-1b-it-GGUF (google_gemma-3-1b-it-Q4_K_M.gguf)
Pushed branch head tested: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485

cargo test -p mistralrs-core quantized_gemma3 --lib
cargo build --release --features cuda -p mistralrs-server

Result: module tests passed on the L4 VM; CUDA release server build completed successfully.

Server command tested:

CUDA_VISIBLE_DEVICES=0 RUST_LOG=info ./target/release/mistralrs-server --port 18081 gguf \
  -m bartowski/google_gemma-3-1b-it-GGUF \
  -f google_gemma-3-1b-it-Q4_K_M.gguf

Observed startup result:

Model loaded.
git revision: 6b9bb6efb5dc9b24d4951024e6d4cac7d65a7485
Pipeline input modalities are [Text]
Pipeline output modalities are [Text]
Dummy run completed in 0.065664101s.
OpenAI-compatible server listening on http://0.0.0.0:18081.

Readiness request:

curl -fsS --max-time 2 http://127.0.0.1:18081/v1/models

Returned loaded model status:

{"id":"bartowski/google_gemma-3-1b-it-GGUF","object":"model","status":"loaded"}
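
The same readiness check could be scripted without curl. The sketch below uses only the Rust standard library; `fetch_models` and `is_ready` are hypothetical helper names, and matching on the exact substring `"status":"loaded"` assumes the server keeps emitting compact JSON as in the output above.

```rust
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Sends a plain HTTP/1.1 `GET /v1/models` and returns the raw response text.
/// `addr` is a host:port pair such as "127.0.0.1:18081".
fn fetch_models(addr: &str) -> std::io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    // Per-read timeout of 2 seconds, loosely mirroring curl's `--max-time 2`.
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    let request = format!(
        "GET /v1/models HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        addr
    );
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}

/// Treats the server as ready when the status code is 200 and the body
/// contains the `"status":"loaded"` field.
fn is_ready(response: &str) -> bool {
    let status_ok = response
        .lines()
        .next()
        .map_or(false, |line| line.split_whitespace().nth(1) == Some("200"));
    status_ok && response.contains("\"status\":\"loaded\"")
}
```

A caller would pass the result of `fetch_models("127.0.0.1:18081")` to `is_ready` and retry on `false` or on a connection error.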

Chat request:

curl -sS --fail-with-body --max-time 120 \
  -X POST http://127.0.0.1:18081/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"default","messages":[{"role":"user","content":"What is 2 plus 2? Answer with only the number."}],"max_tokens":16,"temperature":0.7}'

Returned HTTP success with completion_tokens: 11 and finish_reason: "stop".

Before/after notes

Previous L4 validation on this PR reached the Gemma3 forward path but did not start the HTTP server because it logged:

Model failed with error: internal error 'dtype mismatch in binary op'
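
A toy model of why the operation order matters is sketched below. It tracks dtypes only; `Tensor`, `DType`, and both helper functions are hypothetical, and the choice of F32 as the native dtype and BF16 as the model dtype is an assumption for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum DType {
    F32,
    BF16,
}

/// Stand-in for a tensor that carries only its dtype.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tensor {
    dtype: DType,
}

impl Tensor {
    fn to_dtype(self, dtype: DType) -> Tensor {
        Tensor { dtype }
    }

    /// Elementwise op that, like the failing kernel, rejects mixed dtypes.
    fn mul(self, rhs: Tensor) -> Result<Tensor, String> {
        if self.dtype != rhs.dtype {
            return Err("dtype mismatch in binary op".to_string());
        }
        Ok(self)
    }
}

/// Toy per-head RMS norm: only the weight multiply matters for this sketch.
fn rms_norm(x: Tensor, weight: Tensor) -> Result<Tensor, String> {
    x.mul(weight)
}

/// Ordering described in the PR: norm in the native dtype, then cast.
fn norm_then_cast(q: Tensor, w: Tensor, model: DType) -> Result<Tensor, String> {
    Ok(rms_norm(q, w)?.to_dtype(model))
}

/// Ordering assumed to trigger the error: cast first, then norm.
fn cast_then_norm(q: Tensor, w: Tensor, model: DType) -> Result<Tensor, String> {
    rms_norm(q.to_dtype(model), w)
}
```

With an F32 projection output and F32 norm weights, `norm_then_cast` succeeds and yields a BF16 tensor, while `cast_then_norm` returns the mismatch error.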

During this fix, after resolving the dtype issue, the next runtime blocker was:

Model failed with error: inconsistent last dim size in rope [1, 4, 1, 256] [1, 144] [1, 144]
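
The shapes in that message follow from simple arithmetic, sketched below. The helper names are hypothetical, and the rule that the cos/sin table's last dim must be half the head width is an assumption about the underlying RoPE kernel, inferred from the shapes in the error.

```rust
/// Last dim of a cos/sin table built for a given rotary dimension:
/// one entry per rotated pair of channels.
fn table_last_dim(rope_dim: usize) -> usize {
    rope_dim / 2
}

/// Assumed shape rule: the table's last dim must be half the Q/K head width.
fn check_rope_dims(head_dim: usize, table_dim: usize) -> Result<(), String> {
    if table_dim * 2 != head_dim {
        return Err(format!(
            "inconsistent last dim size in rope: head {} vs table {}",
            head_dim, table_dim
        ));
    }
    Ok(())
}
```

Using `embedding_length / head_count = 1152 / 4 = 288` gives a table dim of 144, which fails against 256-wide heads; using `key_length = 256` gives 128, which passes.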

The final pushed commit above reaches readiness and accepts /v1/chat/completions for the exact 1B GGUF repro model.

Caveats

  • Runtime validation covered google_gemma-3-1b-it-Q4_K_M.gguf on one NVIDIA L4 only.
  • This is a serving/runtime smoke test, not an output-quality or accuracy validation; the smoke response generated tokens but was not semantically evaluated.
  • Larger Gemma3 GGUF variants were not tested in this run.

Safe wording

Fixes text-only Gemma3 GGUF startup/runtime smoke; does not validate FunctionGemma or output quality.

@github-actions

github-actions Bot commented Mar 3, 2026

Code Metrics Report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Language              Files        Lines         Code     Comments       Blanks
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 C Header                  5          305          210           52           43
 CSS                       2         1181         1036           34          111
 CUDA                     59        17706        13869         1637         2200
 Dockerfile                1           39           22            8            9
 HTML                      2          235          197           14           24
 JavaScript               16         3580         2702          486          392
 Jinja2                    7          694          656            5           33
 JSON                     21          409          406            0            3
 Makefile                  1            6            5            0            1
 Metal Shading Lan|       31        11647         9007         1064         1576
 PowerShell                1          300          227           30           43
 Python                  125         8316         6808          412         1096
 Shell                     2          485          329           95           61
 Plain Text                3         3723            0         2413         1310
 TOML                     27         1290         1124           35          131
 YAML                      3           25           23            2            0
─────────────────────────────────────────────────────────────────────────────────
 Jupyter Notebooks         3          122           83           23           16
 |- Markdown               1           60           30           22            8
 |- Python                 1          122          113            1            8
 (Total)                              304          226           46           32
─────────────────────────────────────────────────────────────────────────────────
 Markdown                105        11197            0         8067         3130
 |- BASH                  72          934          691          149           94
 |- Dockerfile             1            1            1            0            0
 |- JSON                  20          719          719            0            0
 |- PowerShell             3            3            3            0            0
 |- Python                23         1038          862           60          116
 |- Rust                  51         2048         1718           54          276
 |- TOML                   6          207          164            0           43
 |- YAML                   2            9            8            1            0
 (Total)                            16156         4166         8331         3659
─────────────────────────────────────────────────────────────────────────────────
 Rust                    547       236072       207590         6565        21917
 |- Markdown             361         8962          452         7385         1125
 (Total)                           245034       208042        13950        23042
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Total                   961       311435       249055        28614        33766
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

@glaziermag glaziermag marked this pull request as ready for review March 5, 2026 21:06
@glaziermag glaziermag marked this pull request as draft March 6, 2026 21:36
@glaziermag

glaziermag commented Mar 10, 2026

Contributor Author

Testing validated: the PR compiled and the 'mistralrs-core' test suite passed on a VM.

@glaziermag glaziermag marked this pull request as ready for review March 10, 2026 06:00

@EricLBuehler EricLBuehler left a comment

Owner

Thanks @glaziermag! I think there is a small merge conflict.

Also noticed you updated to use dtolnay/rust-toolchain 👍.

@glaziermag

Contributor Author

Just resolved the small merge conflict in .typos.toml. Everything is good to go now!

@rnett

rnett commented Apr 9, 2026


This is something that would be great to have for gemma 4, too. I was going to make an issue and still can if it would be useful.

@glaziermag

Contributor Author

Closing: this PR has had CHANGES_REQUESTED from the maintainer (@EricLBuehler) with no response, CI is failing, and it overlaps with #1932 (Gemma 3 config-based routing in AutoVisionLoader). The two PRs touch the same architectural dispatch code, creating merge hazards. If Gemma3 GGUF support is still needed, it should be coordinated with #1932 in a fresh PR.

@glaziermag glaziermag closed this Apr 16, 2026
@glaziermag

Contributor Author

Reopening — Gemma3 GGUF support is real implementation work worth preserving. Will address the maintainer's review feedback, fix CI, and coordinate with #1932 to resolve the dispatch overlap.

@glaziermag

Contributor Author

Housekeeping note: This branch currently bundles the CI fix from #2115 (.typos.toml, openapi_doc.rs, distributed/layers.rs). Once #2115 is merged, this branch will need a rebase onto updated master to drop the duplicate CI fix commit and resolve the resulting conflicts.

glaziermag and others added 10 commits April 17, 2026 19:31
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged
on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls
  across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs,
  pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
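
For reference, the shape of three of the Clippy fixes named above can be sketched as follows. These are generic illustrations of each lint, not code from the repository; the function names and types are hypothetical.

```rust
use std::collections::HashMap;

/// useless_conversion: `zip` accepts any `IntoIterator`, so the argument
/// does not need an explicit `.into_iter()`.
fn pair_up(a: Vec<u32>, b: Vec<u32>) -> Vec<(u32, u32)> {
    a.into_iter().zip(b).collect()
}

/// iter_kv_map: take the values directly instead of mapping over
/// `(key, value)` pairs and discarding the key.
fn all_values(groups: HashMap<String, Vec<u32>>) -> Vec<u32> {
    let mut values: Vec<u32> = groups.into_values().flatten().collect();
    values.sort(); // HashMap iteration order is unspecified, so sort.
    values
}

/// manual_checked_ops: `checked_div` returns `None` for a zero divisor
/// instead of requiring a hand-written zero check.
fn per_rank(total: usize, world_size: usize) -> Option<usize> {
    total.checked_div(world_size)
}
```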
@glaziermag glaziermag force-pushed the codex/gemma3-text-gguf branch from 3ff7513 to 23604ff April 18, 2026 02:31
@glaziermag

Contributor Author

This PR needs a rebase against current master. There are 15 compilation errors from API drift:

  • qmethod_matmul has been renamed to qmatmul
  • Sequence::new_waiting signature has changed (missing seq_preallocated_cache argument)
  • Several struct field mismatches in the GGUF loading path

Verified on a GCP g2-standard-32 instance with Rust 1.88.0:

$ RUSTFLAGS="-A warnings" cargo check --workspace
error[E0599]: no method named `qmethod_matmul` found for struct `MatMul`
error[E0061]: this function takes 29 arguments but 28 arguments were supplied
...
error: could not compile `mistralrs-core` (lib) due to 15 previous errors

@glaziermag glaziermag force-pushed the codex/gemma3-text-gguf branch from 23604ff to 0fedd58 April 27, 2026 17:02
Fixes CI failures introduced by EricLBuehler#2109 (fast CUDA MMQ GGUF kernels) merged
on 2026-04-15, combined with new lints in Rust 1.95 stable:

Typos:
- Exclude vendored mistralrs-quant/kernels/mmq_gguf/ directory
- Add CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN to ignore

Rustfmt (1.95):
- Fix import ordering in openapi_doc.rs
- Reformat lines affected by other lint fixes

Clippy (1.95):
- useless_conversion: remove redundant .into_iter() in zip/extend calls
  across tool_dispatch, rag, llava, idefics3, gemma4, distributed/layers
- iter_kv_map: use .into_values().flatten() in default_scheduler
- manual_checked_ops: use .checked_div() in distributed/layers, video.rs,
  pyo3/util.rs
- let_unit_value: remove unit let binding in bench.rs
- dead_code: allow unused num_experts field in MoEExperts (set but unread)

Signed-off-by: glaziermag <glaziermag@users.noreply.github.com>
(cherry picked from commit 0a11bf2)
@glaziermag glaziermag changed the title from "feat(core): add text-only Gemma3 GGUF support" to "wip(core): wire text-only Gemma3 GGUF loading" Apr 28, 2026
@glaziermag glaziermag changed the title from "wip(core): wire text-only Gemma3 GGUF loading" to "fix(core): wire text-only Gemma3 GGUF loading" Apr 28, 2026
@glaziermag glaziermag marked this pull request as draft May 5, 2026 19:16

Contributor Author

Agent 6 follow-up on existing A100 validation: this remains valid as a targeted Gemma3 GGUF loading fix. Classification: TARGETED; feasibility: FEASIBLE_NOW. The A100 evidence covers text-only Gemma3 GGUF startup and chat smoke behavior; it does not validate FunctionGemma or output quality. Safe wording should stay limited to text-only Gemma3 GGUF loading. Recommendation: keep open/draft for review.

@glaziermag glaziermag marked this pull request as ready for review May 18, 2026 23:38
@glaziermag

Contributor Author

Marked ready for maintainer review under the narrowed claim in the PR body; the validation evidence is attached there. This is a targeted/invariant fix and should not be read as full closure of the broader linked issue unless the PR body explicitly says so.

3 participants