Gemma4 MTP by am17an · Pull Request #17 · am17an/llama.cpp

am17an · 2026-05-19T15:56:42Z

Works with both gemma-31B and gemma-26B, but the MoE model (gemma-26B) is slower. On the dense model I see a ~2-2.5x speedup on my DGX Spark. The main problem is sharing the memory context between the two llama_contexts, so the current implementation is pretty hacky, and the ubatch splitting is not very clean either.

Replicated the AIME-26 results for Gemma-31B with -np 4

am17an · 2026-05-19T16:58:01Z

+    // of streams (one per active draft seq); q->ne[2] is not divisible by the full
+    // n_stream and the view collapses tokens. Slice k/v down to exactly the streams
+    // referenced by this ubatch. Requires those streams to form a contiguous range.
+    if (k->ne[3] > 1 && (uint32_t) k->ne[3] != ubatch.n_seqs_unq) {


@ggerganov this part
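
The constraint in the comment above (slice k/v down to the streams this ubatch references, which must form a contiguous range) can be illustrated with a small standalone sketch. This is not the PR's code: `contiguous_stream_range` and `slice_streams` are hypothetical helpers, and plain Python lists stand in for the stream dimension of the k/v tensors. It only models the index logic, not ggml views.

```python
def contiguous_stream_range(stream_ids, n_stream):
    """Return (first, count) for the streams referenced by a ubatch.

    Hypothetical helper, not part of llama.cpp: `stream_ids` stands in for the
    KV-cache streams a ubatch touches, `n_stream` for the total stream count.
    """
    unique = sorted(set(stream_ids))
    if not unique:
        raise ValueError("ubatch references no streams")
    if unique[0] < 0 or unique[-1] >= n_stream:
        raise ValueError("stream id out of range")
    first, count = unique[0], len(unique)
    # Contiguous means the ids are exactly first, first + 1, ..., first + count - 1.
    if unique[-1] - first + 1 != count:
        raise ValueError(f"streams {unique} do not form a contiguous range")
    return first, count


def slice_streams(kv_per_stream, stream_ids):
    """Keep only the per-stream entries referenced by the ubatch."""
    first, count = contiguous_stream_range(stream_ids, len(kv_per_stream))
    return kv_per_stream[first:first + count]
```

With four streams and a ubatch that touches streams 1 and 2, `slice_streams` keeps exactly those two entries. A ubatch that touches streams 0 and 2 is rejected, which mirrors the "contiguous range" requirement stated in the comment.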

ggerganov · 2026-05-21T08:47:44Z

@am17an Are these AIME results with default thinking, or did you set a reasoning budget?

am17an · 2026-05-21T08:55:00Z

Just the default, no budget

alexzk1 · 2026-06-02T18:04:55Z

I tried this branch with the MTP model from here (merged to the latest release):
https://huggingface.co/ironbcc/gemma-4-26B-A4B-it-MTP-GGUF
I use an AMD Ryzen 7 8745H w/ Radeon 780M Graphics / Vulkan.

The model on master answers at 8.5 t/s with 45000 context.
When I add MTP, the speed drops to 7.5 t/s.
Also, the master model does its thinking in English and then answers in the original language. The MTP model immediately answers in English without thinking, and the text appears "not fluent" (2-3 words at once).

* webui: added single line reasoning preview.
* patch: reduce width slightly for the previewing section
* refactor: move formatter constants to the right file
* feat: reimplement reasoning preview with throttled dynamic per-line rendering
* chore: fix spacing
  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: refactor to requested changes
* refactor: grouped by capture pattern instead of block-level + inline
* ui: fax interrupt state only trigger for 1st reasoning message
* chore: make reasoning preview respects showThoughtInProgress setting
* chore; newline at EOF
  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix: thread rawContent so collapsible content can handle compute preview
* patch: showThoughtInProgress accidentally blocks rawContent being passed
* chore: fix lint
* chore: change smoke test

---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore(ui): pin package versions to currently installed
  - Update all dependencies and devDependencies to match exactly what's in package-lock.json
  - This ensures reproducible builds by locking to specific versions rather than semver ranges
* chore: Update packages
* chore: Move remaining dependencies to devDependencies
* fix: Add missing `mermaid` package
* chore: Update `cookie` package to `v1.1.1`
* chore: Formatting
* test: Update test configs

…debar (ggml-org#23132)

* use child snippets for landing and chat message elements
* make ... icon visible in conversation history menu
* conversation history forward tab fix
* add snippet fix for fork icon in conversation history
* focus/keyboard fix for attachment x icon and scroll left/right
* formatting
* fix scroll down issue
* simply Statistics and pointer events in scrolldown
* create storybook tests and move to folder
* improve tests to actually assert on element

mmvq: Port the ncols_dst optimization from ggml-cuda/mmvq.cu to SYCL. Read weights once per dispatch instead of once per column. Covers all standard quant types + reorder paths for Q4_0, Q8_0, Q3_K, Q4_K, Q5_K, Q6_K. IQ types (except IQ4_XS) excluded due to incompatible vec_dot signatures.

ggml-sycl: The weight reorder was only bootstrapped on single-token mat-vec (ne[1] == 1). Speculative / MTP verify issues only multi-column mat-vec, so it never triggered the reorder and ran on the slower non-reorder kernel. Bootstrap it on small multi-column batches (ne[1] <= 8) too.
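
The change in the bootstrap condition described above can be written as two predicates. This is only a sketch of the stated conditions (`ne[1] == 1` before, `ne[1] <= 8` after), not the SYCL backend's code. The function names and the constant name are hypothetical, and the 3-column verify batch in the usage note is an assumed example.

```python
# Threshold quoted in the commit message (ne[1] <= 8); the name is hypothetical.
MAX_BOOTSTRAP_COLS = 8


def bootstraps_reorder_old(n_cols: int) -> bool:
    """Old behaviour: only single-token mat-vec (ne[1] == 1) bootstraps the reorder."""
    return n_cols == 1


def bootstraps_reorder_new(n_cols: int) -> bool:
    """New behaviour: small multi-column batches (ne[1] <= 8) bootstrap it too."""
    return 1 <= n_cols <= MAX_BOOTSTRAP_COLS


if __name__ == "__main__":
    for n_cols in (1, 3, 8, 9):
        print(n_cols, bootstraps_reorder_old(n_cols), bootstraps_reorder_new(n_cols))
```

Under the old predicate, a verify batch with, say, 3 columns never bootstraps the reorder. Under the new one it does, while larger batches such as 9 columns are still excluded.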

This PR attempts to slim down the dependencies for build-msys jobs making the same changes that we applied in whisper.cpp to reduce the size of the github actions cache, and should also improve the run time due to fewer dependencies that need to be installed. I realize this is a scheduled job but I think it would still make sense to apply these changes. Refs: ggml-org/whisper.cpp#3858

* Enroll mul_mat_vec_q_moe into PDL, boosting MTP performance on BW

Data collected on a B4500:

Before
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=212.8
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=196.4
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=226.6
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=225.1
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=201.5
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=197.2
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=209.2
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=208.9
```

After
```
(llama.cpp) ➜ llama.cpp git:(master) ✗ python mtp-bench.py
code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9
code_cpp pred= 192 draft= 147 acc= 117 rate=0.796 tok/s=224.6
explain_concept pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=207.8
summarize pred= 192 draft= 138 acc= 122 rate=0.884 tok/s=240.2
qa_factual pred= 192 draft= 138 acc= 121 rate=0.877 tok/s=238.5
translation pred= 192 draft= 158 acc= 112 rate=0.709 tok/s=213.4
creative_short pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=208.8
stepwise_math pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=221.7
long_code_review pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=220.7
```

Server launched with:
```
➜ llama.cpp git:(osimons/enroll_mul_mat_vec_q_moe_into_PDL) ✗ ./build-x64-linux-gcc-reldbg/bin/llama-server \
    -m /mnt/share/gguf/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -dio \
    --spec-type draft-mtp \
    --spec-draft-n-max 2 \
    -ngl all \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

* LC to overlap with following kernels
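
To read the benchmark rows above, it helps to note that the reported `rate` column matches `acc / draft` to three decimals, and that the before/after comparison is a ratio of the `tok/s` columns. The sketch below parses one row from each run and recomputes both. The regular expression and the helper names are assumptions made for illustration; they are not taken from `mtp-bench.py`.

```python
import re

# The code_python rows copied from the "Before" and "After" runs above.
BEFORE = "code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=202.8"
AFTER = "code_python pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=211.9"

# Hypothetical pattern for the row layout shown in the benchmark output.
ROW = re.compile(
    r"(\w+)\s+pred=\s*(\d+)\s+draft=\s*(\d+)\s+acc=\s*(\d+)"
    r"\s+rate=([\d.]+)\s+tok/s=([\d.]+)"
)


def parse_row(line):
    """Parse one benchmark row into a dict of typed fields."""
    m = ROW.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    name, pred, draft, acc, rate, tok_s = m.groups()
    return {
        "name": name,
        "pred": int(pred),
        "draft": int(draft),
        "acc": int(acc),
        "rate": float(rate),
        "tok_s": float(tok_s),
    }


def acceptance_rate(row):
    """Accepted draft tokens divided by drafted tokens."""
    return row["acc"] / row["draft"]


def speedup(before, after):
    """Throughput ratio between two runs of the same prompt."""
    return after["tok_s"] / before["tok_s"]


if __name__ == "__main__":
    b, a = parse_row(BEFORE), parse_row(AFTER)
    print(f"{b['name']}: rate={acceptance_rate(b):.3f} speedup={speedup(b, a):.3f}x")
```

For the `code_python` row, 116 accepted out of 150 drafted tokens gives the reported 0.773, and 211.9 / 202.8 is roughly a 1.045x throughput gain with the draft and acceptance counts unchanged.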

* hparams : refactor hparams.n_layer
* cont : remove `n_layer_kv()`, use n_layer_all instead
* cont : type consistency
* pi : update SYSTEM.md
* models : fix Step3.5 MTP
* cont : remove duplicate switch cases
* cont : explicitly set `false` to extra layers for `is_swa` and `is_recr`
* cont : fix nextn layer count handling
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update quantization readme
* install requirements
* Apply suggestions from code review
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* dos2unix suggestions

---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

github-actions Bot added examples python server model labels May 19, 2026

am17an force-pushed the gemma4-mtp branch from cd2e5b2 to a03120c May 20, 2026 16:28

am17an force-pushed the gemma4-mtp branch from a03120c to 4b1d1ae May 23, 2026 07:01

github-actions Bot added documentation Nvidia GPU testing ggml Vulkan script Apple Metal devops OpenCL Hexagon WebGPU build server/ui labels May 23, 2026

am17an force-pushed the gemma4-mtp branch from 4b1d1ae to c073320 May 28, 2026 04:57

github-actions Bot added SYCL AMD ZenDNN android labels May 28, 2026

am17an force-pushed the gemma4-mtp branch from e21d64b to b8e703e June 1, 2026 17:03

am17an added 2 commits June 4, 2026 18:51

llama: Gemma 4 MTP

f268966

fix multi-seq

9af0434

am17an added 5 commits June 4, 2026 18:54

add assert that draft + shared kv should be on same device

7b87cd3

add Q rot when cache is quantized

27461cd

add temp hack to not use fit with gemma4, rm later

777af6a

add exception in test-llama-archs

c0da00a

move assistant to separate file

dd97604

am17an force-pushed the gemma4-mtp branch from b8e703e to dd97604 June 4, 2026 12:47

gugugiyu and others added 17 commits June 4, 2026 16:09

Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)

e7bcf1c

* Deduplicate imatrix loading code
* Add back LLAMA_TRACE, early exit on quantize missing metadata

arg: fix double mtp downloads (ggml-org#24128)

260862b

server : disable on-device spec checkpoints (ggml-org#24108)

7c158fb

add unified assistant

4eaa3ce

kleidiai : dynamic chunck-based scheduling for hybrid execution (ggml-org#23819)

3ecfb15

minor : fix lint issues (ggml-org#24165)

59917d3

Merge branch 'master' into pr/23398

5954f19

cont : adjust to hparams changes

d78a386

cont : avoid computations on the CPU

f0438b1

am17an closed this Jun 5, 2026

Gemma4 MTP #17
am17an wants to merge 24 commits into master from gemma4-mtp

13 participants
