server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models by Regrad · Pull Request #24035 · ggml-org/llama.cpp

Regrad · 2026-06-02T16:52:23Z

Summary

Improve prompt checkpoint reuse for recurrent and hybrid models in llama-server.

For these models, the memory position range stored in a checkpoint does not always map cleanly to the reusable prompt prefix length. As a result, valid checkpoints could be discarded during follow-up requests, causing unnecessary prompt re-processing even when the new request still shares a reusable prefix with the previous one.

What changed

This PR stores an additional pos_end value in each prompt checkpoint. It represents the end position of the prompt at the time the checkpoint was created.

When selecting a checkpoint for a new request:

- checkpoints extending beyond the end of the new prompt are rejected;
- recurrent and hybrid models use a dedicated reuse condition based on pos_max;
- non-recurrent models keep the existing SWA-based condition.
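
The selection rules above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `checkpoint` struct, the `select_checkpoint` helper, and the exact comparisons are assumptions. In particular, it assumes that pos_end is an exclusive end position compared against the new prompt's token count, that the recurrent/hybrid condition is `pos_max < n_past`, and that the SWA-based condition is `pos_min < max(0, n_past - n_swa)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified checkpoint record. Field names follow the server
// log output (pos_min, pos_max) plus the pos_end value described above.
struct checkpoint {
    int32_t pos_min; // first memory position covered by the saved state
    int32_t pos_max; // last memory position covered by the saved state
    int32_t pos_end; // prompt end (exclusive) when the checkpoint was created
};

// Returns the index of the most recent usable checkpoint, if any.
//   n_past   - length of the prefix shared with the cached prompt
//   n_prompt - number of tokens in the new prompt
//   n_swa    - sliding-window size (only used for non-recurrent models)
inline std::optional<size_t> select_checkpoint(
        const std::vector<checkpoint> & cps,
        int32_t n_past, int32_t n_prompt, int32_t n_swa, bool is_recurrent) {
    assert(n_past >= 0 && n_past <= n_prompt);

    const int32_t pos_min_thold = std::max<int32_t>(0, n_past - n_swa);

    // walk from the newest checkpoint to the oldest
    for (size_t i = cps.size(); i-- > 0; ) {
        const checkpoint & cur = cps[i];

        // rule 1: reject checkpoints that extend beyond the new prompt
        if (cur.pos_end > n_prompt) {
            continue;
        }

        // rule 2 (assumed form): recurrent/hybrid models check pos_max
        // rule 3 (assumed form): other models keep the SWA-based check
        const bool ok = is_recurrent
            ? cur.pos_max < n_past
            : cur.pos_min < pos_min_thold;

        if (ok) {
            return i;
        }
    }

    return std::nullopt;
}
```

In this sketch the newest matching checkpoint wins, so a follow-up request that shares a long prefix restores as much state as possible instead of falling back to full re-processing.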

When pruning checkpoints, a checkpoint is now removed if its saved prompt end exceeds the end of the new prompt, or if its memory range exceeds the current pos_next.
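
A minimal sketch of the pruning rule, under the same caveats: `checkpoint` and `prune_checkpoints` are hypothetical names, a `std::vector` stands in for whatever container the server actually uses, and "memory range exceeds the current pos_next" is read here as `pos_max > pos_next`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified checkpoint record (see the log fields
// pos_min / pos_max / pos_end).
struct checkpoint {
    int32_t pos_min; // first memory position covered by the saved state
    int32_t pos_max; // last memory position covered by the saved state
    int32_t pos_end; // prompt end (exclusive) when the checkpoint was created
};

// Erases checkpoints that can no longer be reused and returns how many
// were removed.
//   n_prompt - number of tokens in the new prompt
//   pos_next - position at which prompt processing will resume
inline size_t prune_checkpoints(std::vector<checkpoint> & cps,
                                int32_t n_prompt, int32_t pos_next) {
    assert(pos_next >= 0);

    const auto stale = [&](const checkpoint & cur) {
        return cur.pos_end > n_prompt  // saved prompt end exceeds the new prompt
            || cur.pos_max > pos_next; // saved state lies beyond pos_next (assumed form)
    };

    const size_t before = cps.size();
    cps.erase(std::remove_if(cps.begin(), cps.end(), stale), cps.end());
    return before - cps.size();
}
```

With checkpoints ending at 8192, 9676 and 11719 tokens and `pos_next = 9676`, only the last one would be erased under this reading, since its state lies past the point where processing resumes.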

Expected effect

This avoids invalidating reusable checkpoints too aggressively for recurrent and hybrid models.

The expected result is less repeated prompt ingestion on follow-up requests with shared context, especially for long prompts and workloads where the same conversation prefix is reused across multiple requests.

Scope

The change is limited to prompt checkpoint bookkeeping and checkpoint selection in llama-server.

It does not change model inference logic or the existing checkpoint selection behavior for non-recurrent models.

pwilkin · 2026-06-03T09:53:39Z

This looks like an interesting and simple fix, @ggerganov @ngxson what do you guys think?

ggerganov · 2026-06-03T10:41:57Z

I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages which is inefficient, but I don't think we need to try to support.

As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

Regrad · 2026-06-03T11:21:45Z

> I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are due to the client injecting stuff in earlier messages which is inefficient, but I don't think we need to try to support.
>
> As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).

I'm using a Ryzen AI Max+ 395 with Qwen 3.6 27b and Qwen 3.5 122b. My cache is constantly being flushed, and the prompt is processed from scratch on every request. This fix has resolved the issue. I've tested it in LM Studio on Vulkan (AMD Radeon 8060S).

Regrad · 2026-06-03T11:30:20Z

Log:

18: slot update_slots: id  3 | task 17322 | new prompt, n_ctx_slot = 262144, n_keep = 15, task.n_tokens = 15
19: slot update_slots: id  3 | task 17322 | cache reuse is not supported - ignoring n_cache_reuse = 256
20: slot update_slots: id  3 | task 17322 | n_past = 15, slot.prompt.tokens.size() = 22, seq_id = 3, pos_min = 21, n_swa = 1
21: slot update_slots: id  3 | task 17322 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
22: slot update_slots: id  3 | task 17322 | n_tokens = 0, memory_seq_rm [0, end)
23: slot update_slots: id  3 | task 17322 | prompt processing progress, n_tokens = 11, batch.n_tokens = 12, progress = 0.733333
24: [2026-04-02 11:55:04][INFO][qwen3.5-122b-a10b@?] Prompt processing progress: 0.0%

192407: slot update_slots: id  0 | task 97964 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
192408: slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154657, pos_max = 154657, n_tokens = 154658, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192409: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 154767, pos_max = 154767, n_tokens = 154768, n_swa = 0, pos_next = 0, size = 62.813 MiB)
192410: [2026-05-02 23:14:58][DEBUG] slot update_slots: id  0 | task 97964 | erased invalidated context checkpoint (pos_min = 155123, pos_max = 155123, n_tokens = 155124, n_swa = 0, pos_next = 0, size = 62.813 MiB)

194477: [2026-05-02 23:23:01][DEBUG] slot update_slots: id  0 | task 98398 | restored context checkpoint (pos_min = 18818, pos_max = 18818, n_tokens = 18819, n_past = 18819, size = 62.813 MiB)
195051: [2026-05-02 23:23:30][DEBUG] slot update_slots: id  0 | task 98572 | restored context checkpoint (pos_min = 8191, pos_max = 8191, n_tokens = 8192, n_past = 8192, size = 62.813 MiB)

- Log pos_end rejection with details
- Log pos_max rejection for recurrent/hybrid models
- Simulate what would happen WITHOUT PR protection
- Log pruning reason (pos_end vs pos_max)


ichim-david · 2026-06-15T16:02:49Z

@Regrad thank you for this branch. It was driving me crazy that Qwen 3.6 35b and also gemma 4 26b, both MoE models, kept giving me the message "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory" over the last few weeks, which led me to search for and find this PR.

One thing that I still noticed when running this LLM benchmark, https://github.com/alexziskind1/codeneedle, was something like this:

3.11.883.089 W slot update_slots: id 0 | task 986 | restored context checkpoint (pos_min = 9675, pos_max = 9675, pos_end = 9676, n_tokens = 9676, n_past = 9676, size = 81.896 MiB)
3.11.883.092 W slot update_slots: id 0 | task 986 | erased invalidated context checkpoint (pos_min = 11718, pos_max = 11718, pos_end = 11719, n_tokens = 11719, n_swa = 0, prefix_end = 11586, pos_next = 9676, size = 85.925 MiB)

GLM 5.2 looked into this gap of over 2000 tokens between the restored checkpoint and the invalidated one, and came up with this investigation and a small change:
https://gist.github.com/ichim-david/b1f635868d62442894caf019041cdaf3

I don't know if it's of any use for your work, or if it exposes another "problem" with checkpoints and ub values (mine were -b 8192 and -ub 2048), but in my case I got better cache reuse.

The fix or changes might be somewhat wrong; I'm just leaving this here in case it makes sense for your work in this PR.

Regrad · 2026-06-15T19:38:38Z

> @Regrad thank you for this branch. It was driving me crazy that Qwen 3.6 35b and also gemma 4 26b, both MoE models, kept giving me the message "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory" over the last few weeks, which led me to search for and find this PR.

I have reviewed and added your improvement. Since it was created using an LLM, I checked it and made changes to ensure that your correction only affects hybrid models.

Regrad requested review from a team as code owners June 2, 2026 16:52

github-actions bot added the examples and server labels Jun 2, 2026

Regrad closed this Jun 2, 2026

Regrad reopened this Jun 2, 2026

Regrad marked this pull request as draft June 4, 2026 09:19

rankaiyx pushed a commit to rankaiyx/llama.cpp that referenced this pull request Jun 6, 2026

style: highlight PR ggml-org#24035 debug logs with yellow color (ANSI…

66f52fa

… \033[33m)

rankaiyx pushed a commit to rankaiyx/llama.cpp that referenced this pull request Jun 6, 2026

Revert "style: highlight PR ggml-org#24035 debug logs with yellow col…

0f6a4a9

…or (ANSI \033[33m)" This reverts commit 66f52fa.

Regrad force-pushed the fix/qwen-hybrid-checkpoint-reuse branch from be5c783 to 750c8d8 Compare June 15, 2026 11:25

Regrad marked this pull request as ready for review June 15, 2026 14:30

Regrad added 5 commits June 15, 2026 23:19

server: improve checkpoint reuse heuristics for recurrent/hybrid models

2aabf8a

server: retain prompt checkpoint coverage

74eaaab

server: adapt checkpoint retention to list storage

b082bd0

server: checkpoint the shared prompt prefix

8f23fd6

server: limit prefix checkpoints to recurrent models

67798d1

Regrad force-pushed the fix/qwen-hybrid-checkpoint-reuse branch from fc18337 to 67798d1 Compare June 15, 2026 20:34

server: checkpoint before media chunks

001ee99

Regrad marked this pull request as draft June 16, 2026 07:20

server: anchor reusable prefix checkpoints

9b11fe7

Regrad wants to merge 7 commits into ggml-org:master from Regrad:fix/qwen-hybrid-checkpoint-reuse

4 participants
