server: avoid unnecessary checkpoint invalidation for recurrent / hybrid models #24035
Regrad wants to merge 7 commits into
This looks like an interesting and simple fix. @ggerganov @ngxson, what do you guys think?
I haven't observed unnecessary checkpoint invalidation with recurrent models, so I am not sure what the change is trying to fix. Most of the reports that we get are caused by the client injecting content into earlier messages, which is inefficient and which I don't think we need to try to support. As soon as I observe a valid problem, or get a proper report with a reproduction, I will fix it. I'm using pi daily and haven't observed any problems recently. The only optimization that is currently missing is a follow-up to #22929 to consider past user messages (not just the last one).
I'm using a Ryzen 395 Max+ with Qwen 3.6 27b, as well as Qwen 3.5 122b. My cache is constantly being flushed, and the prompt is re-processed from scratch every time. This fix has resolved the issue. I've tested it in LM Studio on Vulkan (AMD Radeon 8060S).
Log: |
- Log pos_end rejection with details
- Log pos_max rejection for recurrent/hybrid models
- Simulate what would happen WITHOUT PR protection
- Log pruning reason (pos_end vs pos_max)
…or (ANSI \033[33m)" This reverts commit 66f52fa.
be5c783 to 750c8d8
@Regrad thank you for this branch. It was driving me crazy that Qwen 3.6 35b, and also gemma 4 26b, as MoE models kept giving me the message "forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory" over the last few weeks, which led me to search for and find this PR. One thing I still noticed when running this LLM benchmark, https://github.com/alexziskind1/codeneedle, was a large gap of over 2000 tokens in what was invalidated. GLM 5.2 looked into this gap and came up with this investigation and small change. I don't know if it's of any use for your work, or if it exposes another "problem" with checkpoints and ub values (mine were -b 8192 and -ub 2048), but in my case I got better cache reuse. The fix or changes might be somewhat wrong; I'm just leaving this here in case it makes sense for your work in this PR.
I have reviewed and added your improvement. Since it was created using an LLM, I checked it and made changes to ensure that your correction only affects hybrid models.
fc18337 to 67798d1
Summary
Improve prompt checkpoint reuse for recurrent and hybrid models in llama-server.
For these models, the memory position range stored in a checkpoint does not always map cleanly to the reusable prompt prefix length. As a result, valid checkpoints could be discarded during follow-up requests, causing unnecessary prompt re-processing even when the new request still shares a reusable prefix with the previous one.
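To make "reusable prefix" concrete, here is a minimal sketch of measuring the token prefix shared by the previous and the new prompt. The function name common_prefix_len and the use of plain integer token ids are illustrative assumptions, not code taken from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of leading tokens shared by two prompts.
// A checkpoint taken at or before this length could, in principle, be reused.
inline size_t common_prefix_len(const std::vector<int32_t> & prev_prompt,
                                const std::vector<int32_t> & new_prompt) {
    size_t n = 0;
    while (n < prev_prompt.size() && n < new_prompt.size() &&
           prev_prompt[n] == new_prompt[n]) {
        n++;
    }
    return n;
}
```

For a follow-up chat request that only appends a message, this length would equal the whole previous prompt, which is the case where discarding checkpoints is most wasteful.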
What changed
This PR stores an additional pos_end value in each prompt checkpoint. It represents the end position of the prompt at the time the checkpoint was created.
When selecting a checkpoint for a new request:
pos_max;
When pruning checkpoints, a checkpoint is now removed if its saved prompt end exceeds the end of the new prompt, or if its memory range exceeds the current pos_next.
Expected effect
This avoids invalidating reusable checkpoints too aggressively for recurrent and hybrid models.
The expected result is less repeated prompt ingestion on follow-up requests with shared context, especially for long prompts and workloads where the same conversation prefix is reused across multiple requests.
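The pruning rule described under "What changed" can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: the struct checkpoint_sketch, the function prune_checkpoints, and the parameter new_prompt_end are hypothetical names, and the sketch assumes that "exceeds" means a strict greater-than comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified checkpoint record (not the real llama-server type).
struct checkpoint_sketch {
    int32_t pos_min; // first memory position covered by the checkpoint
    int32_t pos_max; // last memory position covered by the checkpoint
    int32_t pos_end; // prompt end position when the checkpoint was created
};

// Drop a checkpoint if its saved prompt end exceeds the end of the new prompt,
// or if its memory range exceeds the current pos_next.
// Returns the number of checkpoints removed.
inline size_t prune_checkpoints(std::vector<checkpoint_sketch> & cps,
                                int32_t new_prompt_end,
                                int32_t pos_next) {
    const size_t before = cps.size();
    cps.erase(std::remove_if(cps.begin(), cps.end(),
                  [&](const checkpoint_sketch & c) {
                      return c.pos_end > new_prompt_end || c.pos_max > pos_next;
                  }),
              cps.end());
    return before - cps.size();
}
```

Checking both conditions means a checkpoint survives only if it is consistent with the new prompt length and with the current memory position.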
Scope
The change is limited to prompt checkpoint bookkeeping and checkpoint selection in llama-server.
It does not change model inference logic or the existing checkpoint selection behavior for non-recurrent models.