feat(scheduler): align SGLang chunked prefill strategy #6607
Status: Open
Foriv wants to merge 6 commits into PaddlePaddle:develop
Conversation
added 6 commits on March 2, 2026 at 10:15
- Fix deadlock when 1 decode request remains
- Use sync free_block_ids for block release
- Fix preemption failure loop handling (continue vs break)
- Align block pre-allocation with SGLang
- Ensure decode tasks are always scheduled
- Fix origin_input_ids AttributeError

…detection:
- Add chunked_prefill_size parameter in args_utils.py for dynamic calculation based on GPU memory
- Add chunked_prefill_size attribute in SchedulerConfig class (default: 8192)
- Use paddle.device.cuda.get_device_properties() instead of torch for GPU memory detection
- Update resource_manager_v1.py to use min(num_new_tokens, chunked_prefill_size, token_budget), aligning with SGLang's _rem_tokens = min(rem_chunk_tokens, rem_total_tokens) logic

- Only reserve max_new_tokens for the last prefill chunk
- Use the current chunk's token count instead of the full prefill token count
- Reduces the admission threshold and improves throughput
CLA check: Foriv does not appear to be a GitHub user. A GitHub account is required to sign the CLA; if you already have one, add the email address used for this commit to your account.
Feature Overview
Align with SGLang's chunked prefill scheduling strategy, optimizing the token budget calculation and the token reservation logic:
Main Changes
1. Dynamic chunked prefill size calculation
Aligned with SGLang's logic.
SGLang's calculation:
FastDeploy's implementation:
New parameter: chunked_prefill_size
Computed automatically from GPU memory:
GPU memory detection switched to PaddlePaddle
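As a rough sketch of this memory-based selection: the helper and thresholds below are illustrative stand-ins, not the exact values used by SGLang or this PR (only the 8192 default comes from the PR description).

```python
def default_chunked_prefill_size(total_mem_gb: float) -> int:
    """Pick a chunked-prefill token cap from GPU memory size.

    The thresholds here are hypothetical examples; the PR computes the
    value dynamically and falls back to a SchedulerConfig default of 8192.
    """
    if total_mem_gb < 25:
        return 2048
    if total_mem_gb < 80:
        return 4096
    return 8192  # default mentioned in the PR's SchedulerConfig


# In FastDeploy the memory size would come from PaddlePaddle rather
# than torch, e.g.:
#   props = paddle.device.cuda.get_device_properties(0)
#   total_mem_gb = props.total_memory / (1024 ** 3)
print(default_chunked_prefill_size(140))
```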
2. Optimized chunked prefill reservation strategy
Only the last chunk reserves max_new_tokens:
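A minimal sketch of this reservation rule, with a hypothetical helper name (the real logic lives in the resource manager): intermediate chunks are admitted against their own token count only, and the decode budget is added solely for the final chunk.

```python
def tokens_to_reserve(remaining_prefill_tokens: int,
                      chunk_tokens: int,
                      max_new_tokens: int) -> int:
    """Admission cost of scheduling one prefill chunk (illustrative).

    Intermediate chunks reserve only their own tokens; the final chunk
    additionally reserves max_new_tokens for decoding. This lowers the
    admission threshold for intermediate chunks and improves throughput.
    """
    is_last_chunk = chunk_tokens >= remaining_prefill_tokens
    reserve = min(chunk_tokens, remaining_prefill_tokens)
    if is_last_chunk:
        reserve += max_new_tokens
    return reserve


print(tokens_to_reserve(10000, 4096, 512))  # intermediate chunk: 4096
print(tokens_to_reserve(3000, 4096, 512))   # last chunk: 3000 + 512 = 3512
```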
3. Aligned intermediate-chunk handling with SGLang's chunked prefill
About ScheduledExtendBlocksTask
In the chunked prefill scenario, SGLang splits a prefill into multiple chunks:
Problem
The worker's gpu_model_runner.py only handles the PREFILL, DECODE, and PREEMPTED task types, so the EXTEND type (ScheduledExtendBlocksTask) cannot be processed correctly.
Solution
Convert ScheduledExtendBlocksTask into a PREFILL task with new_tokens=0: the PREFILL type is already handled by the worker, and new_tokens=0 marks the step as an extend operation for an intermediate chunk.
Affected Files
Comparison with SGLang
Change comparison (example with 140 GB of GPU memory)
Core Code Logic
Chunked prefill size calculation
Dynamic per-step prefill token count calculation
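The per-step token count described above can be sketched in one line; this mirrors the PR's min(num_new_tokens, chunked_prefill_size, token_budget) and SGLang's _rem_tokens = min(rem_chunk_tokens, rem_total_tokens):

```python
def prefill_tokens_this_step(num_new_tokens: int,
                             chunked_prefill_size: int,
                             token_budget: int) -> int:
    """Tokens to prefill for a request in one scheduler step.

    A request can never exceed its remaining prompt tokens, the
    per-chunk cap, or the step's remaining token budget.
    """
    return min(num_new_tokens, chunked_prefill_size, token_budget)


print(prefill_tokens_this_step(20000, 8192, 6000))   # budget-bound: 6000
print(prefill_tokens_this_step(20000, 8192, 16384))  # chunk-bound: 8192
print(prefill_tokens_this_step(1000, 8192, 16384))   # request fits: 1000
```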
Intermediate chunk handling for chunked prefill
Dependencies
Depends on PR #1 (SGLang-style dynamic resource management)