feat(scheduler): implement SGLang-style dynamic resource management and preemption#6606
Foriv wants to merge 3 commits into PaddlePaddle:develop from
Conversation
- Fix deadlock when 1 decode request remains
- Use sync free_block_ids for block release
- Fix preemption failure loop handling (continue vs break)
- Align block pre-allocation with SGLang
- Ensure decode tasks always scheduled
- Fix origin_input_ids AttributeError
Thanks for your contribution!

Foriv seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it.
Pull request overview
This PR introduces SGLang-aligned dynamic resource management and preemption into FastDeploy's V1 scheduling / resource-management path, centered on dynamically reserving decode resources by `new_token_ratio`, triggering preemption and KV cache eviction, and an optional priority-scheduling switch.
Changes:
- Add/replace `new_token_ratio`-based decode resource reservation, decay, and idle-reset logic, and introduce the corresponding environment variables.
- Rework preemption to "preempt decode requests only + SGLang-style ordering + KV cache eviction + ratio update after preemption", and re-enqueue preempted requests at the tail of the FIFO `waiting` queue.
- Add the `--enable-priority-scheduling` flag and relax some default limits (e.g. the `max_num_seqs` upper bound and the default `max_num_batched_tokens`).
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| fastdeploy/scheduler/config.py | Adds an enable_priority_scheduling default to SchedulerConfig, used to pass the scheduler-side switch through. |
| fastdeploy/envs.py | Removes the old fixed block-reservation env and adds the new_token_ratio-related envs. |
| fastdeploy/engine/sched/resource_manager_v1.py | Implements new_token_ratio reservation/decay/reset, SGLang-style preemption with KV eviction, and decode/prefill adjustments in the main scheduling loop. |
| fastdeploy/engine/args_utils.py | Adds the CLI flag --enable-priority-scheduling and changes the default max_num_batched_tokens. |
| fastdeploy/config.py | Relaxes the max_num_seqs upper-bound check (256 -> 512). |
```diff
 else:
-    self.max_num_batched_tokens = 8192  # if set to max_model_len, it's easy to be OOM
+    self.max_num_batched_tokens = 16384  # if set to max_model_len, it's easy to be OOM
```
The PR title is currently feat(scheduler): ..., which does not follow the repository's required [CLASS]Title format (e.g. [Feature] ... / [BugFix] ...). Consider renaming it to match the format so that changelog generation and automated classification work.
```python
decode_requests.sort(
    key=lambda r: (len(r.output_token_ids), -r.prompt_token_ids_len),
    reverse=True,  # pop from end: shorter output first
```
The sort in _trigger_preempt() contradicts its comment and the stated intent: decode_requests.sort(..., reverse=True) puts requests with the longest output first, but the subsequent `for preempted_req in decode_requests[:-1]` iterates from the head, so long-output requests get preempted first instead of short-output ones. Sort ascending by output_len (or keep reverse=True but pop/iterate from the tail) so that the policy really is "short output first, long input first".
Suggested change:

```diff
-        decode_requests.sort(
-            key=lambda r: (len(r.output_token_ids), -r.prompt_token_ids_len),
-            reverse=True,  # pop from end: shorter output first
+        # Sort ascending by (output_len, -input_len) so that decode_requests[0]
+        # is the highest-priority candidate for preemption.
+        decode_requests.sort(
+            key=lambda r: (len(r.output_token_ids), -r.prompt_token_ids_len),
```
```python
"FD_RESERVE_MIN_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL": lambda: int(
    os.getenv("FD_RESERVE_MIN_OUTPUT_BLOCK_NUM_FOR_DECODE_WHEN_SCHEDULE_NEW_PREFILL", "0")
),
# Reserve  tokens when schedule
```
The comment `# Reserve  tokens when schedule` contains a double space and is ungrammatical. Replace it with clearer English (e.g. "Reserve tokens when scheduling new requests") and drop the extra space for maintainability.
Suggested change:

```diff
-    # Reserve  tokens when schedule
+    # Reserve tokens when scheduling new requests
```
```python
self.max_num_seqs = 34
self.splitwise_role = "mixed"
self.enable_overlap_schedule = False
self.enable_priority_scheduling = False 
```
The `enable_priority_scheduling = False` line ends with trailing whitespace, which will trigger formatting/linter (pre-commit) warnings. Remove the trailing whitespace.
Suggested change:

```diff
-        self.enable_priority_scheduling = False 
+        self.enable_priority_scheduling = False
```
```diff
     self.max_num_batched_tokens = self.max_model_len
 else:
-    self.max_num_batched_tokens = 8192  # if set to max_model_len, it's easy to be OOM
+    self.max_num_batched_tokens = 16384  # if set to max_model_len, it's easy to be OOM
```
Raising the default max_num_batched_tokens from 8192 to 16384 significantly increases the per-batch token cap and may raise peak GPU memory usage or trigger OOM (the code comment itself notes this parameter is OOM-sensitive). Meanwhile, the default in fastdeploy/config.py is still 8192, so the defaults derived "from EngineArgs" and "from FDConfig.postprocess" diverge. Suggestions: 1) document the rationale and evaluation for the change; 2) consider gating it behind a config option or env; 3) at minimum keep the two default sources consistent.
Suggested change:

```diff
-    self.max_num_batched_tokens = 16384  # if set to max_model_len, it's easy to be OOM
+    # Use a conservative default to avoid OOM and keep consistent with FDConfig.
+    # Do not set to max_model_len by default because it will significantly increase peak memory usage.
+    self.max_num_batched_tokens = 8192
```
```python
self.new_token_ratio_decay = (
    self.init_new_token_ratio - self.min_new_token_ratio
) / envs.FD_NEW_TOKEN_RATIO_DECAY_STEPS
```
self.new_token_ratio_decay = (...) / envs.FD_NEW_TOKEN_RATIO_DECAY_STEPS raises a division-by-zero (or produces a nonsensical value) when the environment variable is set to 0 or a negative number. Add a lower bound on FD_NEW_TOKEN_RATIO_DECAY_STEPS (e.g. minimum 1), or set the decay to 0 with a warning when steps <= 0.
Suggested change:

```diff
-        self.new_token_ratio_decay = (
-            self.init_new_token_ratio - self.min_new_token_ratio
-        ) / envs.FD_NEW_TOKEN_RATIO_DECAY_STEPS
+        decay_steps = envs.FD_NEW_TOKEN_RATIO_DECAY_STEPS
+        if decay_steps is None or decay_steps <= 0:
+            # Avoid division by zero or negative values; fall back to no decay
+            self.new_token_ratio_decay = 0.0
+            llm_logger.warning(
+                "FD_NEW_TOKEN_RATIO_DECAY_STEPS is not a positive value "
+                f"({decay_steps}); setting new_token_ratio_decay to 0.0."
+            )
+        else:
+            self.new_token_ratio_decay = (
+                self.init_new_token_ratio - self.min_new_token_ratio
+            ) / decay_steps
```
```python
# Default retract_decode_steps = 20 (SGLang default)
retract_decode_steps = getattr(self, 'retract_decode_steps', 20)
```
FD_RETRACT_DECODE_STEPS was added to envs, but this code uses getattr(self, 'retract_decode_steps', 20) while __init__ never assigns self.retract_decode_steps, so the environment variable never takes effect (the value is always 20). Set self.retract_decode_steps = envs.FD_RETRACT_DECODE_STEPS in __init__ and use that field directly here.
Suggested change:

```diff
-        # Default retract_decode_steps = 20 (SGLang default)
-        retract_decode_steps = getattr(self, 'retract_decode_steps', 20)
+        # Default retract_decode_steps comes from envs (aligned with SGLang default 20)
+        self.retract_decode_steps = getattr(
+            self,
+            'retract_decode_steps',
+            envs.FD_RETRACT_DECODE_STEPS,
+        )
+        retract_decode_steps = self.retract_decode_steps
```
```python
# Sum all tokens and convert to blocks once at the end
total_tokens = required_tokens_for_prefill + max_new_tokens_for_request + decode_reserved_tokens
can_schedule_block_num_threshold = (
    total_tokens + self.config.cache_config.block_size - 1
) // self.config.cache_config.block_size
```
_calculate_decode_reserved_tokens_by_ratio() returns a float (reserved_tokens is computed by ratio), so total_tokens is also a float. Computing blocks with // then yields a float result (e.g. 3.0), while can_allocate_gpu_blocks() expects an int; floor division may also under-estimate the required blocks. After summing, use math.ceil(total_tokens / block_size) to get an int block count, or explicitly round/ceil decode_reserved_tokens to an int first.
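A minimal sketch of the suggested fix; the numeric values below are made up for illustration, and only the variable names come from the diff:

```python
import math

# Illustrative token counts; in the real code these come from the scheduler.
required_tokens_for_prefill = 130
max_new_tokens_for_request = 64
decode_reserved_tokens = 25.6  # float: max_new_tokens * new_token_ratio
block_size = 64

total_tokens = required_tokens_for_prefill + max_new_tokens_for_request + decode_reserved_tokens
# math.ceil over a true division returns an int and always rounds up,
# whereas float floor division returns a float (e.g. 3.0) and can under-count.
can_schedule_block_num_threshold = math.ceil(total_tokens / block_size)
```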
```python
"FD_INIT_NEW_TOKEN_RATIO": lambda: float(os.getenv("FD_INIT_NEW_TOKEN_RATIO", "0.7")),
"FD_MIN_NEW_TOKEN_RATIO_FACTOR": lambda: float(os.getenv("FD_MIN_NEW_TOKEN_RATIO_FACTOR", "0.14")),
"FD_NEW_TOKEN_RATIO_DECAY_STEPS": lambda: int(os.getenv("FD_NEW_TOKEN_RATIO_DECAY_STEPS", "600")),
"FD_RETRACT_DECODE_STEPS": lambda: int(os.getenv("FD_RETRACT_DECODE_STEPS", "20")),
```
The environment variable here does not match the PR description: the description mentions FD_NEW_TOKEN_RATIO_DECAY (a per-step decay rate), but the code introduces FD_NEW_TOKEN_RATIO_DECAY_STEPS (a total number of decay steps) and derives the decay in ResourceManagerV1 as (init - min) / steps. Align the PR description with the actual implementation (variable name and semantics) so that users configuring per the description are not silently ignored.
```diff
 """
-assert self.scheduler_config.max_num_seqs <= 256, (
+assert self.scheduler_config.max_num_seqs <= 512, (
     "The parameter `max_num_seqs` is not allowed to exceed 256, "
```
The max_num_seqs upper bound was raised to 512, but the assertion message still says "not allowed to exceed 256". Update the message accordingly (256 -> 512); otherwise it will mislead users debugging their configuration.
Suggested change:

```diff
-    "The parameter `max_num_seqs` is not allowed to exceed 256, "
+    "The parameter `max_num_seqs` is not allowed to exceed 512, "
```
Feature overview

Implements an SGLang-style dynamic resource management mechanism; the main pieces are described below.
Main changes

1. Dynamic token reservation (new_tokens_ratio)

Core logic:
- Instead of reserving blocks for the full `max_new_tokens`, reserve only `max_new_tokens * new_tokens_ratio` tokens' worth of blocks for decode.

New environment variables
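Read literally from the envs diff in this PR, the new variables and their defaults are as follows (a sketch mirroring the `os.getenv` pattern in fastdeploy/envs.py; run outside FastDeploy, these simply yield the defaults):

```python
import os

# Names and default values taken from the fastdeploy/envs.py diff in this PR.
FD_INIT_NEW_TOKEN_RATIO = float(os.getenv("FD_INIT_NEW_TOKEN_RATIO", "0.7"))
FD_MIN_NEW_TOKEN_RATIO_FACTOR = float(os.getenv("FD_MIN_NEW_TOKEN_RATIO_FACTOR", "0.14"))
FD_NEW_TOKEN_RATIO_DECAY_STEPS = int(os.getenv("FD_NEW_TOKEN_RATIO_DECAY_STEPS", "600"))
FD_RETRACT_DECODE_STEPS = int(os.getenv("FD_RETRACT_DECODE_STEPS", "20"))
```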
Workflow:
- The ratio decays over time and is reset to its initial value when `running == 0 && waiting == 0` (idle).

2. Preemption policy aligned with SGLang
Preemption ordering

Preempt requests with shorter output and longer input first (sort key: `(len(output_token_ids), -prompt_token_ids_len)` ascending).
KV cache eviction

Adds a new `_evict_decode_kv_cache` method that evicts KV cache following the SGLang policy.

Safety guards
- Preempted requests are re-enqueued at the tail of the `waiting` queue, preserving FIFO fairness.

Ratio update

`_update_new_token_ratio_after_preemption` counts only decode requests.

3. Priority scheduling flag
Adds a `--enable-priority-scheduling` flag that lets prefill preempt decode:

```bash
python -m fastdeploy.entrypoints.api_server \
    --enable-priority-scheduling \
    ...
```

- When `enable_priority_scheduling=True`, a prefill request that cannot obtain enough resources may trigger preemption.
- Defaults to `False` (aligned with SGLang).

Parameter pass-through path
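A toy sketch of the gate this flag controls; `MiniScheduler` and its methods are invented stand-ins for the real resource manager, not the PR's actual code:

```python
class MiniScheduler:
    """Invented stand-in for the resource manager's scheduling gate."""

    def __init__(self, free_blocks, enable_priority_scheduling=False):
        self.free_blocks = free_blocks
        # Defaults to False, aligned with SGLang.
        self.enable_priority_scheduling = enable_priority_scheduling

    def trigger_preempt(self, needed):
        # For the sketch, pretend preemption always frees enough blocks.
        self.free_blocks = max(self.free_blocks, needed)
        return True

    def try_schedule_prefill(self, required_blocks):
        if self.free_blocks >= required_blocks:
            return True
        # Prefill may preempt decode only when the flag is enabled.
        if self.enable_priority_scheduling:
            return self.trigger_preempt(required_blocks)
        return False
```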
4. Ratio continuity management

Adds a `reset_new_token_ratio_on_idle` method to refine the ratio reset logic: it is called from `schedule()` when `scheduled_reqs` is empty.

Files involved
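The idle-reset behavior can be sketched as follows (the class shape and attribute names are illustrative assumptions, not the PR's ResourceManagerV1):

```python
class RatioManager:
    """Illustrative stand-in for the resource manager's ratio state."""

    def __init__(self, init_ratio=0.7):
        self.init_new_token_ratio = init_ratio
        self.new_token_ratio = init_ratio

    def reset_new_token_ratio_on_idle(self, scheduled_reqs):
        # Called from schedule() when scheduled_reqs is empty: restore the
        # initial ratio so a fresh workload does not inherit a decayed value.
        if not scheduled_reqs:
            self.new_token_ratio = self.init_new_token_ratio
```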
Comparison with SGLang

Core code logic
Dynamic reservation calculation
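A minimal sketch of the reservation formula described above; request fields are simplified to a dict for illustration:

```python
def decode_reserved_tokens(running_decode_reqs, new_token_ratio):
    # Reserve only a fraction of each request's max_new_tokens budget
    # instead of the full amount.
    return sum(r["max_new_tokens"] * new_token_ratio for r in running_decode_reqs)
```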
Preemption ordering
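The intended ordering (shortest output first, longer input breaking ties) can be demonstrated with plain dicts standing in for Request objects:

```python
reqs = [
    {"id": "a", "output_len": 5, "input_len": 10},
    {"id": "b", "output_len": 2, "input_len": 30},
    {"id": "c", "output_len": 2, "input_len": 8},
]
# Ascending by output_len; negating input_len puts longer inputs first on ties.
reqs.sort(key=lambda r: (r["output_len"], -r["input_len"]))
preempt_order = [r["id"] for r in reqs]  # "b" first: shortest output, longest input
```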
Ratio decay
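Linear decay as described by (init - min) / decay_steps, using the env defaults from this PR (treating 0.14 directly as the minimum ratio here for illustration):

```python
init_ratio, min_ratio, decay_steps = 0.7, 0.14, 600
decay = (init_ratio - min_ratio) / decay_steps  # per-step decrement

ratio = init_ratio
for _ in range(decay_steps):
    # Clamp so the ratio never drops below the minimum.
    ratio = max(ratio - decay, min_ratio)
# After decay_steps steps the ratio has decayed to (approximately) min_ratio.
```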