feat(scheduler): align SGLang chunked prefill strategy #6607
Status: Open
Foriv wants to merge 6 commits into PaddlePaddle:develop
Conversation
added 6 commits on March 2, 2026 at 10:15
- Fix deadlock when 1 decode request remains
- Use sync free_block_ids for block release
- Fix preemption failure loop handling (continue vs break)
- Align block pre-allocation with SGLang
- Ensure decode tasks are always scheduled
- Fix origin_input_ids AttributeError

…detection:
- Add chunked_prefill_size parameter in args_utils.py for dynamic calculation based on GPU memory
- Add chunked_prefill_size attribute in SchedulerConfig class (default: 8192)
- Use paddle.device.cuda.get_device_properties() instead of torch for GPU memory detection
- Update resource_manager_v1.py to use min(num_new_tokens, chunked_prefill_size, token_budget), aligning with SGLang's _rem_tokens = min(rem_chunk_tokens, rem_total_tokens) logic

- Only reserve max_new_tokens for the last prefill chunk
- Use the current chunk's token count instead of the full prefill token count
- Reduces the admission threshold and improves throughput
CLA check: Foriv does not appear to be a GitHub user. A GitHub account is required to sign the CLA; if you already have one, add the email address used for this commit to your account.
Feature Overview
Align with SGLang's chunked prefill scheduling strategy, optimizing the token budget calculation and the token reservation logic:
Main Changes
1. Dynamic chunked prefill size calculation
Aligned with SGLang's logic.
SGLang's calculation:
FastDeploy's implementation:
New parameter: chunked_prefill_size
Computed automatically from GPU memory:
GPU memory detection switched to PaddlePaddle
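As a rough sketch of this memory-based selection: the helper and thresholds below are illustrative stand-ins, not the exact values used by SGLang or this PR (only the 8192 default comes from the PR description).

```python
def default_chunked_prefill_size(total_mem_gb: float) -> int:
    """Pick a chunked-prefill token cap from GPU memory size.

    The thresholds here are hypothetical examples; the PR computes the
    value dynamically and falls back to a SchedulerConfig default of 8192.
    """
    if total_mem_gb < 25:
        return 2048
    if total_mem_gb < 80:
        return 4096
    return 8192  # default mentioned in the PR's SchedulerConfig


# In FastDeploy the memory size would come from PaddlePaddle rather
# than torch, e.g.:
#   props = paddle.device.cuda.get_device_properties(0)
#   total_mem_gb = props.total_memory / (1024 ** 3)
print(default_chunked_prefill_size(140))
```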
2. Optimized chunked prefill reservation strategy
Only the last chunk reserves max_new_tokens:
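A minimal sketch of this reservation rule, with a hypothetical helper name (the real logic lives in the resource manager): intermediate chunks are admitted against their own token count only, and the decode budget is added solely for the final chunk.

```python
def tokens_to_reserve(remaining_prefill_tokens: int,
                      chunk_tokens: int,
                      max_new_tokens: int) -> int:
    """Admission cost of scheduling one prefill chunk (illustrative).

    Intermediate chunks reserve only their own tokens; the final chunk
    additionally reserves max_new_tokens for decoding. This lowers the
    admission threshold for intermediate chunks and improves throughput.
    """
    is_last_chunk = chunk_tokens >= remaining_prefill_tokens
    reserve = min(chunk_tokens, remaining_prefill_tokens)
    if is_last_chunk:
        reserve += max_new_tokens
    return reserve


print(tokens_to_reserve(10000, 4096, 512))  # intermediate chunk: 4096
print(tokens_to_reserve(3000, 4096, 512))   # last chunk: 3000 + 512 = 3512
```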
3. Aligned intermediate-chunk handling with SGLang's chunked prefill
About ScheduledExtendBlocksTask
In the chunked prefill scenario, SGLang splits a prefill into multiple chunks:
Problem
The worker's gpu_model_runner.py only handles the PREFILL, DECODE, and PREEMPTED task types, so the EXTEND type (ScheduledExtendBlocksTask) cannot be processed correctly.
Solution
Convert ScheduledExtendBlocksTask into a PREFILL task with new_tokens=0: the PREFILL type is already handled by the worker, and new_tokens=0 marks the step as an extend operation for an intermediate chunk.
Affected Files
Comparison with SGLang
Change comparison (example with 140 GB of GPU memory)
Core Code Logic
Chunked prefill size calculation
Dynamic per-step prefill token count calculation
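The per-step token count described above can be sketched in one line; this mirrors the PR's min(num_new_tokens, chunked_prefill_size, token_budget) and SGLang's _rem_tokens = min(rem_chunk_tokens, rem_total_tokens):

```python
def prefill_tokens_this_step(num_new_tokens: int,
                             chunked_prefill_size: int,
                             token_budget: int) -> int:
    """Tokens to prefill for a request in one scheduler step.

    A request can never exceed its remaining prompt tokens, the
    per-chunk cap, or the step's remaining token budget.
    """
    return min(num_new_tokens, chunked_prefill_size, token_budget)


print(prefill_tokens_this_step(20000, 8192, 6000))   # budget-bound: 6000
print(prefill_tokens_this_step(20000, 8192, 16384))  # chunk-bound: 8192
print(prefill_tokens_this_step(1000, 8192, 16384))   # request fits: 1000
```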
Intermediate chunk handling for chunked prefill
Dependencies
Depends on PR #1 (SGLang-style dynamic resource management)