add OoTPatchedModuleFusedSDPA #2361
base: v3.6.post.oot
Conversation
@yiliu30 @Wei-Lin-Intel @czhu15
```python
causal_res = self.fp8_fsdpa_fwd(q, causal_k, causal_v, causal_mask, dropout_p, scale, False, softmax_mode)
causal_out, causal_m, causal_linv = (gqa_output_reshape(x) if gqa else x for x in causal_res[:3])
causal_m = causal_m.to(torch.float32)
causal_linv = causal_linv.to(torch.float32) * (128.0 if softmax_mode != "fp32" else 1.0)
```
Only fast mode requires the `* 128.0`; for the fp32 and None modes, the scale is 1.0.
Done, thx!
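For clarity, a minimal sketch of what the corrected scaling could look like after this review (assuming `"fast"` is the softmax mode string; the excerpt above only shows `"fp32"`):

```python
# Apply the 128.0 descale factor only in fast softmax mode; for the fp32 and
# None modes the scale is 1.0, as noted in the review comment above.
# ("fast" as the mode string is an assumption, not shown in this excerpt.)
causal_linv = causal_linv.to(torch.float32) * (128.0 if softmax_mode == "fast" else 1.0)
```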
yiliu30 left a comment
LGTM
Please add usage in PR desc :)
Done.
So this solution only truncates the prefix, without padding?
Motivation
The current implementation in vLLM calls FusedSDPA with a huge attention mask, which leads to poor performance. This PR introduces three implementations to achieve better performance.
Usage
Three environment variables are introduced to control the implementation:
- `PT_HPU_QKV_SLICE_SEQ_LEN_THLD`: int, the threshold on `kv_len` (= `q_len` + `prefix_len`) above which the implementations are applied; defaults to `4096`.
- `PT_HPU_QKV_SLICE_CHUNK_SIZE`: int, the chunk size used for slicing in the implementation; defaults to `PT_HPU_QKV_SLICE_SEQ_LEN_THLD`.
- `PT_HPU_QKV_SLICE_IMPL`: str, one of `['split_kv', 'slice_causal', 'slice_qkv']`; selects the implementation, defaults to `slice_qkv`.

A parsing sketch for these variables is shown below.
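The sketch is illustrative; the variable names and defaults follow this PR's description, but the actual parsing code in the patch may differ:

```python
import os

# Illustrative parsing of the three knobs; defaults follow the PR description.
seq_len_thld = int(os.environ.get("PT_HPU_QKV_SLICE_SEQ_LEN_THLD", "4096"))
chunk_size = int(os.environ.get("PT_HPU_QKV_SLICE_CHUNK_SIZE", str(seq_len_thld)))
impl = os.environ.get("PT_HPU_QKV_SLICE_IMPL", "slice_qkv")
assert impl in ("split_kv", "slice_causal", "slice_qkv"), f"unknown impl: {impl}"
```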
Implementations
Consider a FusedSDPA call with `q_len=11525` and `prefix_len=10752`. The lengths are respectively padded and truncated to `q_len=16384` and `prefix_len=8192` before FusedSDPA is called. The full attention mask is shown below, in which the gray parts stand for the values to be masked out. A sketch of how these rounded lengths could arise follows the image.

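One way the example numbers could arise, assuming the lengths are bucketed to powers of two (an assumption for illustration; vLLM's actual bucketing logic may differ):

```python
# Assumed power-of-two bucketing (illustrative only): pad q_len up to the next
# power of two, truncate prefix_len down to the previous power of two.
def round_up_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

def round_down_pow2(n: int) -> int:
    return 1 << (n.bit_length() - 1)

assert round_up_pow2(11525) == 16384    # q_len is padded up
assert round_down_pow2(10752) == 8192   # prefix_len is truncated down
```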
Notation
The following images include rectangles in three colors:
- rgb(255,0,0) (red): `is_causal=False` and `attn_mask is not None`
- rgb(255,255,0) (yellow): `is_causal=True` and `attn_mask=None`
- rgb(255,0,255) (magenta): `is_causal=False` and `attn_mask=None`
The original implementation
The original implementation passes the full attention mask and sets `is_causal=False` and `valid_seq_len=None`, which results in a poor TPC/MME pipeline. The implementation can be visualized as the following image; a small sketch of this baseline follows it.

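As a point of reference, here is an illustrative reconstruction of the baseline using stock PyTorch SDPA in place of the HPU FusedSDPA op, with small demo sizes (the real lengths are in the thousands, which is why the materialized mask is so costly):

```python
import torch
import torch.nn.functional as F

# Small demo sizes; the real q_len/prefix_len are in the thousands.
q_len, prefix_len, d = 1024, 512, 128
q = torch.randn(1, 1, q_len, d)
k = torch.randn(1, 1, prefix_len + q_len, d)
v = torch.randn(1, 1, prefix_len + q_len, d)

# Full [q_len, prefix_len + q_len] mask: prefix columns fully visible,
# causal columns lower-triangular. This whole mask is materialized.
full_mask = torch.cat(
    [torch.ones(q_len, prefix_len, dtype=torch.bool),
     torch.tril(torch.ones(q_len, q_len, dtype=torch.bool))],
    dim=-1,
)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=full_mask, is_causal=False)
```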
The SplitKV implementation
This implementation calls FusedSDPA twice, once for the prefix part and once for the causal part, and does not pass `attn_mask` for the prefix part, which gives better performance. A numerical sketch of the idea follows the image.

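To make the idea concrete, here is a minimal numerical sketch using plain PyTorch tensors rather than the HPU FusedSDPA op (`partial_attention`, `merge_partials`, and `split_kv_attention` are illustrative helpers, not the PR's API): the prefix and causal parts are computed separately and merged with their log-sum-exp statistics, which is what the `causal_m` / `causal_linv` bookkeeping in the diff above corresponds to.

```python
import torch

def partial_attention(q, k, v, scale, causal=False):
    # One attention partial over a subset of keys; returns the output together
    # with the per-row log-sum-exp needed for merging. When causal=True, q and
    # k are assumed to have equal sequence length.
    s = (q @ k.transpose(-2, -1)) * scale
    if causal:
        L = s.shape[-1]
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=q.device), diagonal=1)
        s = s.masked_fill(mask, float("-inf"))
    lse = torch.logsumexp(s, dim=-1)
    return torch.softmax(s, dim=-1) @ v, lse

def merge_partials(out_a, lse_a, out_b, lse_b):
    # Combine two attention partials computed over disjoint key sets; the
    # merged log-sum-exp allows folding in further partials later.
    lse = torch.logaddexp(lse_a, lse_b)
    out = (torch.exp(lse_a - lse).unsqueeze(-1) * out_a
           + torch.exp(lse_b - lse).unsqueeze(-1) * out_b)
    return out, lse

def split_kv_attention(q, k_prefix, v_prefix, k_causal, v_causal, scale):
    # Prefix part: every query sees all prefix keys, so no mask is passed.
    out_p, lse_p = partial_attention(q, k_prefix, v_prefix, scale)
    # Causal part: rely on the causal structure instead of a materialized mask.
    out_c, lse_c = partial_attention(q, k_causal, v_causal, scale, causal=True)
    return merge_partials(out_p, lse_p, out_c, lse_c)[0]
```

This is the same log-sum-exp merge used by flash-decoding-style split-KV attention; the two partial calls avoid ever materializing the full `[q_len, kv_len]` mask.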
The SliceCausal implementation
This implementation further slices the causal part into smaller chunks, as illustrated in the following image; a code sketch follows it.

The SliceQKV implementation
This implementation further slices the prefix part as well, as shown below; a sketch follows the image.
