Add split-k optimization for sm90, reduce through DSMEM. by Insideyyy · Pull Request #186 · deepseek-ai/DeepGEMM

Insideyyy · 2025-09-05T05:30:39Z

This PR adds a split-k optimization for sm90 that reduces the partial results of d through DSMEM (distributed shared memory).
It currently supports fp8 and bf16 Normal, MGroupedContiguous, and MGroupedMasked GEMMs on sm90.

fp8_gemm_1d2d on H20:

m x n x k	TFLOPS w/o split-k	TFLOPS w/ split-k (optional)
128 x 64 x 8192	12	21
128 x 128 x 8192	24	35
128 x 256 x 8192	47	64
128 x 1024 x 8192	137	137
128 x 1280 x 8192	137	151
256 x 64 x 8192	24	32
256 x 128 x 8192	47	64
256 x 256 x 8192	93	93
256 x 1024 x 8192	181	180
256 x 1280 x 8192	190	198

bf16_gemm on H20:

m x n x k	TFLOPS w/o split-k	TFLOPS w/ split-k (optional)
128 x 64 x 8192	7	15
128 x 128 x 8192	13	25
128 x 256 x 8192	26	42
128 x 1024 x 8192	76	76
128 x 1280 x 8192	76	90
256 x 64 x 8192	13	21
256 x 128 x 8192	26	41
256 x 256 x 8192	52	51
256 x 1024 x 8192	99	99
256 x 1280 x 8192	104	112
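
To make the tables easier to compare, the ratio of the two TFLOPS columns can be computed per shape. The snippet below is only a convenience sketch: the helper name `speedup` and the dictionary layout are illustrative, and the numbers are copied from a few rows of the tables above.

```python
def speedup(tflops_without, tflops_with):
    """Ratio of split-k throughput to baseline throughput."""
    return tflops_with / tflops_without


# (m, n, k) -> (TFLOPS w/o split-k, TFLOPS w/ split-k), copied from the tables above.
fp8_rows = {
    (128, 64, 8192): (12, 21),
    (128, 1024, 8192): (137, 137),
    (256, 128, 8192): (47, 64),
}
bf16_rows = {
    (128, 64, 8192): (7, 15),
    (128, 1024, 8192): (76, 76),
    (256, 128, 8192): (26, 41),
}

for name, rows in (("fp8", fp8_rows), ("bf16", bf16_rows)):
    for shape, (without, with_split_k) in rows.items():
        print(name, shape, f"{speedup(without, with_split_k):.2f}x")
```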

Notes:

Split-k is enabled automatically where possible, to improve SM utilization.
The k_slices partitions of the same (m_block_idx, n_block_idx) output tile are assigned to k_slices SMs within one thread block cluster, so that the intermediate results can be reduced through DSMEM.
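
The arithmetic behind the notes above can be sketched in plain Python. This is a numerical model only, not the kernel from this PR: the function names are hypothetical, the list of partial results stands in for each block's accumulator, and the final sum stands in for the DSMEM reduction. It does not model which block in the cluster performs the reduction or how the work is divided.

```python
def matmul_k_range(a, b, k_begin, k_end):
    """Partial product of a (m x k) and b (k x n) over k in [k_begin, k_end)."""
    m, n = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_begin, k_end)) for j in range(n)]
        for i in range(m)
    ]


def split_k_tile(a, b, k_slices):
    """Model one output tile computed by a cluster of `k_slices` blocks."""
    k = len(b)
    assert k % k_slices == 0, "this sketch assumes k divides evenly"
    k_per_slice = k // k_slices
    # Each cluster rank owns one contiguous k range and produces a partial tile.
    partials = [
        matmul_k_range(a, b, rank * k_per_slice, (rank + 1) * k_per_slice)
        for rank in range(k_slices)
    ]
    # Reduction: sum the partial tiles element-wise to obtain the final tile.
    m, n = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
# The split-k result matches the unsplit product over the full k range.
assert split_k_tile(a, b, 2) == matmul_k_range(a, b, 0, 4)
```

Because every slice of one output tile runs inside the same cluster, the partial tiles never need to leave on-chip memory before they are summed.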

LyricZhao · 2025-09-10T09:53:11Z

Great point for some shapes, may take some time to merge. Thanks!

Insideyyy · 2025-10-15T08:20:38Z

@LyricZhao Hello! The conflicts are resolved. Is there a plan to merge?

wuweiwhu · 2025-11-24T08:36:43Z

Hello @Insideyyy, really great work! Is there a comparison with the vanilla Split-K implementation?

Insideyyy · 2025-12-15T13:08:35Z

> Hello @Insideyyy, really great work! Is there a comparison with the vanilla Split-K implementation?

Do you mean launching a separate kernel to reduce over k?
In most cases I tested, that adds 2 us more overhead than reducing through DSMEM.
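
For context, a separate-kernel split-k typically writes each slice's partial tile to a global-memory workspace and then reads it back in the reduce kernel. The sketch below estimates that extra traffic under stated assumptions: fp32 partials (4 bytes), a 2-byte output element, and an illustrative `k_slices` value that is not taken from this PR. The function name is hypothetical, and the estimate does not reproduce the 2 us figure measured above, which also includes the cost of the second kernel launch.

```python
def separate_kernel_reduce_traffic(m, n, k_slices, partial_bytes=4, out_bytes=2):
    """Global-memory bytes moved by a two-kernel split-k (assumed fp32 partials)."""
    workspace = m * n * k_slices * partial_bytes
    return {
        "workspace_write_bytes": workspace,  # partial tiles written by the GEMM kernel
        "workspace_read_bytes": workspace,   # the same partial tiles read by the reduce kernel
        "output_write_bytes": m * n * out_bytes,
    }


# Illustrative shape from the tables above; k_slices=4 is an assumption.
traffic = separate_kernel_reduce_traffic(m=128, n=64, k_slices=4)
print(traffic)
```

Reducing inside the cluster avoids the workspace round trip entirely, since only the final tile is written to global memory.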

Add split-k optimization for sm90, reducing through DSMEM.

c6e6b82

Insideyyy mentioned this pull request Sep 5, 2025

Support StreamK when scheduling #41

Closed

Resolve conflicts; support split-k for sm90_fp8_gemm_1d1d.

aecf90d

Insideyyy wants to merge 2 commits into deepseek-ai:main from Insideyyy:Insideyyy/split-k-sm90
