feat: add auto-detection for Blackwell GPU architecture (sm_100a) by yurekami · Pull Request #550 · deepseek-ai/DeepEP

yurekami · 2025-12-28T18:45:32Z

Summary

Add automatic GPU architecture detection to properly set TORCH_CUDA_ARCH_LIST for Blackwell GPUs (sm_100a/10.0a).

Problem

When building DeepEP on Blackwell GPUs, users encounter errors like:

Target SM ARCH unknown is not compatible
cudaErrorInsufficientDriver

This is because the default TORCH_CUDA_ARCH_LIST is set to 9.0, which doesn't match Blackwell's sm_100a architecture.

Solution

- Add get_cuda_arch_from_device() function to automatically detect GPU architecture
- Auto-detect sm_100a for Blackwell, sm_90a for Hopper
- Print detected architecture during build for user feedback
- Fall back to 9.0 if detection fails (maintains backward compatibility)

Users can still override with TORCH_CUDA_ARCH_LIST environment variable.
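
For reviewers, the resolution order described above (environment override, then device detection, then the 9.0 fallback) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the name get_cuda_arch_from_device() comes from the description, while arch_from_capability(), resolve_arch_list(), and FALLBACK_ARCH are hypothetical helpers, and the sketch assumes detection goes through torch.cuda.get_device_capability() and that any other capability maps to a plain "major.minor" entry.

```python
import os

# Fallback value named in the PR description.
FALLBACK_ARCH = "9.0"


def arch_from_capability(major, minor):
    """Map a (major, minor) compute capability to an arch-list entry.

    Hypothetical helper: only the Blackwell and Hopper mappings come from
    the PR description; the generic branch is an assumption.
    """
    if (major, minor) == (10, 0):
        return "10.0a"  # Blackwell (sm_100a)
    if (major, minor) == (9, 0):
        return "9.0a"  # Hopper (sm_90a)
    return f"{major}.{minor}"


def get_cuda_arch_from_device():
    """Return the arch entry for device 0, or None if detection fails."""
    try:
        import torch  # imported lazily so a missing torch counts as a failure

        if not torch.cuda.is_available():
            return None
        major, minor = torch.cuda.get_device_capability(0)
    except Exception:
        return None
    return arch_from_capability(major, minor)


def resolve_arch_list(env=None, detect=get_cuda_arch_from_device):
    """Pick the arch list: user override, then detection, then fallback."""
    env = os.environ if env is None else env
    override = env.get("TORCH_CUDA_ARCH_LIST")
    if override:
        return override  # an explicit user setting always wins
    detected = detect()
    arch = detected if detected else FALLBACK_ARCH
    print(f"Using CUDA arch list: {arch}")  # build-time feedback
    return arch
```

Keeping the mapping separate from the device query lets the fallback path be exercised without a GPU: with TORCH_CUDA_ARCH_LIST unset and no usable CUDA device, resolve_arch_list() returns 9.0.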

Test plan

- Build on Blackwell GPU system
- Build on Hopper GPU system
- Verify fallback works when GPU detection fails

Fixes #519

🤖 Generated with Claude Code

Add automatic GPU architecture detection to properly set TORCH_CUDA_ARCH_LIST for Blackwell GPUs (sm_100a/10.0a). This fixes build and runtime errors when using DeepEP on Blackwell systems.

Changes:
- Add get_cuda_arch_from_device() function to detect GPU architecture
- Auto-detect sm_100a for Blackwell, sm_90a for Hopper
- Print detected architecture during build for user feedback
- Fall back to 9.0 if detection fails

Users can still override with TORCH_CUDA_ARCH_LIST environment variable.

Fixes deepseek-ai#519

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…seek-ai#548)

The low-latency P2P communication hangs when using 8-byte width token counts because the signaling buffer type (int) doesn't match the expected size.

Changes:
- Add LL_SIGNAL_BITS config (default 32) with ll_signal_t typedef
- Add int64_t atomic primitives (ld_acquire_sys_global, st_release_sys_global)
- Update buffer types from int* to ll_signal_t* for signaling buffers
- Add conditional compilation for 64-bit NVSHMEM atomic operations
- Fix boundary check to use sizeof(ll_signal_t)

To enable 64-bit signaling: compile with -DLL_SIGNAL_BITS=64

Fixes: deepseek-ai#548

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Contributor and others added 2 commits December 29, 2025 03:43

yurekami wants to merge 2 commits into deepseek-ai:main from yurekami:fix-blackwell-gpu-support
