
feat: add auto-detection for Blackwell GPU architecture (sm_100a) #550

Open
yurekami wants to merge 2 commits into deepseek-ai:main from yurekami:fix-blackwell-gpu-support


@yurekami
Contributor

Summary

Add automatic GPU architecture detection to properly set TORCH_CUDA_ARCH_LIST for Blackwell GPUs (sm_100a/10.0a).

Problem

When building DeepEP on Blackwell GPUs, users encounter errors like:

Target SM ARCH unknown is not compatible
cudaErrorInsufficientDriver

This happens because TORCH_CUDA_ARCH_LIST defaults to 9.0 (Hopper), so the build does not target Blackwell's sm_100a architecture.

Solution

  • Add get_cuda_arch_from_device() function to automatically detect GPU architecture
  • Auto-detect sm_100a for Blackwell, sm_90a for Hopper
  • Print detected architecture during build for user feedback
  • Fall back to 9.0 if detection fails (maintains backward compatibility)

Users can still override the detected value by setting the TORCH_CUDA_ARCH_LIST environment variable.
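
For reviewers, the detection flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names arch_from_capability and resolve_arch_list are hypothetical, the message printed during the build is made up, and the sketch assumes detection uses torch.cuda.get_device_capability on device 0.

```python
import os


def arch_from_capability(major: int, minor: int) -> str:
    """Map a CUDA compute capability to a TORCH_CUDA_ARCH_LIST entry.

    Hypothetical helper: Hopper (9.0) and Blackwell (10.0) get the
    arch-specific 'a' suffix, matching the sm_90a / sm_100a targets
    named in this PR. Other capabilities are passed through unchanged.
    """
    if (major, minor) in {(9, 0), (10, 0)}:
        return f"{major}.{minor}a"
    return f"{major}.{minor}"


def get_cuda_arch_from_device(default: str = "9.0") -> str:
    """Return the arch string for GPU 0, or `default` if detection fails."""
    try:
        import torch  # imported lazily so a missing torch falls back too

        if not torch.cuda.is_available():
            return default
        major, minor = torch.cuda.get_device_capability(0)
        return arch_from_capability(major, minor)
    except Exception:
        # Any failure (no torch, no driver, no GPU) keeps the old default.
        return default


def resolve_arch_list(env=None) -> str:
    """Pick the arch list: an explicit environment override wins."""
    env = os.environ if env is None else env
    override = env.get("TORCH_CUDA_ARCH_LIST")
    if override:
        return override
    arch = get_cuda_arch_from_device()
    print(f"Detected CUDA architecture: {arch}")  # illustrative message
    return arch
```

In a build script this would be used along the lines of `os.environ["TORCH_CUDA_ARCH_LIST"] = resolve_arch_list()` before the extension is configured. Keeping the capability-to-string mapping in its own function makes the Blackwell and Hopper cases checkable on a machine without a GPU.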

Test plan

  • Build on Blackwell GPU system
  • Build on Hopper GPU system
  • Verify fallback works when GPU detection fails

Fixes #519

🤖 Generated with Claude Code

Contributor and others added 2 commits December 29, 2025 03:43
Add automatic GPU architecture detection to properly set TORCH_CUDA_ARCH_LIST
for Blackwell GPUs (sm_100a/10.0a). This fixes build and runtime errors when
using DeepEP on Blackwell systems.

Changes:
- Add get_cuda_arch_from_device() function to detect GPU architecture
- Auto-detect sm_100a for Blackwell, sm_90a for Hopper
- Print detected architecture during build for user feedback
- Fall back to 9.0 if detection fails

Users can still override the detected value by setting the TORCH_CUDA_ARCH_LIST environment variable.

Fixes deepseek-ai#519

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…seek-ai#548)

The low-latency P2P communication hangs when token counts are 8 bytes wide,
because the signaling buffer element type (int, 4 bytes) does not match the expected size.

Changes:
- Add LL_SIGNAL_BITS config (default 32) with ll_signal_t typedef
- Add int64_t atomic primitives (ld_acquire_sys_global, st_release_sys_global)
- Update buffer types from int* to ll_signal_t* for signaling buffers
- Add conditional compilation for 64-bit NVSHMEM atomic operations
- Fix boundary check to use sizeof(ll_signal_t)

To enable 64-bit signaling: compile with -DLL_SIGNAL_BITS=64

Fixes: deepseek-ai#548

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

assert not compatible on Blackwell GPUs
