Conversation
When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded tasks) are captured into a CUDA graph on first launch and replayed on subsequent launches, eliminating per-kernel launch overhead. Uses the explicit graph node API (cuGraphAddKernelNode) with persistent device arg/result buffers. Assumes stable ndarray device pointers. Made-with: Cursor
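The capture-on-first-launch / replay-thereafter behavior described above can be modeled in plain Python. This is a hypothetical mock (the real launcher uses the CUDA driver graph API, e.g. cuGraphAddKernelNode and graph instantiation, on the C++ side); it only illustrates why replay collapses N per-kernel launches into one graph launch.

```python
# Hypothetical pure-Python mock of capture-on-first-launch / replay semantics.
# The real implementation builds a CUDA graph with the explicit node API;
# here a "graph" is just the cached task list and "launches" counts overhead.

class GraphCachingLauncher:
    def __init__(self):
        self.graph = None      # instantiated graph, populated on first launch
        self.launches = 0      # number of launch operations issued

    def _launch_kernels(self, tasks, args):
        # Stand-in for launching each offloaded task as a separate kernel.
        self.launches += len(tasks)
        return [task(args) for task in tasks]

    def launch(self, tasks, args):
        if self.graph is None:
            # First launch: "capture" the task list into a graph.
            self.graph = list(tasks)
            return self._launch_kernels(tasks, args)
        # Subsequent launches: replay the captured graph -- one launch
        # for the whole graph instead of one per offloaded task.
        self.launches += 1
        return [task(args) for task in self.graph]

launcher = GraphCachingLauncher()
tasks = [lambda a: a + 1, lambda a: a * 2]   # two top-level offloaded tasks
launcher.launch(tasks, 3)   # capture: one launch per task
launcher.launch(tasks, 3)   # replay: a single graph launch
```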
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in. The flag flows from the Python decorator through LaunchContextBuilder to the CUDA kernel launcher, avoiding interference with internal kernels like ndarray_to_ext_arr. Made-with: Cursor
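A minimal sketch of the per-kernel opt-in, assuming the `@qd.kernel(cuda_graph=True)` decorator API described in this PR; the `qd` internals are mocked here, and only the flag-carrying pattern is real. Internal kernels that never pass the flag stay on the default path, which is how interference with helpers like `ndarray_to_ext_arr` is avoided.

```python
# Sketch of a per-kernel opt-in flag carried by the decorator rather than a
# global env var. 'kernel' here is a stand-in for qd.kernel; the flag it
# records is what flows down to the launch context / launcher.

import functools

def kernel(cuda_graph=False):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.cuda_graph = cuda_graph   # read later by the launcher
        return wrapper
    return decorator

@kernel(cuda_graph=True)
def user_kernel(x):
    return x + 1

@kernel()   # internal kernels default off, so they never hit the graph path
def ndarray_to_ext_arr(x):
    return x
```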
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends (tested on x64/CPU). Passes on both x64 and CUDA. Made-with: Cursor
On each graph replay, re-resolve ndarray device pointers and re-upload the arg buffer to the persistent device buffer. This ensures correct results when the kernel is called with different ndarrays after the graph was first captured. Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs(). Made-with: Cursor
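The replay-time refresh can be sketched as follows. All names here are illustrative (this is not the real `resolve_ctx_ndarray_ptrs()` signature); the point is that device pointers are re-read from the *current* arguments on every replay and re-uploaded to the persistent buffer, so a cached graph stays correct when called with different ndarrays.

```python
# Minimal sketch: pointers are re-resolved per replay, not frozen at capture.
# id() stands in for reading an ndarray's current device pointer.

def resolve_ctx_ndarray_ptrs(ndarrays):
    return [id(a) for a in ndarrays]

class PersistentArgBuffer:
    def __init__(self):
        self.device_copy = None
    def upload(self, ptrs):
        # Re-upload the arg buffer to the persistent device buffer.
        self.device_copy = list(ptrs)

buf = PersistentArgBuffer()
a, b = [1, 2], [3, 4]
buf.upload(resolve_ctx_ndarray_ptrs([a]))   # first launch (graph capture)
first = buf.device_copy
buf.upload(resolve_ctx_ndarray_ptrs([b]))   # replay with a different ndarray
```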
Implements @qd.kernel(graph_while='flag_arg'), which wraps the kernel's offloaded tasks in a CUDA conditional while node (requires SM 9.0+). The named argument is a scalar i32 ndarray on device; the loop continues while its value is non-zero. Key implementation details:
- Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a at runtime to access the cudaGraphSetConditional device function
- CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures the handle is reset on each launch
- Works with both counter-based (decrement to 0) and boolean flag (set to 0 when done) patterns
- graph_while implicitly enables cuda_graph=True
Tests: counter, boolean done flag, multiple loops, graph replay. Made-with: Cursor
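The loop semantics above can be modeled in plain Python. This is a behavioral sketch only (the real mechanism is a device-side condition kernel calling cudaGraphSetConditional): the graph body runs, then the scalar flag is read, and the loop repeats while it is non-zero, i.e. do-while semantics.

```python
# Pure-Python model of graph_while: 'flag' is a one-element list standing in
# for a scalar i32 device ndarray; 'body' stands in for the captured graph.

def run_graph_while(body, flag):
    iterations = 0
    while True:
        body(flag)              # replay the graph body (runs at least once)
        iterations += 1
        if flag[0] == 0:        # condition kernel reads the flag after the body
            return iterations

# Counter pattern: the body decrements the counter toward 0.
counter = [5]
n = run_graph_while(lambda f: f.__setitem__(0, f[0] - 1), counter)
```

The boolean-flag pattern is the same loop with a body that sets the flag to 0 when its convergence check passes.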
…allback The graph_while_arg_id was computed using Python-level parameter indices, which is wrong when struct parameters are flattened into many C++ args (e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks the flattened C++ arg index during launch context setup and caches it. Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and AMDGPU backends so graph_while works identically on all platforms. Made-with: Cursor
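The index bug above comes down to counting over the wrong argument list. A minimal sketch of the corrected bookkeeping, with hypothetical names: each Python parameter knows how many C++ args it flattens into, and the target's index is accumulated over those widths.

```python
# Illustrative fix: compute the graph_while arg index over the flattened
# C++ argument list, since one struct parameter can expand into many args.

def flattened_arg_index(params, target_name):
    """params: (name, n_cpp_args) pairs in declaration order.
    Returns the flattened C++ arg index where target_name starts."""
    idx = 0
    for name, width in params:
        if name == target_name:
            return idx
        idx += width   # struct params contribute several C++ args each
    raise KeyError(target_name)

# A solver-like signature: two structs expand, the scalar flag is width 1.
params = [("solver_state", 12), ("mesh", 8), ("done_flag", 1)]
```

Indexing by Python position would put `done_flag` at 2; the flattened index is 20.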
Made-with: Cursor
# Conflicts:
#   quadrants/runtime/amdgpu/kernel_launcher.cpp
#   quadrants/runtime/cuda/kernel_launcher.cpp
opus-4.6 review:
The Python API is @qd.kernel(cuda_graph=True). When a kernel has 2+ top-level for loops (offloaded tasks), it captures them into a CUDA graph on first launch and replays it on subsequent launches.

The API is @qd.kernel(graph_while='counter'), where 'counter' names a scalar i32 ndarray argument. This wraps the kernel tasks in a CUDA conditional while node (SM 9.0+).

Things that look good
• Clean API layering: flags flow from the Python decorator through LaunchContextBuilder to the C++ launcher without polluting intermediate layers.

Concerns / things I'd flag
────────────────────────────────────────
Overall, it's a solid MVP. The architecture is sound, the explicit graph node API is the right choice, and the graph_while feature is a compelling addition for iterative…
Falls back to non-graph path with a warning on pre-Hopper GPUs, instead of failing with an unhelpful JIT link error.
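A sketch of the capability gate implied above, with illustrative names: conditional graph nodes need SM 9.0+ (Hopper), so older GPUs should get a warning and the non-graph path rather than an opaque JIT link failure.

```python
# Hypothetical launch-path selection: graph_while degrades gracefully on
# pre-Hopper GPUs instead of failing inside the JIT link step.

import warnings

def select_launch_path(compute_capability, graph_while_requested):
    if graph_while_requested and compute_capability < (9, 0):
        warnings.warn(
            "graph_while requires SM 9.0+; falling back to non-graph path")
        return "fallback"
    return "graph" if graph_while_requested else "plain"
```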
Checks env-var-derived paths before the hardcoded fallbacks, so custom toolkit installs (e.g. conda, non-default prefix) are found.
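The search-order fix can be sketched as follows. CUDA_HOME / CUDA_PATH are common conventions assumed here (the source does not name the exact variables): environment-derived locations are returned before the hardcoded default, so conda and other non-default prefixes win.

```python
# Illustrative toolkit-path search order: env-var-derived paths first,
# hardcoded fallback last.

import os

def candidate_toolkit_dirs(env=None):
    env = os.environ if env is None else env
    candidates = []
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            candidates.append(env[var])
    candidates.append("/usr/local/cuda")   # hardcoded fallback, checked last
    return candidates
```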
additional information from opus on point 4: thoughts on 'No graph cache eviction'? Honestly, on closer look I think this is a non-issue and I was wrong to flag it. The only way this could blow up is if someone programmatically generates many distinct kernels with cuda_graph=True in a loop, which would be a misuse with much bigger…
Document cuda_graph=True and graph_while API in kernel() docstring, and add a user guide page covering usage patterns, cross-platform behavior, and the do-while semantics constraint.
another opus update: thoughts on 'Thread safety -- cuda_graph_cache_, cond_kernel_func_, and cond_kernel_module_ have no synchronization.'? Not a real concern. Quadrants kernel launches go through a single KernelLauncher instance per device, and the launch path is inherently sequential -- you can't…
These flags were being set on the autodiff adjoint kernel without any testing. Leave them off until the interaction with reverse-mode AD is explicitly validated.
Replaced by smaller PRs.