
[Perf] Add CUDA Graph support#393

Closed
hughperkins wants to merge 16 commits into main from hp/cuda-graph-mvp

Conversation

@hughperkins (Collaborator) commented Mar 1, 2026

Issue: #

Brief Summary

Example usage:

    @qd.kernel(graph_while='counter')
    def inc(x: qd.types.ndarray(qd.i32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1
        for i in range(1):
            counter[None] = counter[None] - 1

    @qd.kernel(graph_while='keep_going')
    def increment_until_all_done(
            x: qd.types.ndarray(qd.i32, ndim=1),
            thresholds: qd.types.ndarray(qd.i32, ndim=1),
            keep_going: qd.types.ndarray(qd.i32, ndim=0)):
        # Work: increment elements that haven't reached their threshold
        for i in range(x.shape[0]):
            if x[i] < thresholds[i]:
                x[i] = x[i] + 1

        # Reduction: reset flag, then OR-reduce per-element conditions
        for i in range(1):
            keep_going[None] = 0
        for i in range(x.shape[0]):
            if x[i] < thresholds[i]:
                keep_going[None] = 1


When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.
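The capture-on-first-launch / replay pattern described above can be sketched in plain Python. This is a hand-rolled model of the launcher logic only; `GraphLauncher`, the cache key, and the closure-based "graph" are illustrative stand-ins, not the real Quadrants C++ implementation.

```python
# Toy model of capture-on-first-launch / replay-on-subsequent-launches.
# A "graph" here is just a closure over the kernel's offloaded tasks;
# the real launcher builds a CUDA graph with cuGraphAddKernelNode.

class GraphLauncher:
    def __init__(self):
        self.cache = {}      # launch_id -> captured "graph"
        self.captures = 0    # how many times the capture cost was paid

    def launch(self, launch_id, tasks, args):
        graph = self.cache.get(launch_id)
        if graph is None:
            # First launch: capture all offloaded tasks into one unit.
            self.captures += 1
            def graph(args, tasks=tuple(tasks)):
                for task in tasks:
                    task(args)
            self.cache[launch_id] = graph
        # Replay: one invocation instead of one launch per task.
        graph(args)

launcher = GraphLauncher()
buf = {"x": 0}
tasks = [lambda a: a.__setitem__("x", a["x"] + 1),   # task 1: x += 1
         lambda a: a.__setitem__("x", a["x"] * 2)]   # task 2: x *= 2
for _ in range(3):
    launcher.launch("inc_kernel", tasks, buf)
```

After three launches the tasks have run three times each (0 -> 2 -> 6 -> 14) but capture happened only once, which is the point of the optimization.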

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.

Made-with: Cursor

Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.

Made-with: Cursor

Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor

On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().

Made-with: Cursor

Implements @qd.kernel(graph_while='flag_arg') which wraps the kernel
offloaded tasks in a CUDA conditional while node (requires SM 9.0+).
The named argument is a scalar i32 ndarray on device; the loop
continues while its value is non-zero.

Key implementation details:
- Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a
  at runtime to access cudaGraphSetConditional device function
- CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures handle is reset each launch
- Works with both counter-based (decrement to 0) and boolean flag
  (set to 0 when done) patterns
- graph_while implicitly enables cuda_graph=True

Tests: counter, boolean done flag, multiple loops, graph replay.

Made-with: Cursor

…allback

The graph_while_arg_id was computed using Python-level parameter indices,
which is wrong when struct parameters are flattened into many C++ args
(e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks
the flattened C++ arg index during launch context setup and caches it.
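The indexing bug is easy to see in isolation. A hedged Python sketch (names and struct widths are made up for illustration, not the real Quadrants layout) of computing the flattened C++ arg index rather than the Python-level one:

```python
# Illustrative sketch: why a Python-level parameter index is wrong once
# struct parameters are flattened into multiple C++ args. Widths are made up.

def flattened_arg_index(params, target):
    """params: list of (name, n_cpp_args) pairs in declaration order.
    Returns the flattened C++ arg index where `target` starts."""
    idx = 0
    for name, width in params:
        if name == target:
            return idx
        idx += width
    raise KeyError(target)

# 6 Python params that flatten to 40 C++ args (structs expand, ndarrays don't).
params = [("state", 12), ("grad", 12), ("opts", 13),
          ("x", 1), ("thresholds", 1), ("keep_going", 1)]

python_index = [name for name, _ in params].index("keep_going")   # 5
cpp_index = flattened_arg_index(params, "keep_going")             # 39
```

Using the Python-level index (5) to locate the flag in the flattened C++ arg buffer would point at the middle of a struct; the flattened index must be tracked during launch context setup.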

Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and
AMDGPU backends so graph_while works identically on all platforms.

Made-with: Cursor

# Conflicts:
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
@hughperkins (Collaborator, Author):

opus-4.6 review:

  1. Per-kernel CUDA graph capture/replay (cuda_graph=True)

The Python API is @qd.kernel(cuda_graph=True). When a kernel has 2+ top-level for loops (offloaded tasks), it captures them into a CUDA graph on first launch and replays
on subsequent launches, eliminating per-kernel launch overhead.
Key design decisions:
• Uses the explicit graph node API (cuGraphAddKernelNode) rather than stream capture -- good choice, gives more control and avoids the pitfalls of stream capture.
• Persistent device buffers for arg/result survive across replays. On each replay, the arg buffer is re-uploaded so changed ndarray pointers are picked up correctly.
• Falls back silently to the normal launch path for single-loop kernels or non-CUDA backends (harmless no-op).

  2. GPU-side conditional while loops (graph_while='flag_arg')

The API is @qd.kernel(graph_while='counter') where 'counter' names a scalar i32 ndarray argument. This wraps the kernel tasks in a CUDA conditional while node (SM 9.0+ /
Hopper required), enabling GPU-side iteration without CPU round-trips -- very relevant for convergence loops in physics solvers.
A small PTX condition kernel is JIT-linked with libcudadevrt.a at runtime to call cudaGraphSetConditional. Cross-platform fallback (CPU, AMDGPU, CUDA non-graph) uses a
host-side do-while loop with device readback.
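The fallback path described above boils down to a host-side do-while loop. A plain-Python model of its shape (the launch call and the flag readback are stand-ins for a real kernel launch and a device-to-host copy, not the actual launcher API):

```python
# Plain-Python model of the host-side do-while fallback for graph_while on
# backends without conditional graph nodes (CPU, AMDGPU, CUDA non-graph path).

def run_graph_while(launch_body, flag):
    # do-while: the body always runs at least once, then repeats while the
    # scalar i32 flag is non-zero; flag[0] stands in for a device readback.
    while True:
        launch_body()
        if flag[0] == 0:
            break

state = {"count": 3, "body_runs": 0}
flag = [1]

def body():
    # Stand-in for launching the kernel's offloaded tasks once.
    state["body_runs"] += 1
    state["count"] -= 1
    flag[0] = 1 if state["count"] > 0 else 0

run_graph_while(body, flag)
```

With a starting count of 3, the body runs exactly three times and the loop exits once the flag reads back as zero; the conditional graph node path has the same observable semantics, minus the per-iteration readback.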

Things that look good

• Clean API layering: flags flow from Python decorator through LaunchContextBuilder to the C++ launcher without polluting intermediate layers.
• Cross-platform parity: all three backends (CPU, CUDA, AMDGPU) get the graph_while do-while fallback, so semantics are identical.
• Arg-change handling: re-resolving ndarray device pointers and re-uploading the arg buffer on replay is the right approach.
• Test coverage: comprehensive -- two-loop, three-loop, single-loop fallback, no-annotation, changed args, graph_while with counter/boolean/multi-loop/replay, and
cross-backend variants.

Concerns / things I'd flag

  1. CudaGraphNodeParams struct layout is fragile -- it manually mirrors the CUDA driver API struct with hardcoded offsets and padding. If NVIDIA changes the layout
    between driver versions, this breaks silently. Consider using the actual CUDA headers or at least a static_assert on sizeof.
  2. No compute capability check -- the PTX targets sm_90 and conditional graph nodes require Hopper. If you run graph_while on an Ampere or older GPU, it will presumably
    fail at JIT link time with an unhelpful error. A runtime check with a clear error message would be better.
  3. libcudadevrt.a discovery is brittle -- only two hardcoded paths are checked. Custom toolkit installs (e.g. via CUDA_HOME or conda) won't be found. Should check
    environment variables like CUDA_HOME/CUDA_PATH.
  4. No graph cache eviction -- cuda_graph_cache_ grows monotonically. Each entry holds persistent device arg/result buffers. In a long-running process with many distinct
    kernels, this could leak significant GPU memory.
  5. Removed documentation comments -- the diff strips several explanatory comments from the existing CUDA kernel launcher (e.g., the purpose of transfers and
    device_ptrs). In a diff that adds substantial complexity, removing existing documentation is undesirable.
  6. Do-while semantics -- the body always executes at least once. This is documented in tests but could surprise users who set counter=0. Worth documenting explicitly in
    the API or raising an error for counter <= 0.
  7. Thread safety -- cuda_graph_cache_, cond_kernel_func_, and cond_kernel_module_ have no synchronization. Fine if Quadrants is single-threaded, but worth a comment
    noting this assumption.
  8. No test for autodiff interaction -- adjoint kernels also get use_cuda_graph and graph_while_arg set. There are no tests verifying this works (or is properly ignored)
    for reverse-mode AD kernels.
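The do-while caveat in point 6 can be demonstrated with a tiny plain-Python model of the loop (illustrative only, using the boolean done-flag pattern):

```python
# Do-while semantics: the body executes before the condition is checked,
# so even a flag that is already 0 before the first launch yields one
# iteration of the kernel body.

def graph_while_run(body, flag):
    while True:
        body()              # always runs at least once
        if flag[0] == 0:    # condition checked only after the body
            break

flag = [0]                  # "already done" before the first launch
body_runs = [0]

def body():
    body_runs[0] += 1       # work would happen here

graph_while_run(body, flag)
```

The body still runs once despite the flag starting at zero, which is exactly the surprise worth documenting (or guarding against) in the API.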

────────────────────────────────────────

Overall, it's a solid MVP. The architecture is sound, the explicit graph node API is the right choice, and the graph_while feature is a compelling addition for iterative
GPU workloads. The main risks are around portability (hardcoded paths, struct layouts, no SM version check) and missing edge-case handling.

Falls back to non-graph path with a warning on pre-Hopper GPUs,
instead of failing with an unhelpful JIT link error.

Checks env-var-derived paths before the hardcoded fallbacks, so
custom toolkit installs (e.g. conda, non-default prefix) are found.
@hughperkins (Collaborator, Author):

additional information from opus on point 4:

thoughts on 'No graph cache eviction'?

Honestly, on closer look I think this is a non-issue and I was wrong to flag it.
The cache is keyed by launch_id, which corresponds to a distinct compiled kernel registration — not to individual calls. So the cache has at most one entry per
@qd.kernel(cuda_graph=True) definition. It doesn't grow with the number of invocations.
In practice:
• Users explicitly opt in per-kernel, so the number of entries is small (realistically 1-10 in a typical app)
• Each entry holds two small device buffers (arg buffer is typically a few hundred bytes, result buffer is ~8 bytes minimum) plus one graph exec object — negligible GPU
memory

The only way this could blow up is if someone programmatically generates many distinct kernels with cuda_graph=True in a loop, which would be a misuse with much bigger
problems (JIT compilation cost per kernel).
I'd leave it as-is. Adding LRU eviction would be complexity for no real benefit.
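The argument above can be checked with a toy model of the cache (names are illustrative; `launch_id` keying mirrors the description, not the real C++ code):

```python
# Toy model of cuda_graph_cache_: keyed by launch_id (one per compiled
# kernel registration), not per call, so the entry count is bounded by
# the number of distinct @qd.kernel(cuda_graph=True) definitions.

cache = {}

def launch(launch_id):
    if launch_id not in cache:
        # Stand-in for the graph exec object plus persistent device buffers.
        cache[launch_id] = object()
    return cache[launch_id]

for _ in range(10_000):     # many invocations of the same kernel
    launch("solver_step")
launch("other_kernel")      # a second distinct kernel definition
```

Ten thousand invocations of one kernel plus one call of another leave exactly two cache entries, supporting the "doesn't grow with invocations" claim.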

Document cuda_graph=True and graph_while API in kernel() docstring,
and add a user guide page covering usage patterns, cross-platform
behavior, and the do-while semantics constraint.
@hughperkins (Collaborator, Author):

another opus update:

thoughts on 'Thread safety -- cuda_graph_cache_, cond_kernel_func_, and cond_kernel_module_ have no synchronization.'?

Not a real concern. Quadrants kernel launches go through a single KernelLauncher instance per device, and the launch path is inherently sequential -- you can't
meaningfully launch two CUDA kernels on the same stream from different threads simultaneously. The CUDA driver context itself serializes stream operations.
If Quadrants ever adds multi-threaded launch support, it would require much broader synchronization changes (the existing contexts_ vector has the same non-thread-safe
pattern), so cuda_graph_cache_ wouldn't be the first thing to break. Not worth adding a mutex for a scenario that doesn't exist.
I'd leave it as-is.

These flags were being set on the autodiff adjoint kernel without
any testing. Leave them off until the interaction with reverse-mode
AD is explicitly validated.
@hughperkins hughperkins marked this pull request as draft March 11, 2026 19:40
@hughperkins (Collaborator, Author):

Replaced by smaller PRs.
