
[Perf] Add CUDA Graph support#393

Closed
hughperkins wants to merge 16 commits into main from hp/cuda-graph-mvp

Conversation

@hughperkins (Collaborator) commented Mar 1, 2026

Issue: #

Brief Summary

Example usage:

    @qd.kernel(graph_while='counter')
    def inc(x: qd.types.ndarray(qd.i32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1
        for i in range(1):
            counter[None] = counter[None] - 1

    @qd.kernel(graph_while='keep_going')
    def increment_until_all_done(
            x: qd.types.ndarray(qd.i32, ndim=1),
            thresholds: qd.types.ndarray(qd.i32, ndim=1),
            keep_going: qd.types.ndarray(qd.i32, ndim=0)):
        # Work: increment elements that haven't reached their threshold
        for i in range(x.shape[0]):
            if x[i] < thresholds[i]:
                x[i] = x[i] + 1

        # Reduction: reset flag, then OR-reduce per-element conditions
        for i in range(1):
            keep_going[None] = 0
        for i in range(x.shape[0]):
            if x[i] < thresholds[i]:
                keep_going[None] = 1


When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.
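The capture-on-first-launch / replay pattern described above can be sketched in plain Python. This is a hand-rolled model of the launcher logic only; `GraphLauncher`, the cache key, and the closure-based "graph" are illustrative stand-ins, not the real Quadrants C++ implementation.

```python
# Toy model of capture-on-first-launch / replay-on-subsequent-launches.
# A "graph" here is just a closure over the kernel's offloaded tasks;
# the real launcher builds a CUDA graph with cuGraphAddKernelNode.

class GraphLauncher:
    def __init__(self):
        self.cache = {}      # launch_id -> captured "graph"
        self.captures = 0    # how many times the capture cost was paid

    def launch(self, launch_id, tasks, args):
        graph = self.cache.get(launch_id)
        if graph is None:
            # First launch: capture all offloaded tasks into one unit.
            self.captures += 1
            def graph(args, tasks=tuple(tasks)):
                for task in tasks:
                    task(args)
            self.cache[launch_id] = graph
        # Replay: one invocation instead of one launch per task.
        graph(args)

launcher = GraphLauncher()
buf = {"x": 0}
tasks = [lambda a: a.__setitem__("x", a["x"] + 1),   # task 1: x += 1
         lambda a: a.__setitem__("x", a["x"] * 2)]   # task 2: x *= 2
for _ in range(3):
    launcher.launch("inc_kernel", tasks, buf)
```

After three launches the tasks have run three times each (0 -> 2 -> 6 -> 14) but capture happened only once, which is the point of the optimization.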

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.

Made-with: Cursor

Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.

Made-with: Cursor

Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor

On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().

Made-with: Cursor

Implements @qd.kernel(graph_while='flag_arg') which wraps the kernel
offloaded tasks in a CUDA conditional while node (requires SM 9.0+).
The named argument is a scalar i32 ndarray on device; the loop
continues while its value is non-zero.

Key implementation details:
- Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a
  at runtime to access cudaGraphSetConditional device function
- CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures handle is reset each launch
- Works with both counter-based (decrement to 0) and boolean flag
  (set to 0 when done) patterns
- graph_while implicitly enables cuda_graph=True

Tests: counter, boolean done flag, multiple loops, graph replay.

Made-with: Cursor

…allback

The graph_while_arg_id was computed using Python-level parameter indices,
which is wrong when struct parameters are flattened into many C++ args
(e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks
the flattened C++ arg index during launch context setup and caches it.
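The indexing bug is easy to see in isolation. A hedged Python sketch (names and struct widths are made up for illustration, not the real Quadrants layout) of computing the flattened C++ arg index rather than the Python-level one:

```python
# Illustrative sketch: why a Python-level parameter index is wrong once
# struct parameters are flattened into multiple C++ args. Widths are made up.

def flattened_arg_index(params, target):
    """params: list of (name, n_cpp_args) pairs in declaration order.
    Returns the flattened C++ arg index where `target` starts."""
    idx = 0
    for name, width in params:
        if name == target:
            return idx
        idx += width
    raise KeyError(target)

# 6 Python params that flatten to 40 C++ args (structs expand, ndarrays don't).
params = [("state", 12), ("grad", 12), ("opts", 13),
          ("x", 1), ("thresholds", 1), ("keep_going", 1)]

python_index = [name for name, _ in params].index("keep_going")   # 5
cpp_index = flattened_arg_index(params, "keep_going")             # 39
```

Using the Python-level index (5) to locate the flag in the flattened C++ arg buffer would point at the middle of a struct; the flattened index must be tracked during launch context setup.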

Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and
AMDGPU backends so graph_while works identically on all platforms.

Made-with: Cursor

# Conflicts:
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/kernel_launcher.cpp
@hughperkins (Collaborator, Author):

opus-4.6 review:

  1. Per-kernel CUDA graph capture/replay (cuda_graph=True)

The Python API is @qd.kernel(cuda_graph=True). When a kernel has 2+ top-level for loops (offloaded tasks), it captures them into a CUDA graph on first launch and replays
on subsequent launches, eliminating per-kernel launch overhead.
Key design decisions:
• Uses the explicit graph node API (cuGraphAddKernelNode) rather than stream capture -- good choice, gives more control and avoids the pitfalls of stream capture.
• Persistent device buffers for arg/result survive across replays. On each replay, the arg buffer is re-uploaded so changed ndarray pointers are picked up correctly.
• Falls back silently to the normal launch path for single-loop kernels or non-CUDA backends (harmless no-op).

  2. GPU-side conditional while loops (graph_while='flag_arg')

The API is @qd.kernel(graph_while='counter') where 'counter' names a scalar i32 ndarray argument. This wraps the kernel tasks in a CUDA conditional while node (SM 9.0+ /
Hopper required), enabling GPU-side iteration without CPU round-trips -- very relevant for convergence loops in physics solvers.
A small PTX condition kernel is JIT-linked with libcudadevrt.a at runtime to call cudaGraphSetConditional. Cross-platform fallback (CPU, AMDGPU, CUDA non-graph) uses a
host-side do-while loop with device readback.
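The fallback path described above boils down to a host-side do-while loop. A plain-Python model of its shape (the launch call and the flag readback are stand-ins for a real kernel launch and a device-to-host copy, not the actual launcher API):

```python
# Plain-Python model of the host-side do-while fallback for graph_while on
# backends without conditional graph nodes (CPU, AMDGPU, CUDA non-graph path).

def run_graph_while(launch_body, flag):
    # do-while: the body always runs at least once, then repeats while the
    # scalar i32 flag is non-zero; flag[0] stands in for a device readback.
    while True:
        launch_body()
        if flag[0] == 0:
            break

state = {"count": 3, "body_runs": 0}
flag = [1]

def body():
    # Stand-in for launching the kernel's offloaded tasks once.
    state["body_runs"] += 1
    state["count"] -= 1
    flag[0] = 1 if state["count"] > 0 else 0

run_graph_while(body, flag)
```

With a starting count of 3, the body runs exactly three times and the loop exits once the flag reads back as zero; the conditional graph node path has the same observable semantics, minus the per-iteration readback.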

Things that look good

• Clean API layering: flags flow from Python decorator through LaunchContextBuilder to the C++ launcher without polluting intermediate layers.
• Cross-platform parity: all three backends (CPU, CUDA, AMDGPU) get the graph_while do-while fallback, so semantics are identical.
• Arg-change handling: re-resolving ndarray device pointers and re-uploading the arg buffer on replay is the right approach.
• Test coverage: comprehensive -- two-loop, three-loop, single-loop fallback, no-annotation, changed args, graph_while with counter/boolean/multi-loop/replay, and
cross-backend variants.

Concerns / things I'd flag

  1. CudaGraphNodeParams struct layout is fragile -- it manually mirrors the CUDA driver API struct with hardcoded offsets and padding. If NVIDIA changes the layout
    between driver versions, this breaks silently. Consider using the actual CUDA headers or at least a static_assert on sizeof.
  2. No compute capability check -- the PTX targets sm_90 and conditional graph nodes require Hopper. If you run graph_while on an Ampere or older GPU, it will presumably
    fail at JIT link time with an unhelpful error. A runtime check with a clear error message would be better.
  3. libcudadevrt.a discovery is brittle -- only two hardcoded paths are checked. Custom toolkit installs (e.g. via CUDA_HOME or conda) won't be found. Should check
    environment variables like CUDA_HOME/CUDA_PATH.
  4. No graph cache eviction -- cuda_graph_cache_ grows monotonically. Each entry holds persistent device arg/result buffers. In a long-running process with many distinct
    kernels, this could leak significant GPU memory.
  5. Removed documentation comments -- the diff strips several explanatory comments from the existing CUDA kernel launcher (e.g., the purpose of transfers and
    device_ptrs). In a diff that adds substantial complexity, removing existing documentation is undesirable.
  6. Do-while semantics -- the body always executes at least once. This is documented in tests but could surprise users who set counter=0. Worth documenting explicitly in
    the API or raising an error for counter <= 0.
  7. Thread safety -- cuda_graph_cache_, cond_kernel_func_, and cond_kernel_module_ have no synchronization. Fine if Quadrants is single-threaded, but worth a comment
    noting this assumption.
  8. No test for autodiff interaction -- adjoint kernels also get use_cuda_graph and graph_while_arg set. There are no tests verifying this works (or is properly ignored)
    for reverse-mode AD kernels.
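The do-while caveat in point 6 can be demonstrated with a tiny plain-Python model of the loop (illustrative only, using the boolean done-flag pattern):

```python
# Do-while semantics: the body executes before the condition is checked,
# so even a flag that is already 0 before the first launch yields one
# iteration of the kernel body.

def graph_while_run(body, flag):
    while True:
        body()              # always runs at least once
        if flag[0] == 0:    # condition checked only after the body
            break

flag = [0]                  # "already done" before the first launch
body_runs = [0]

def body():
    body_runs[0] += 1       # work would happen here

graph_while_run(body, flag)
```

The body still runs once despite the flag starting at zero, which is exactly the surprise worth documenting (or guarding against) in the API.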

────────────────────────────────────────

Overall, it's a solid MVP. The architecture is sound, the explicit graph node API is the right choice, and the graph_while feature is a compelling addition for iterative
GPU workloads. The main risks are around portability (hardcoded paths, struct layouts, no SM version check) and missing edge-case handling.

Falls back to non-graph path with a warning on pre-Hopper GPUs,
instead of failing with an unhelpful JIT link error.

Checks env-var-derived paths before the hardcoded fallbacks, so
custom toolkit installs (e.g. conda, non-default prefix) are found.
@hughperkins (Collaborator, Author):

additional information from opus on point 4:

thoughts on 'No graph cache eviction'?

Honestly, on closer look I think this is a non-issue and I was wrong to flag it.
The cache is keyed by launch_id, which corresponds to a distinct compiled kernel registration — not to individual calls. So the cache has at most one entry per
@qd.kernel(cuda_graph=True) definition. It doesn't grow with the number of invocations.
In practice:
• Users explicitly opt in per-kernel, so the number of entries is small (realistically 1-10 in a typical app)
• Each entry holds two small device buffers (arg buffer is typically a few hundred bytes, result buffer is ~8 bytes minimum) plus one graph exec object — negligible GPU
memory

The only way this could blow up is if someone programmatically generates many distinct kernels with cuda_graph=True in a loop, which would be a misuse with much bigger
problems (JIT compilation cost per kernel).
I'd leave it as-is. Adding LRU eviction would be complexity for no real benefit.
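The argument above can be checked with a toy model of the cache (names are illustrative; `launch_id` keying mirrors the description, not the real C++ code):

```python
# Toy model of cuda_graph_cache_: keyed by launch_id (one per compiled
# kernel registration), not per call, so the entry count is bounded by
# the number of distinct @qd.kernel(cuda_graph=True) definitions.

cache = {}

def launch(launch_id):
    if launch_id not in cache:
        # Stand-in for the graph exec object plus persistent device buffers.
        cache[launch_id] = object()
    return cache[launch_id]

for _ in range(10_000):     # many invocations of the same kernel
    launch("solver_step")
launch("other_kernel")      # a second distinct kernel definition
```

Ten thousand invocations of one kernel plus one call of another leave exactly two cache entries, supporting the "doesn't grow with invocations" claim.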

Document cuda_graph=True and graph_while API in kernel() docstring,
and add a user guide page covering usage patterns, cross-platform
behavior, and the do-while semantics constraint.
@hughperkins (Collaborator, Author):

another opus update:

thoughts on 'Thread safety -- cuda_graph_cache_, cond_kernel_func_, and cond_kernel_module_ have no synchronization.'?

Not a real concern. Quadrants kernel launches go through a single KernelLauncher instance per device, and the launch path is inherently sequential -- you can't
meaningfully launch two CUDA kernels on the same stream from different threads simultaneously. The CUDA driver context itself serializes stream operations.
If Quadrants ever adds multi-threaded launch support, it would require much broader synchronization changes (the existing contexts_ vector has the same non-thread-safe
pattern), so cuda_graph_cache_ wouldn't be the first thing to break. Not worth adding a mutex for a scenario that doesn't exist.
I'd leave it as-is.

These flags were being set on the autodiff adjoint kernel without
any testing. Leave them off until the interaction with reverse-mode
AD is explicitly validated.
@hughperkins hughperkins marked this pull request as draft March 11, 2026 19:40
@hughperkins (Collaborator, Author):

Replaced by smaller PRs.
