
[Perf] Streams 1: Add CUDA stream and event API #407

Draft
hughperkins wants to merge 2 commits into main from hp/streams-quadrantsic-1-cuda-streams

Conversation

@hughperkins (Collaborator) commented Mar 11, 2026

Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Routes kernel launcher memory ops through the active stream.
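For illustration, a hypothetical usage sketch of this API (the `qd` import alias, the kernels `my_kernel`/`other_kernel`, and their arguments are placeholders, not part of this PR):

```python
import quadrants as qd  # assumed import alias

s1 = qd.create_stream()        # no-op handle (0) on non-CUDA backends
s2 = qd.create_stream()
ev = qd.create_event()

my_kernel(x, qd_stream=s1)     # launch on stream s1
ev.record(s1)                  # mark the point where s1's work completes
ev.wait(s2)                    # s2 waits for that point before proceeding
other_kernel(y, qd_stream=s2)  # runs after s1's recorded work finishes
```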

Lines of code added: +481 - 197 - 4 - 4 = +276

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

@hughperkins hughperkins marked this pull request as draft March 11, 2026 23:52
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
@hughperkins (Collaborator, Author) commented:

Review from Opus (written before the last commit above):

PR Review: Add CUDA Stream and Event API

Branch: hp/streams-quadrantsic-1-cuda-streams
Commit: ab15b1b82 — "Add CUDA stream and event API for concurrent kernel execution"
Files changed: 11 (+443, -16)


Summary

This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams. It adds:

  • Python API (stream.py): Stream and Event wrapper classes with create_stream() / create_event() factory functions
  • C++ backend (program.cpp/h): 9 new Program methods wrapping CUDA driver calls for stream/event lifecycle
  • Kernel launch integration (kernel.py): A qd_stream= kwarg on kernel calls that sets the active CUDA stream around launch
  • Kernel launcher fix (kernel_launcher.cpp): Replaces hardcoded nullptr stream with CUDAContext::get_instance().get_stream() so that async memory ops respect the active stream
  • pybind11 exports and tests (197 lines of test coverage)

The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).


Issues and Concerns

1. Thread-safety of CUDAContext::stream_ (High)

CUDAContext is a singleton. The set_stream / get_stream methods read/write a bare void *stream_ with no synchronization:

// quadrants/rhi/cuda/cuda_context.h:116-118
void set_stream(void *stream) {
    stream_ = stream;
}

The context already has a std::mutex lock_ (line 27), but it is not used here. The kernel launch path in kernel.py does set_current_cuda_stream → launch_kernel → set_current_cuda_stream(0), which is not atomic. If two Python threads launch kernels on different streams, they'll race on stream_, and a kernel could be launched on the wrong stream.

Suggestion: Either protect stream_ with the existing mutex, use thread_local storage for the active stream, or document that concurrent multi-threaded kernel launches with different streams are unsupported.
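The thread_local option can be illustrated with a standalone Python analogue built on threading.local (the class below is a toy stand-in for this sketch, not the actual CUDAContext):

```python
import threading

class ToyCUDAContext:
    """Stand-in for CUDAContext where each thread sees its own active stream."""
    def __init__(self):
        self._tls = threading.local()

    def set_stream(self, stream):
        self._tls.stream = stream

    def get_stream(self):
        # Threads that never called set_stream see the default stream (0).
        return getattr(self._tls, "stream", 0)

ctx = ToyCUDAContext()
results = {}

def launch_on(stream_handle):
    ctx.set_stream(stream_handle)
    # Another thread's set_stream cannot clobber this thread's value.
    results[stream_handle] = ctx.get_stream()

threads = [threading.Thread(target=launch_on, args=(h,)) for h in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The same property holds for C++ thread_local storage; the trade-off is that a stream set on one thread is invisible to every other thread, so the launch path must set the stream on the same thread that performs the launch.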

2. Synchronous memcpy_host_to_device not updated (Medium)

In kernel_launcher.cpp, the PR correctly updates all async operations to use active_stream, but lines 90–101 still use synchronous memcpy_host_to_device for external host array transfers:

// quadrants/runtime/cuda/kernel_launcher.cpp:90-91
CUDADriver::get_instance().memcpy_host_to_device(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz);

Synchronous cuMemcpyHtoD implicitly synchronizes the default stream. When a user launches on a non-default stream with host-backed external arrays, this will introduce unintended synchronization with the default stream, potentially defeating the purpose of using separate streams.

Suggestion: Convert these to memcpy_host_to_device_async on active_stream, consistent with the rest of the changes.

3. __del__ calling into runtime during interpreter shutdown (Medium)

Both Stream.__del__ and Event.__del__ call self.destroy(), which accesses impl.get_runtime().prog:

# python/quadrants/lang/stream.py:31-36
def __del__(self):
    if self._handle != 0:
        try:
            self.destroy()
        except Exception:
            pass

During Python interpreter shutdown, the runtime/program may already be finalized. The bare except Exception: pass mitigates crashes, but leaked CUDA resources are still possible. Additionally, __del__ timing is non-deterministic — the CUDA context itself could be destroyed before these finalizers run.

Suggestion: Consider registering streams/events with the runtime for batch cleanup at shutdown, or use weakref.ref to the program to detect whether cleanup is still possible. At minimum, add a note in the docstring encouraging explicit destroy() calls or context manager usage.
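The weakref idea can be sketched with toy stand-ins (Program, Stream, and stream_destroy below are simplified placeholders for this sketch, not the PR's actual classes):

```python
import weakref

destroyed = []  # records which handles the "runtime" was asked to release

class Program:
    """Toy runtime object; may be finalized before its streams are."""
    def stream_destroy(self, handle):
        destroyed.append(handle)

class Stream:
    def __init__(self, prog, handle):
        self._prog_ref = weakref.ref(prog)  # does not keep the program alive
        self._handle = handle

    def destroy(self):
        prog = self._prog_ref()  # None once the program has been collected
        if prog is not None and self._handle != 0:
            prog.stream_destroy(self._handle)
        self._handle = 0

    def __del__(self):
        try:
            self.destroy()  # no-ops safely if the program died first
        except Exception:
            pass

prog = Program()
a = Stream(prog, 1)
a.destroy()   # normal path: runtime alive, handle released
b = Stream(prog, 2)
del prog      # simulate shutdown ordering: program goes away first
b.destroy()   # weakref is dead, so cleanup degrades to a no-op
```

Note that the CUDA stream is still leaked when the program dies first, which matches the point above: weakref makes shutdown safe, but explicit destroy() or batch cleanup is still needed to avoid leaks.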

4. Inconsistent parameter naming in Event API (Low)

Event.record uses stream as its parameter name, while Event.wait uses qd_stream:

# python/quadrants/lang/stream.py:52-53
def record(self, stream: Stream | None = None):
    """Record this event on a stream. None means the default stream."""
# python/quadrants/lang/stream.py:58-59
def wait(self, qd_stream: Stream | None = None):
    """Make a stream wait for this event. None means the default stream."""

I understand qd_stream is used to avoid colliding with the kernel **kwargs namespace, but that concern doesn't apply to Event.wait — it's a standalone method, not a kernel call. These should be consistent. I'd suggest using stream in both places since the qd_stream prefix is an implementation detail of the kernel dispatch path.

5. stream_synchronize doesn't guard against handle=0 (Low)

stream_destroy and event_destroy guard against handle == 0, but stream_synchronize does not:

// quadrants/program/program.cpp:527-533
void Program::stream_synchronize(uint64 stream_handle) {
#ifdef QD_WITH_CUDA
  if (compile_config().arch == Arch::cuda) {
    CUDADriver::get_instance().stream_synchronize(
        reinterpret_cast<void *>(stream_handle));
  }
#endif
}

Calling cuStreamSynchronize(nullptr) synchronizes the default stream, which is technically valid but may be unintentional. A Python Stream carrying handle == 0 (the no-op handle returned on non-CUDA backends) that reaches this path while arch is cuda would silently synchronize the CUDA default stream. The guard pattern should be consistent across all methods.


Minor Observations

  • Test coverage is solid. The test suite covers lifecycle, multi-stream, event sync, CPU no-op, and ndarray. Consider adding a test for using a destroyed stream (expect graceful no-op or error) and a test for multiple event records on the same event.

  • No context manager support. Stream would benefit from __enter__/__exit__ so users can write with qd.create_stream() as s: for automatic cleanup. This is a nice-to-have for a follow-up.

  • test_kernel_on_stream and similar tests use @test_utils.test() (all arches), which means they also run on CPU, where create_stream() returns handle=0 and the stream kwarg is a no-op. This exercises the no-crash path, which is fine, but the concurrency behavior is only actually tested with arch=[qd.cuda]. Consider being explicit about which tests are CUDA-specific correctness tests vs. CPU graceful-degradation tests.
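The context-manager idea above can be sketched with a toy wrapper (this Stream and its placeholder handle are illustrative, not the PR's class):

```python
class Stream:
    """Toy stream wrapper illustrating __enter__/__exit__ cleanup."""
    def __init__(self, handle):
        self._handle = handle

    def destroy(self):
        # Real code would release the underlying CUDA stream here.
        self._handle = 0

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False  # propagate exceptions rather than swallowing them

def create_stream():
    return Stream(handle=7)  # placeholder handle

with create_stream() as s:
    assert s._handle == 7  # stream is live inside the block
assert s._handle == 0      # destroyed on exit, even if the body raised
```

Returning False from __exit__ keeps user exceptions visible while still guaranteeing cleanup, which is the usual convention for resource wrappers.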


Verdict

The PR is well-structured and the API is clean. The kernel launcher changes, which propagate the active stream to all async memory operations, are the most important part and look correct. I'd recommend addressing the thread-safety concern (#1) before merge, even if only with documentation, and fixing the synchronous memcpy (#2) to avoid subtle correctness issues when users mix host arrays with non-default streams. The naming inconsistency (#4) is a quick fix worth doing now to avoid API churn later.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:19
@hughperkins hughperkins changed the title from "[Perf] CUDA Streams 1: Add CUDA stream and event API" to "[Perf] Streams 1: Add CUDA stream and event API" Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59