
Fix PinnedMemoryResource IPC NUMA ID derivation #1699

Open
Andy-Jost wants to merge 3 commits into NVIDIA:main from Andy-Jost:refactor-mempool-hierarchy

Conversation

Contributor

Andy-Jost commented Feb 27, 2026

Summary

  • Fixes [BUG]: IPC-enabled pinned pool uses a fixed host NUMA node #1603: PinnedMemoryResource(ipc_enabled=True) hard-coded host NUMA node 0, causing failures on multi-NUMA systems when the active device is attached to a different NUMA node
  • Adds numa_id option to PinnedMemoryResourceOptions for explicit NUMA node selection
  • Refactors _MemPool hierarchy to separate shared pool machinery from device-specific concerns

Changes

Bugfix (_pinned_memory_resource.pyx):

  • When ipc_enabled=True and numa_id is not specified, derives the NUMA node from the current device's host_numa_id attribute (this requires an active CUDA context)
  • Adds numa_id: int | None = None to PinnedMemoryResourceOptions
  • Adds numa_id property to PinnedMemoryResource
  • Removes _check_numa_nodes warning machinery in favor of proper NUMA node selection
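The selection logic described above can be sketched in plain Python. This is a hypothetical model for illustration only: the function name resolve_numa_id and its signature are invented here, not cuda.core internals, and the non-IPC fallback of 0 is an assumption.

```python
# Hypothetical model of the NUMA-node selection described in the bullets
# above. Names and the non-IPC default are assumptions, not cuda.core's
# actual implementation (which lives in Cython).

def resolve_numa_id(numa_id, ipc_enabled, device_host_numa_id):
    """Pick the host NUMA node for a pinned memory pool."""
    if numa_id is not None:
        if numa_id < 0:
            raise ValueError(f"numa_id must be non-negative, got {numa_id}")
        return numa_id  # explicit selection always wins
    if ipc_enabled:
        # Derive from the active device instead of hard-coding node 0,
        # which was the bug on multi-NUMA systems.
        return device_host_numa_id
    return 0  # assumed placeholder for the non-IPC path


# On a system whose active device is attached to NUMA node 1:
assert resolve_numa_id(None, True, 1) == 1   # derived; old code used 0
assert resolve_numa_id(3, True, 1) == 3      # explicit override
```

The explicit/derived split mirrors the four-case behavior matrix the new tests cover.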

Refactor (_MemPool hierarchy):

  • Moves _dev_id, device_id, and peer_accessible_by from _MemPool into DeviceMemoryResource
  • Eliminates _MemPoolOptions; pool initialization refactored into freestanding cdef functions (MP_init_create_pool, MP_init_current_pool, MP_raise_release_threshold)
  • Extracts __init__ bodies into inline cdef helpers (_DMR_init, _PMR_init, _MMR_init)
  • Implements device_id as -1 for PinnedMemoryResource and ManagedMemoryResource
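The intended shape of the refactor can be modeled at the Python level. The classes below are a loose illustrative sketch, not the real Cython definitions: field names and constructors are simplified, and the shared-machinery base carries no device identity.

```python
# Simplified model of the refactored hierarchy described above. The real
# code is Cython; these classes only mirror the division of concerns.

class _MemPool:
    """Shared pool machinery; device identity no longer lives here."""
    def __init__(self, ipc_enabled=False):
        self.ipc_enabled = ipc_enabled


class DeviceMemoryResource(_MemPool):
    """Device-bound pools now own _dev_id / device_id themselves."""
    def __init__(self, dev_id):
        super().__init__()
        self._dev_id = dev_id

    @property
    def device_id(self):
        return self._dev_id


class PinnedMemoryResource(_MemPool):
    @property
    def device_id(self):
        return -1  # pinned host memory is not bound to one device


class ManagedMemoryResource(_MemPool):
    @property
    def device_id(self):
        return -1  # managed memory migrates; no fixed device


assert DeviceMemoryResource(2).device_id == 2
assert PinnedMemoryResource().device_id == -1
```

Moving the device fields down into DeviceMemoryResource lets the non-device-bound resources report -1 without dragging unused device state through the shared base.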

Test Coverage

  • 4 new tests covering the numa_id behavior matrix: default without IPC, default with IPC, explicit NUMA ID, negative NUMA ID error
  • Updated existing IPC tests to assert numa_id values
  • All existing memory and IPC tests pass (pinned, device, managed, peer access, IPC)

Made with Cursor

…ce-specific concerns

Move _dev_id, device_id, and peer_accessible_by from _MemPool into
DeviceMemoryResource. Eliminate _MemPoolOptions and refactor pool
initialization into freestanding cdef functions (MP_init_create_pool,
MP_init_current_pool, MP_raise_release_threshold) for cross-module
visibility. Extract __init__ bodies into inline cdef helpers (_DMR_init,
_PMR_init, _MMR_init) for consistency and shorter class definitions.

Implement device_id as -1 for PinnedMemoryResource and
ManagedMemoryResource since they are not device-bound.

Made-with: Cursor
…IDIA#1603)

PinnedMemoryResource(ipc_enabled=True) hardcoded host NUMA ID 0, causing
failures on multi-NUMA systems where the active device is attached to a
different NUMA node. Now derives the NUMA ID from the current device's
host_numa_id attribute, and adds an explicit numa_id option for manual
override. Removes the _check_numa_nodes warning machinery in favor of
proper NUMA node selection.

Made-with: Cursor
@Andy-Jost Andy-Jost added this to the cuda.core v0.7.0 milestone Feb 27, 2026
@Andy-Jost Andy-Jost added the bug (Something isn't working) and cuda.core (Everything related to the cuda.core module) labels Feb 27, 2026
@Andy-Jost Andy-Jost self-assigned this Feb 27, 2026

copy-pr-bot bot commented Feb 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Andy-Jost (Contributor, Author) commented:

/ok to test e55a26b

Comment on lines +69 to +72

    @property
    def device_id(self) -> int:
        """Return -1. Managed memory migrates automatically and is not tied to a specific device."""
        return -1
Member
Q: Wouldn't this be a breaking change? In the old implementation, device_id was used to initialize _MemPool._dev_id, which backed .device_id, but the new implementation returns .device_id = -1 unconditionally. I understand we meant to say the pages are migratable (not pinned), but maybe there is a better way to restore the capability of querying the preferred location?
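To make the concern concrete, here is a simplified before/after sketch. The class names, constructors, and stored fields are illustrative only, not the actual Cython definitions.

```python
# Hypothetical before/after model of the device_id behavior change the
# question raises. Simplified: real classes are Cython and take richer
# construction arguments.

class OldManagedMemoryResource:
    def __init__(self, device_id):
        self._dev_id = device_id  # stored via _MemPool._dev_id

    @property
    def device_id(self):
        return self._dev_id  # preferred location was queryable


class NewManagedMemoryResource:
    def __init__(self, device_id):
        self._preferred = device_id  # identity no longer surfaced

    @property
    def device_id(self):
        return -1  # reported as "not device-bound"


assert OldManagedMemoryResource(3).device_id == 3
assert NewManagedMemoryResource(3).device_id == -1
```

Callers that previously read .device_id to recover the construction-time device would now see -1, which is the compatibility question being asked.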

