
Conversation

@Binyang2014 (Contributor)

Introduce a handle cache for the AMD platform.
Avoid reaching the handle limit when too many IPC handles are opened.

For NVIDIA, we don't need this feature since the runtime counts handle references internally and reuses the same handle if it is already open.

Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces handle caching for the AMD platform to prevent reaching IPC handle limits when multiple processes open the same handles. NVIDIA GPUs handle reference counting internally, so this optimization is AMD-specific. The implementation uses a thread-safe cache with weak pointers to automatically reuse and release handles.

Key Changes:

  • Adds custom hash and equality operators for cudaIpcMemHandle_t to enable use in std::unordered_map
  • Implements getPeerMemoryHandle() function with AMD-specific caching using weak pointers and mutex protection
  • Refactors RegisteredMemory::Impl to use std::shared_ptr for automatic IPC handle lifetime management

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Reviewed files:

  • src/registered_memory.cc: Adds hash/equality operators for cudaIpcMemHandle_t, implements getPeerMemoryHandle with AMD-specific caching, updates the constructor to use cached handles, and removes manual IPC handle cleanup from the destructor.
  • src/include/registered_memory.hpp: Adds a peerHandle field to RegisteredMemory::Impl for managing the IPC handle lifetime via shared_ptr.
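
For readers skimming the summary, here is a minimal sketch of the caching pattern described above. It is an illustration under assumptions, not the PR's actual code: the names (ipcHandleCache, ipcHandleCacheMutex) are made up, and the cache is keyed on a std::string of the raw handle bytes for brevity, whereas the PR keys the map on cudaIpcMemHandle_t itself using the custom hash/equality functors mentioned in the key changes.

#include <cuda_runtime.h>

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

namespace {

// Process-wide cache of opened peer IPC handles. The weak_ptr entries expire
// once the last holder of the corresponding shared_ptr goes away.
std::unordered_map<std::string, std::weak_ptr<void>> ipcHandleCache;
std::mutex ipcHandleCacheMutex;

// Returns a shared pointer to the peer memory mapped from `handle`, reusing an
// already-open mapping when possible. Error handling is omitted in this sketch.
std::shared_ptr<void> getPeerMemoryHandle(const cudaIpcMemHandle_t& handle) {
  std::string key(reinterpret_cast<const char*>(&handle), sizeof(handle));
  std::lock_guard<std::mutex> lock(ipcHandleCacheMutex);
  auto it = ipcHandleCache.find(key);
  if (it != ipcHandleCache.end()) {
    if (auto cached = it->second.lock()) return cached;  // handle already open: reuse it
  }
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  // The deleter closes the handle only when the last owner releases it, which is
  // why no manual cudaIpcCloseMemHandle call is needed in a destructor.
  std::shared_ptr<void> opened(devPtr, [](void* p) { cudaIpcCloseMemHandle(p); });
  ipcHandleCache[key] = opened;
  return opened;
}

}  // namespace

The shared_ptr returned by such a function is what the new peerHandle field on RegisteredMemory::Impl would hold, so the handle's lifetime follows the RegisteredMemory's lifetime automatically.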


Copilot AI (Contributor) commented Dec 4, 2025

@Binyang2014 I've opened a new pull request, #699, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI (Contributor) commented Dec 4, 2025

@Binyang2014 I've opened a new pull request, #700, to work on those changes. Once the pull request is ready, I'll request review from you.

Binyang2014 and others added 3 commits December 4, 2025 11:36
- [x] Move hash specialization and equality operator from std/global
namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template
parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks


---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Binyang2014 <[email protected]>
Co-authored-by: Binyang Li <[email protected]>
@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

    return std::memcmp(lhs.reserved, rhs.reserved, sizeof(lhs.reserved)) == 0;
  }
};


Should we add "#if defined(HIP_PLATFORM_AMD)" here?

@Binyang2014 (Contributor, Author)


I think these two functions work for both the AMD and NVIDIA platforms. cudaIpcMemHandle_t has a similar structure on both, so we can keep it here.
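
As a rough sketch of what such operators can look like (illustrative names in a custom detail namespace per the commit checklist above; not the PR's exact code), note that both functors only read handle.reserved, which is why they compile against the CUDA type as well as HIP's equivalent without a platform guard:

#include <cuda_runtime.h>

#include <cstdint>
#include <cstring>
#include <memory>
#include <unordered_map>

namespace detail {  // a custom namespace, rather than specializing std::hash

struct CudaIpcMemHandleHash {
  std::size_t operator()(const cudaIpcMemHandle_t& handle) const noexcept {
    // FNV-1a over the opaque reserved bytes; any stable byte-wise hash works.
    std::uint64_t h = 14695981039346656037ull;
    for (unsigned char c : handle.reserved) {
      h = (h ^ c) * 1099511628211ull;
    }
    return static_cast<std::size_t>(h);
  }
};

struct CudaIpcMemHandleEqual {
  bool operator()(const cudaIpcMemHandle_t& lhs, const cudaIpcMemHandle_t& rhs) const noexcept {
    return std::memcmp(lhs.reserved, rhs.reserved, sizeof(lhs.reserved)) == 0;
  }
};

}  // namespace detail

// Passing the functors as template parameters keeps the std and global
// namespaces untouched.
using IpcHandleCacheMap =
    std::unordered_map<cudaIpcMemHandle_t, std::weak_ptr<void>,
                       detail::CudaIpcMemHandleHash, detail::CudaIpcMemHandleEqual>;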


TEST_F(ExecutorTest, TwoNodesAllreduce) {
  if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
    GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";


Any reason this line is removed?

@Binyang2014 (Contributor, Author)


It causes a UT failure. I think gtest doesn't work well with MPI.

Contributor


I haven't seen this causing a failure. Is this specific to this PR?

@Binyang2014 (Contributor, Author)


I think not. I reverted the change (keeping just the customized hash function for cudaIpcHandle) and still see this error. The error is: mpirun noticed that process rank 1 with PID 0 on node mscclpp-01 exited on signal 13 (Broken pipe). It might be caused by the destruction order.
I moved the skip to the SetUp function to mitigate this issue.
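
A minimal sketch of that mitigation, assuming the ExecutorTest fixture and the gEnv/worldSize/nRanksPerNode names from the snippet above (declared elsewhere in the test suite); everything else here is illustrative:

#include <gtest/gtest.h>

class ExecutorTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Skip before any per-test state is created. The thread above suspects the
    // broken-pipe failure relates to destruction order under mpirun, so keeping
    // the skip decision out of the test body sidesteps that path.
    if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
      GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";
    }
  }
};

TEST_F(ExecutorTest, TwoNodesAllreduce) {
  // Runs only when SetUp() did not skip.
}

GTEST_SKIP() issued from SetUp() marks the test as skipped and prevents the test body from running, so no per-test resources are created on ranks that should not participate.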

@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).


@chhwang force-pushed the binyli/handle_cache branch from fcb1ab6 to d97d230 on December 13, 2025 at 00:49
@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).
