
Conversation

@Binyang2014 (Contributor)

Introduce a handle cache for the AMD platform.
Avoid reaching the handle limit when too many IPC handles are opened.

For NVIDIA, we don't need this feature since the runtime counts handle references internally and reuses the same handle if it is already open.

Copilot AI (Contributor) left a comment


Pull request overview

This PR introduces handle caching for the AMD platform to prevent reaching IPC handle limits when multiple processes open the same handles. NVIDIA GPUs handle reference counting internally, so this optimization is AMD-specific. The implementation uses a thread-safe cache with weak pointers to automatically reuse and release handles.

Key Changes:

  • Adds custom hash and equality operators for cudaIpcMemHandle_t to enable use in std::unordered_map
  • Implements getPeerMemoryHandle() function with AMD-specific caching using weak pointers and mutex protection
  • Refactors RegisteredMemory::Impl to use std::shared_ptr for automatic IPC handle lifetime management

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

Reviewed files:

  • src/registered_memory.cc: Adds hash/equality operators for cudaIpcMemHandle_t, implements getPeerMemoryHandle with AMD-specific caching, updates the constructor to use cached handles, and removes manual IPC handle cleanup from the destructor.
  • src/include/registered_memory.hpp: Adds a peerHandle field to RegisteredMemory::Impl for managing the IPC handle lifetime via shared_ptr.
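
For readers skimming the summary, here is a minimal sketch of the caching pattern described above. It is an illustration under assumptions, not the PR's actual code: the names (ipcHandleCache, ipcHandleCacheMutex) are made up, and the cache is keyed on a std::string of the raw handle bytes for brevity, whereas the PR keys the map on cudaIpcMemHandle_t itself using the custom hash/equality functors mentioned in the key changes.

#include <cuda_runtime.h>

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

namespace {

// Process-wide cache of opened peer IPC handles. The weak_ptr entries expire
// once the last holder of the corresponding shared_ptr goes away.
std::unordered_map<std::string, std::weak_ptr<void>> ipcHandleCache;
std::mutex ipcHandleCacheMutex;

// Returns a shared pointer to the peer memory mapped from `handle`, reusing an
// already-open mapping when possible. Error handling is omitted in this sketch.
std::shared_ptr<void> getPeerMemoryHandle(const cudaIpcMemHandle_t& handle) {
  std::string key(reinterpret_cast<const char*>(&handle), sizeof(handle));
  std::lock_guard<std::mutex> lock(ipcHandleCacheMutex);
  auto it = ipcHandleCache.find(key);
  if (it != ipcHandleCache.end()) {
    if (auto cached = it->second.lock()) return cached;  // handle already open: reuse it
  }
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  // The deleter closes the handle only when the last owner releases it, which is
  // why no manual cudaIpcCloseMemHandle call is needed in a destructor.
  std::shared_ptr<void> opened(devPtr, [](void* p) { cudaIpcCloseMemHandle(p); });
  ipcHandleCache[key] = opened;
  return opened;
}

}  // namespace

The shared_ptr returned by such a function is what the new peerHandle field on RegisteredMemory::Impl would hold, so the handle's lifetime follows the RegisteredMemory's lifetime automatically.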


Copilot AI (Contributor) commented Dec 4, 2025

@Binyang2014 I've opened a new pull request, #699, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI (Contributor) commented Dec 4, 2025

@Binyang2014 I've opened a new pull request, #700, to work on those changes. Once the pull request is ready, I'll request review from you.

Binyang2014 and others added 3 commits December 4, 2025 11:36
- [x] Move hash specialization and equality operator from std/global
namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template
parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks


---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Binyang2014 <[email protected]>
Co-authored-by: Binyang Li <[email protected]>
@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

    return std::memcmp(lhs.reserved, rhs.reserved, sizeof(lhs.reserved)) == 0;
  }
};


Should we add "#if defined(HIP_PLATFORM_AMD)" here?

@Binyang2014 (Contributor, Author)


I think these two functions work for both the AMD and NVIDIA platforms. cudaIpcMemHandle_t has a similar structure on both, so we can keep it here.
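
As a rough sketch of what such operators can look like (illustrative names in a custom detail namespace per the commit checklist above; not the PR's exact code), note that both functors only read handle.reserved, which is why they compile against the CUDA type as well as HIP's equivalent without a platform guard:

#include <cuda_runtime.h>

#include <cstdint>
#include <cstring>
#include <memory>
#include <unordered_map>

namespace detail {  // a custom namespace, rather than specializing std::hash

struct CudaIpcMemHandleHash {
  std::size_t operator()(const cudaIpcMemHandle_t& handle) const noexcept {
    // FNV-1a over the opaque reserved bytes; any stable byte-wise hash works.
    std::uint64_t h = 14695981039346656037ull;
    for (unsigned char c : handle.reserved) {
      h = (h ^ c) * 1099511628211ull;
    }
    return static_cast<std::size_t>(h);
  }
};

struct CudaIpcMemHandleEqual {
  bool operator()(const cudaIpcMemHandle_t& lhs, const cudaIpcMemHandle_t& rhs) const noexcept {
    return std::memcmp(lhs.reserved, rhs.reserved, sizeof(lhs.reserved)) == 0;
  }
};

}  // namespace detail

// Passing the functors as template parameters keeps the std and global
// namespaces untouched.
using IpcHandleCacheMap =
    std::unordered_map<cudaIpcMemHandle_t, std::weak_ptr<void>,
                       detail::CudaIpcMemHandleHash, detail::CudaIpcMemHandleEqual>;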


TEST_F(ExecutorTest, TwoNodesAllreduce) {
  if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
    GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";


Any reason this line is removed?

@Binyang2014 (Contributor, Author)


It causes a UT failure. I think gtest doesn't work well with MPI.

Contributor


I haven't seen this causing a failure. Is this specific to this PR?

@Binyang2014 (Contributor, Author)


I think not. I reverted the change (keeping just the customized hash function for cudaIpcHandle) and still see this error. The error is: mpirun noticed that process rank 1 with PID 0 on node mscclpp-01 exited on signal 13 (Broken pipe). It might be caused by the destruction order.
I moved the skip to the SetUp function to mitigate this issue.
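
A minimal sketch of that mitigation, assuming the ExecutorTest fixture and the gEnv/worldSize/nRanksPerNode names from the snippet above (declared elsewhere in the test suite); everything else here is illustrative:

#include <gtest/gtest.h>

class ExecutorTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Skip before any per-test state is created. The thread above suspects the
    // broken-pipe failure relates to destruction order under mpirun, so keeping
    // the skip decision out of the test body sidesteps that path.
    if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
      GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";
    }
  }
};

TEST_F(ExecutorTest, TwoNodesAllreduce) {
  // Runs only when SetUp() did not skip.
}

GTEST_SKIP() issued from SetUp() marks the test as skipped and prevents the test body from running, so no per-test resources are created on ranks that should not participate.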

@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).


@chhwang force-pushed the binyli/handle_cache branch from fcb1ab6 to d97d230 on December 13, 2025 at 00:49
@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@Binyang2014 (Contributor, Author)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).
