Add handle cache for AMD platform #698
base: main
Conversation
Pull request overview
This PR introduces handle caching for the AMD platform to prevent reaching IPC handle limits when multiple processes open the same handles. NVIDIA GPUs handle reference counting internally, so this optimization is AMD-specific. The implementation uses a thread-safe cache with weak pointers to automatically reuse and release handles.
Key Changes:
- Adds custom hash and equality operators for
cudaIpcMemHandle_tto enable use instd::unordered_map - Implements
getPeerMemoryHandle()function with AMD-specific caching using weak pointers and mutex protection - Refactors
RegisteredMemory::Implto usestd::shared_ptrfor automatic IPC handle lifetime management
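To make the described flow concrete, here is a minimal sketch of the caching pattern, assuming hypothetical names for the hash/equality functors and the cache itself (getPeerMemoryHandle is the function named in this PR); error handling and the AMD-only compile guard are omitted, so treat it as an illustration rather than the code in src/registered_memory.cc.

```cpp
// Illustrative sketch only -- identifiers other than getPeerMemoryHandle and
// cudaIpcMemHandle_t are made up for this example; error handling is omitted.
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <unordered_map>

#include <cuda_runtime.h>  // cudaIpcMemHandle_t / cudaIpcOpenMemHandle (HIP equivalents on AMD)

namespace detail {
// The handle is an opaque byte blob, so hash and compare its raw bytes.
struct IpcHandleHash {
  std::size_t operator()(const cudaIpcMemHandle_t& h) const noexcept {
    std::size_t seed = 0;
    for (unsigned char c : h.reserved) seed = seed * 131 + c;
    return seed;
  }
};
struct IpcHandleEqual {
  bool operator()(const cudaIpcMemHandle_t& a, const cudaIpcMemHandle_t& b) const noexcept {
    return std::memcmp(a.reserved, b.reserved, sizeof(a.reserved)) == 0;
  }
};
}  // namespace detail

// Cache of already-opened IPC handles; weak_ptr entries expire once every
// shared_ptr owner (i.e. every RegisteredMemory using the mapping) is gone.
// In the PR this path is enabled only on the AMD (HIP) build.
static std::mutex gHandleCacheMutex;
static std::unordered_map<cudaIpcMemHandle_t, std::weak_ptr<void>, detail::IpcHandleHash,
                          detail::IpcHandleEqual>
    gHandleCache;

std::shared_ptr<void> getPeerMemoryHandle(const cudaIpcMemHandle_t& handle) {
  std::lock_guard<std::mutex> lock(gHandleCacheMutex);
  auto it = gHandleCache.find(handle);
  if (it != gHandleCache.end()) {
    if (auto cached = it->second.lock()) return cached;  // reuse the existing mapping
  }
  void* devPtr = nullptr;
  cudaIpcOpenMemHandle(&devPtr, handle, cudaIpcMemLazyEnablePeerAccess);
  // Closing happens in the shared_ptr deleter, replacing the manual cleanup
  // previously done in the destructor.
  std::shared_ptr<void> opened(devPtr, [](void* p) { cudaIpcCloseMemHandle(p); });
  gHandleCache[handle] = opened;
  return opened;
}
```

With this shape, two RegisteredMemory objects in the same process that receive the same exported handle share a single opened mapping instead of consuming two IPC handle slots, which is the handle-limit problem the PR targets.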
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/registered_memory.cc | Adds hash/equality operators for cudaIpcMemHandle_t, implements getPeerMemoryHandle with AMD-specific caching, updates constructor to use cached handles, and removes manual IPC handle cleanup from destructor |
| src/include/registered_memory.hpp | Adds peerHandle field to RegisteredMemory::Impl for managing IPC handle lifetime via shared_ptr |
@Binyang2014 I've opened a new pull request, #699, to work on those changes. Once the pull request is ready, I'll request review from you.

@Binyang2014 I've opened a new pull request, #700, to work on those changes. Once the pull request is ready, I'll request review from you.
Co-authored-by: Copilot <[email protected]>
- [x] Move hash specialization and equality operator from std/global namespace to custom namespace
- [x] Update unordered_map to use custom hash and equality as template parameters
- [x] Add noexcept to equality operator
- [x] Verify the changes build correctly
- [x] Run code review and security checks

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Binyang2014 <[email protected]>
Co-authored-by: Binyang Li <[email protected]>
/azp run

Azure Pipelines successfully started running 3 pipeline(s).
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
/azp run

Azure Pipelines successfully started running 3 pipeline(s).
```cpp
    return std::memcmp(lhs.reserved, rhs.reserved, sizeof(lhs.reserved)) == 0;
  }
};
```
Should we add `#if defined(HIP_PLATFORM_AMD)` here?
I think these two functions work for both the AMD and NVIDIA platforms. cudaIpcMemHandle_t has a similar structure on both, so we can keep them here.
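This reasoning rests on both runtimes exposing the IPC handle as a trivially copyable struct wrapping a fixed-size opaque byte array (64 bytes in the CUDA/HIP headers I'm aware of). A small compile-time statement of that assumption, as a sketch rather than code from the PR; verify against the actual headers in use:

```cpp
// Sketch: state the layout assumption behind sharing the hash/equality functors
// across backends. Sizes and traits should be checked against the real headers.
#include <type_traits>

#include <cuda_runtime.h>  // on the AMD build the analogous type is hipIpcMemHandle_t

static_assert(std::is_trivially_copyable<cudaIpcMemHandle_t>::value,
              "IPC handle must be safe to hash/compare as raw bytes");
static_assert(sizeof(cudaIpcMemHandle_t::reserved) == 64,
              "IPC handle payload is expected to be a fixed-size opaque blob");
```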
```cpp
TEST_F(ExecutorTest, TwoNodesAllreduce) {
  if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
    GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";
```
Any reason this line was removed?
It causes a UT failure. I think gtest doesn't work well with MPI.
I haven't seen this causing a failure. Is this specific to this PR?
I think not. I reverted the change (keeping just the customized hash function for cudaIpcHandle) and still see this error: mpirun noticed that process rank 1 with PID 0 on node mscclpp-01 exited on signal 13 (Broken pipe). It might be caused by the order of destruction.
Moved the skip to the SetUp function to mitigate this issue.
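For reference, a minimal sketch of what moving the skip into SetUp() looks like with GoogleTest; the TestEnv/gEnv definitions below are stand-ins for the MPI-backed environment used by the mscclpp executor tests, not the actual code.

```cpp
#include <gtest/gtest.h>

// Stand-in for the MPI-backed test environment used by the executor tests.
struct TestEnv {
  int worldSize = 1;
  int nRanksPerNode = 1;
};
static TestEnv* gEnv = new TestEnv();

class ExecutorTest : public ::testing::Test {
 protected:
  void SetUp() override {
    // Skipping from SetUp() prevents both the remaining fixture setup and the
    // test body from running when the MPI layout doesn't match.
    if (gEnv->worldSize != 2 || gEnv->nRanksPerNode != 2) {
      GTEST_SKIP() << "This test requires world size to be 2 and ranks per node to be 2";
    }
  }
};

TEST_F(ExecutorTest, TwoNodesAllreduce) {
  // ... allreduce body unchanged from the PR ...
}
```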
/azp run

Azure Pipelines successfully started running 3 pipeline(s).
fcb1ab6 to d97d230
/azp run

Azure Pipelines successfully started running 3 pipeline(s).

/azp run

Azure Pipelines successfully started running 3 pipeline(s).
Introduce a handle cache for the AMD platform.
Avoid reaching the handle limit if we open too many IPC handles.
For NVIDIA, we don't need this feature since the driver counts handle references internally and reuses the same handle if it has already been opened.