Fix memory leak by adding py::return_value_policy::move#7
Open
ehatami65 wants to merge 1 commit intonv-tlabs:mainfrom
Open
Fix memory leak by adding py::return_value_policy::move#7ehatami65 wants to merge 1 commit intonv-tlabs:mainfrom
ehatami65 wants to merge 1 commit intonv-tlabs:mainfrom
Conversation
The pybind11 bindings were missing an explicit return value policy, causing tensor references to not be properly released. This resulted in ~6MB leaked per forward call, matching the output tensor size. Adding py::return_value_policy::move transfers full ownership to Python, allowing proper garbage collection. Test results: - Before: 2377 MB leaked over 200 iterations - After: 4 MB stable (no leak)
Collaborator
|
Thanks a lot for creating this PR! I will test this locally before merging, even though the change seems small. I will probably also add your minimal example to the test suite so we don't regress on this issue later. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The pybind11 bindings were missing an explicit return value policy, causing ~6MB GPU memory to leak per forward call. This led to OOM errors after ~2000 training steps.
Root Cause
Without
py::return_value_policy::move, pybind11 keeps internal references to returned tensors, preventing Python/PyTorch from freeing GPU memory.Fix
Add
py::return_value_policy::moveto bothppisp_forwardandppisp_backwardbindings inext.cpp.Test Results
The leaked ~6MB per call matched exactly the output tensor size (540×960×3×4 bytes), which led to identifying the binding layer as the source.