Fix memory leak by adding py::return_value_policy::move #7

Open
ehatami65 wants to merge 1 commit into nv-tlabs:main from ehatami65:fix-memory-leak

@ehatami65

Problem

The pybind11 bindings were missing an explicit return value policy, which caused about 6 MB of GPU memory to leak on every forward call. Training ran out of memory (OOM) after roughly 2000 steps.

Root Cause

Without py::return_value_policy::move, pybind11 keeps internal references to returned tensors, preventing Python/PyTorch from freeing GPU memory.

Fix

Add py::return_value_policy::move to both ppisp_forward and ppisp_backward bindings in ext.cpp.

Test Results

| Version | Memory after 200 calls |
|---------|------------------------|
| Before  | 2377 MB (leaked)       |
| After   | 4 MB (stable)          |
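
A measurement like the one in the table can be sketched as a small harness that reads a memory counter, calls the function a fixed number of times while discarding the result, and reports the growth. This is only an illustrative sketch, not the script used for this PR: `memory_growth`, `leaky_step`, `clean_step`, and `retained_bytes` are hypothetical names, and the demo uses plain `bytearray` buffers as a stand-in so that it runs without a GPU.

```python
from typing import Callable


def memory_growth(step: Callable[[], object],
                  read_memory: Callable[[], int],
                  n_calls: int = 200) -> int:
    """Return how many bytes read_memory() grew by across n_calls of step()."""
    before = read_memory()
    for _ in range(n_calls):
        step()  # the result is discarded, so it should be freeable right away
    return read_memory() - before


# Stand-in demo (hypothetical, CPU only): a "leaky" step keeps every buffer alive.
_retained = []


def leaky_step():
    _retained.append(bytearray(1024))  # reference is kept, memory is never freed


def clean_step():
    bytearray(1024)  # no reference is kept, memory can be reclaimed


def retained_bytes():
    return sum(len(buf) for buf in _retained)


leak = memory_growth(leaky_step, retained_bytes, n_calls=200)
stable = memory_growth(clean_step, retained_bytes, n_calls=200)
print(leak, stable)
```

With PyTorch on a CUDA device, `read_memory` could be `torch.cuda.memory_allocated` and `step` a lambda that calls the forward binding; a result that grows linearly with `n_calls` points to a leak, while a roughly constant result does not.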

The ~6 MB leaked per call matched the size of one output tensor (540×960×3×4 bytes = 6,220,800 bytes), which pointed to the binding layer as the source.
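
The size quoted above can be checked with a few lines of arithmetic. This is a sketch that assumes the trailing factor of 4 is the size in bytes of one tensor element (presumably float32; the PR does not state the dtype), and the constant names are illustrative.

```python
# Shape and element size as quoted in the PR description.
HEIGHT, WIDTH, CHANNELS = 540, 960, 3
BYTES_PER_ELEMENT = 4  # assumption: one float32 value per element

tensor_bytes = HEIGHT * WIDTH * CHANNELS * BYTES_PER_ELEMENT
tensor_mb = tensor_bytes / 1e6      # decimal megabytes
tensor_mib = tensor_bytes / 2**20   # binary mebibytes

print(tensor_bytes, round(tensor_mb, 2), round(tensor_mib, 2))
# prints: 6220800 6.22 5.93
```

Either unit rounds to the "~6 MB" figure, which is why a leak of one output tensor per call was a natural hypothesis.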

The pybind11 bindings were missing an explicit return value policy,
causing tensor references to not be properly released. This resulted
in ~6MB leaked per forward call, matching the output tensor size.

Adding py::return_value_policy::move transfers full ownership to Python,
allowing proper garbage collection.

Test results:
- Before: 2377 MB leaked over 200 iterations
- After: 4 MB stable (no leak)
@nvibd
Collaborator

nvibd commented Jan 29, 2026

Thanks a lot for creating this PR! I will test this locally before merging, even though the change seems small.

I will probably also add your minimal example to the test suite so we don't regress on this issue later.
