Properly backport #947 to fix callback shutdown deadlock (humble)#1671
Properly backport #947 to fix callback shutdown deadlock (humble)#1671dbwls99706 wants to merge 4 commits into
Conversation
I was able to do both very easily in https://github.com/Ryan4253/rclpy/commits/fix-callback-shutdown/ by running the following: Given the above, I'm confused about why we're introducing another custom implementation while marketing it as "properly backporting #947" instead of simply applying #947 directly. This:
|
bf1e719 to
70a77be
Compare
fix rclpy.shutdown() from hanging when triggered from callback Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
The cherry-pick of ros2#947 introduced an extra blank line between test_not_lose_callback (Humble-only) and the preceding method, because the conflict resolution kept both methods. Reduce to a single blank line to satisfy flake8 E303. Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
70a77be to
9657855
Compare
fujitatomoya
left a comment
There was a problem hiding this comment.
lgtm with green CI.
|
@dbwls99706 thanks for fixing this up. there has been tic-toc for commits to backport, after all i believe that this is the proper way to backport the fix from the upstream branch to keep the clear git history to track. |
|
Pulls: #1671 |
|
Looks like all the unit tests failed with a deprecation warning seemingly unrelated to the PR I did some preliminary investigation and couldn't find anything related to I can investigate deeper in ~10 hours. |
|
Some things I forgot to note:
def shutdown(self, timeout_sec: Optional[float] = None) -> bool:
"""
Stop executing callbacks and wait for their completion.
:param timeout_sec: Seconds to wait. Block forever if ``None`` or negative.
Don't wait if 0.
:return: ``True`` if all outstanding callbacks finished executing, or ``False`` if the
timeout expires before all outstanding work is done.
"""
with self._shutdown_lock:
if not self._is_shutdown:
self._is_shutdown = True
# Tell executor it's been shut down
if self._guard:
self._guard.trigger()
if not self._is_shutdown:
if not self._work_tracker.wait(timeout_sec):
return FalseAfter we exit the if not self._is_shutdown:
if not self._work_tracker.wait(timeout_sec):
return FalseSo we never actually wait for all workers to complete which is a contract violation. This is also why this comment exists in def test_shutdown_executor_from_callback(self) -> None:
"""https://github.com/ros2/rclpy/issues/944: allow for executor shutdown from callback."""
self.assertIsNotNone(self.node.handle)
timer_period = 0.1
# TODO(bmartin427) This seems like an invalid test to me? executor.shutdown() is
# documented as blocking until all callbacks are complete, unless you pass a non-negative
# timeout value which this doesn't. I'm not sure how that's supposed to *not* deadlock if
# you block on all callbacks from within a callback.(the answer is that the documentation doesn't match the behavior and we never wait for all callbacks to complete) A more concrete fix that respects the contracts would be
Option 1 is safer for users, but it would require a more substantial code change and still would not perfectly satisfy the contract of waiting for all work to complete. Option 2 would not require code changes beyond reverting #947, but it would diverge from rclcpp behavior. Either way this is better left as a fix in rolling rather than being addressed in this PR. |
|
@Ryan4253 Thanks for the careful read. On (1): agreed, added the @fujitatomoya the linux / linux-aarch64 failures look unrelated to this PR - |
Description
Properly backports #947 to humble by reverting the incomplete fix from #1492 and applying #947 directly via cherry-pick.
This PR was redone after @Ryan4253's review, which pointed out that the previous approach (layering an additional fix on top of #1492) left the ineffective guard in place and grew the rolling/humble drift.
Background
#1492 was intended to backport #947, but added an early-return guard at the top of
Executor.shutdown()instead of the actual fix. The deadlock path (callingrclpy.shutdown()from within a callback) was not addressed because the first invocation finds_is_shutdown == False, falls through the guard, and then calls_work_tracker.wait(), which waits for the currently-executing callback - the source of the deadlock.The unit test added in #1492 used
executor.spin_once()in a manual loop and never triggeredexecutor.shutdown()from inside a callback, so it did not exercise the deadlock condition.Credit to @Ryan4253 for the diagnosis in #1646 and the suggestion to use
git revert + git cherry-pick.Changes
This PR contains three commits:
8190ea0) - removes the ineffective guard and the misleading test.61ff49e, authored by @ihasdapie) - applies the original upstream fix as-is._WorkTracker.wait()accepts anis_shutdownparameter, and thewait_forpredicate returns immediately when shutdown has been signalled, releasing the calling callback.test_not_lose_callback(a Humble-only test).After this PR,
rclpy/rclpy/executors.pyandrclpy/test/test_executor.pymatch the corresponding state on rolling, eliminating drift.Verification
Reproducer from #1646 (Ubuntu 22.04 / Humble):
Before this PR:
After this PR:
colcon test --packages-select rclpypasses locally. The only failures (test_clockSTEADY_TIMEvariants) reproduce on the unmodified upstream/humble baseline and are unrelated to this change (WSL environment).Fixes #1646.
Properly backports #947.
Is this user-facing behavior change?
Yes. Calling
rclpy.shutdown()from within a callback (timer, subscription, etc.) on Humble no longer deadlocks the process. This matches the behavior on rolling.Did you use Generative AI?
Yes - used Claude to assist with drafting the PR description. The fix itself is the upstream commit from #947 (@ihasdapie); the
revert + cherry-pickapproach was suggested by @Ryan4253; local verification was done by me.