Skip to content

Properly backport #947 to fix callback shutdown deadlock (humble)#1671

Open
dbwls99706 wants to merge 4 commits into
ros2:humblefrom
dbwls99706:fix-callback-shutdown-humble-v2
Open

Properly backport #947 to fix callback shutdown deadlock (humble)#1671
dbwls99706 wants to merge 4 commits into
ros2:humblefrom
dbwls99706:fix-callback-shutdown-humble-v2

Conversation

@dbwls99706

@dbwls99706 dbwls99706 commented May 20, 2026

Copy link
Copy Markdown

Description

Properly backports #947 to humble by reverting the incomplete fix from #1492 and applying #947 directly via cherry-pick.

This PR was redone after @Ryan4253's review, which pointed out that the previous approach (layering an additional fix on top of #1492) left the ineffective guard in place and grew the rolling/humble drift.

Background

#1492 was intended to backport #947, but added an early-return guard at the top of Executor.shutdown() instead of the actual fix. The deadlock path (calling rclpy.shutdown() from within a callback) was not addressed because the first invocation finds _is_shutdown == False, falls through the guard, and then calls _work_tracker.wait(), which waits for the currently-executing callback - the source of the deadlock.

The unit test added in #1492 used executor.spin_once() in a manual loop and never triggered executor.shutdown() from inside a callback, so it did not exercise the deadlock condition.

Credit to @Ryan4253 for the diagnosis in #1646 and the suggestion to use git revert + git cherry-pick.

Changes

This PR contains three commits:

  1. Revert of Fix: deadlock when calling rclpy.shutdown() from callbacks (backport #947) #1492 (8190ea0) - removes the ineffective guard and the misleading test.
  2. Cherry-pick of fix rclpy.shutdown() from hanging when triggered from callback #947 (61ff49e, authored by @ihasdapie) - applies the original upstream fix as-is. _WorkTracker.wait() accepts an is_shutdown parameter, and the wait_for predicate returns immediately when shutdown has been signalled, releasing the calling callback.
  3. Lint fix - removes an extra blank line introduced by the conflict resolution between the cherry-pick and test_not_lose_callback (a Humble-only test).

After this PR, rclpy/rclpy/executors.py and rclpy/test/test_executor.py match the corresponding state on rolling, eliminating drift.

Verification

Reproducer from #1646 (Ubuntu 22.04 / Humble):

Before this PR:

[INFO] [...] Node is running, waiting for timer...
[INFO] [...] Timer triggered, shutting down...
Killed
exit code: 137

After this PR:

[INFO] [...] Node is running, waiting for timer...
[INFO] [...] Timer triggered, shutting down...
DONE
exit code: 0

colcon test --packages-select rclpy passes locally. The only failures (test_clock STEADY_TIME variants) reproduce on the unmodified upstream/humble baseline and are unrelated to this change (WSL environment).

Fixes #1646.
Properly backports #947.

Is this user-facing behavior change?

Yes. Calling rclpy.shutdown() from within a callback (timer, subscription, etc.) on Humble no longer deadlocks the process. This matches the behavior on rolling.

Did you use Generative AI?

Yes - used Claude to assist with drafting the PR description. The fix itself is the upstream commit from #947 (@ihasdapie); the revert + cherry-pick approach was suggested by @Ryan4253; local verification was done by me.

@Ryan4253

Ryan4253 commented May 20, 2026

Copy link
Copy Markdown
  1. We should also revert the changes from Fix: deadlock when calling rclpy.shutdown() from callbacks (backport #947) #1492 since they don't actually resolve the deadlock path
  2. Why don't we simply cherry-pick the original fix from fix rclpy.shutdown() from hanging when triggered from callback #947? The changes in this PR is still a custom fix

I was able to do both very easily in https://github.com/Ryan4253/rclpy/commits/fix-callback-shutdown/ by running the following:

git revert 8190ea0
git cherry-pick -m 1 61ff49e

Given the above, I'm confused about why we're introducing another custom implementation while marketing it as "properly backporting #947" instead of simply applying #947 directly. This:

  1. Creates additional version drift from rolling which makes cherry-picking future changes harder
  2. Increases the chance of regression against an already validated upstream fix, which has already happened once

@dbwls99706

dbwls99706 commented May 20, 2026

Copy link
Copy Markdown
Author

@Ryan4253 You're right - layering on top of #1492 keeps the dead guard and grows the drift. Redoing this PR with git revert + git cherry-pick as suggested. Force-pushing shortly.

@dbwls99706 dbwls99706 force-pushed the fix-callback-shutdown-humble-v2 branch from bf1e719 to 70a77be Compare May 20, 2026 23:27
dbwls99706 and others added 3 commits May 21, 2026 08:28
…ackport ros2#947) (ros2#1492)"

This reverts commit 8190ea0.

Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
fix rclpy.shutdown() from hanging when triggered from callback

Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
The cherry-pick of ros2#947 introduced an extra blank line between
test_not_lose_callback (Humble-only) and the preceding method,
because the conflict resolution kept both methods. Reduce to a
single blank line to satisfy flake8 E303.

Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
@dbwls99706 dbwls99706 force-pushed the fix-callback-shutdown-humble-v2 branch from 70a77be to 9657855 Compare May 20, 2026 23:29

@fujitatomoya fujitatomoya left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm with green CI.

@fujitatomoya

Copy link
Copy Markdown
Collaborator

@dbwls99706 thanks for fixing this up. there has been tic-toc for commits to backport, after all i believe that this is the proper way to backport the fix from the upstream branch to keep the clear git history to track.

@fujitatomoya

Copy link
Copy Markdown
Collaborator

Pulls: #1671
Gist: https://gist.githubusercontent.com/fujitatomoya/8b695df63dc436cc029fa1bbc1ee73fb/raw/3c1bce5c620279e815189e2d03f9659f77ff3754/ros2.repos
BUILD args: --packages-above-and-dependencies rclpy
TEST args: --packages-above rclpy
ROS Distro: humble
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/19336

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Ryan4253

Ryan4253 commented May 22, 2026

Copy link
Copy Markdown

Looks like all the unit tests failed with a deprecation warning seemingly unrelated to the PR

INTERNALERROR> DeprecationWarning: The 'asyncio_mode' default value will change to 'strict' in future, please explicitly use 'asyncio_mode=strict' or 'asyncio_mode=auto' in pytest configuration file.

I did some preliminary investigation and couldn't find anything related to asyncio_mode within the repo. The error points to a newer version of pytest-asyncio being used but more recent CIs seems to run just fine. I can't access any older CI ran on humble so I can't confirm if this is causing a regression.

I can investigate deeper in ~10 hours.

@Ryan4253

Ryan4253 commented May 22, 2026

Copy link
Copy Markdown

Some things I forgot to note:

  1. The unit test name shutdown_executor_from_callback wouldn't get discovered by pytest since it does not start with test_. This is an issue from fix rclpy.shutdown() from hanging when triggered from callback #947 and was fixed upstream in Introduce EventsExecutor implementation #1391. We should probably fix it within this PR directly.

  2. There is a contract violation with the original fix in fix rclpy.shutdown() from hanging when triggered from callback #947. Based on the docs of shutdown it will wait until all workers complete (excerpt from rolling)

    def shutdown(self, timeout_sec: Optional[float] = None) -> bool:
        """
        Stop executing callbacks and wait for their completion.

        :param timeout_sec: Seconds to wait. Block forever if ``None`` or negative.
            Don't wait if 0.
        :return: ``True`` if all outstanding callbacks finished executing, or ``False`` if the
            timeout expires before all outstanding work is done.
        """
        with self._shutdown_lock:
            if not self._is_shutdown:
                self._is_shutdown = True
                # Tell executor it's been shut down
                if self._guard:
                    self._guard.trigger()
        if not self._is_shutdown:
            if not self._work_tracker.wait(timeout_sec):
                return False

After we exit the with self._shutdown_lock: block self._is_shutdown is guaranteed to be true. This means that this block will never execute:

        if not self._is_shutdown:
            if not self._work_tracker.wait(timeout_sec):
                return False

So we never actually wait for all workers to complete which is a contract violation. This is also why this comment exists in shutdown_executor_from_callback on rolling:

    def test_shutdown_executor_from_callback(self) -> None:
        """https://github.com/ros2/rclpy/issues/944: allow for executor shutdown from callback."""
        self.assertIsNotNone(self.node.handle)
        timer_period = 0.1
        # TODO(bmartin427) This seems like an invalid test to me?  executor.shutdown() is
        # documented as blocking until all callbacks are complete, unless you pass a non-negative
        # timeout value which this doesn't.  I'm not sure how that's supposed to *not* deadlock if
        # you block on all callbacks from within a callback.

(the answer is that the documentation doesn't match the behavior and we never wait for all callbacks to complete)

A more concrete fix that respects the contracts would be

  1. Detect if the current context calling shutdown() is in a callback and if yes wait until there are one worker remaining
  2. Require the user to supply a reasonably long enough timeout when shutting down from a callback so all callbacks complete and we just let the timeout to resolve the deadlock

Option 1 is safer for users, but it would require a more substantial code change and still would not perfectly satisfy the contract of waiting for all work to complete. Option 2 would not require code changes beyond reverting #947, but it would diverge from rclcpp behavior.

Either way this is better left as a fix in rolling rather than being addressed in this PR.

The test name from ros2#947 lacked the 'test_' prefix, so pytest did not
discover it. The rename was applied upstream as part of ros2#1391;
cherry-picking that change here so the test actually runs on Humble.

Signed-off-by: dbwls99706 <yujinhong3@gmail.com>
@dbwls99706

Copy link
Copy Markdown
Author

@Ryan4253 Thanks for the careful read. On (1): agreed, added the test_ prefix in a new commit. On (2): agreed it's a rolling-level concern, leaving it out of this PR.

@fujitatomoya the linux / linux-aarch64 failures look unrelated to this PR - ci_linux-rhel passed rclpy 443/443 including the cherry-picked test, and the other runs aborted before producing results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants