Properly backport #947 to fix callback shutdown deadlock (humble) by dbwls99706 · Pull Request #1671 · ros2/rclpy

dbwls99706 · 2026-05-20T13:19:36Z

Description

Properly backports #947 to humble by reverting the incomplete fix from #1492 and applying #947 directly via cherry-pick.

This PR was redone after @Ryan4253's review, which pointed out that the previous approach (layering an additional fix on top of #1492) left the ineffective guard in place and grew the rolling/humble drift.

Background

#1492 was intended to backport #947, but added an early-return guard at the top of Executor.shutdown() instead of the actual fix. The deadlock path (calling rclpy.shutdown() from within a callback) was not addressed because the first invocation finds _is_shutdown == False, falls through the guard, and then calls _work_tracker.wait(), which waits for the currently-executing callback - the source of the deadlock.

The unit test added in #1492 used executor.spin_once() in a manual loop and never triggered executor.shutdown() from inside a callback, so it did not exercise the deadlock condition.

Credit to @Ryan4253 for the diagnosis in #1646 and the suggestion to use git revert + git cherry-pick.

Changes

This PR contains three commits:

Revert of Fix: deadlock when calling rclpy.shutdown() from callbacks (backport #947) #1492 (8190ea0) - removes the ineffective guard and the misleading test.
Cherry-pick of fix rclpy.shutdown() from hanging when triggered from callback #947 (61ff49e, authored by @ihasdapie) - applies the original upstream fix as-is. _WorkTracker.wait() accepts an is_shutdown parameter, and the wait_for predicate returns immediately when shutdown has been signalled, releasing the calling callback.
Lint fix - removes an extra blank line introduced by the conflict resolution between the cherry-pick and test_not_lose_callback (a Humble-only test).

After this PR, rclpy/rclpy/executors.py and rclpy/test/test_executor.py match the corresponding state on rolling, eliminating drift.

Verification

Reproducer from #1646 (Ubuntu 22.04 / Humble):

Before this PR:

[INFO] [...] Node is running, waiting for timer...
[INFO] [...] Timer triggered, shutting down...
Killed
exit code: 137

After this PR:

[INFO] [...] Node is running, waiting for timer...
[INFO] [...] Timer triggered, shutting down...
DONE
exit code: 0

colcon test --packages-select rclpy passes locally. The only failures (test_clock STEADY_TIME variants) reproduce on the unmodified upstream/humble baseline and are unrelated to this change (WSL environment).

Fixes #1646.
Properly backports #947.

Is this user-facing behavior change?

Yes. Calling rclpy.shutdown() from within a callback (timer, subscription, etc.) on Humble no longer deadlocks the process. This matches the behavior on rolling.

Did you use Generative AI?

Yes - used Claude to assist with drafting the PR description. The fix itself is the upstream commit from #947 (@ihasdapie); the revert + cherry-pick approach was suggested by @Ryan4253; local verification was done by me.

Ryan4253 · 2026-05-20T22:53:38Z

We should also revert the changes from Fix: deadlock when calling rclpy.shutdown() from callbacks (backport #947) #1492 since they don't actually resolve the deadlock path
Why don't we simply cherry-pick the original fix from fix rclpy.shutdown() from hanging when triggered from callback #947? The changes in this PR is still a custom fix

I was able to do both very easily in https://github.com/Ryan4253/rclpy/commits/fix-callback-shutdown/ by running the following:

git revert 8190ea0
git cherry-pick -m 1 61ff49e

Given the above, I'm confused about why we're introducing another custom implementation while marketing it as "properly backporting #947" instead of simply applying #947 directly. This:

Creates additional version drift from rolling which makes cherry-picking future changes harder
Increases the chance of regression against an already validated upstream fix, which has already happened once

dbwls99706 · 2026-05-20T23:26:50Z

@Ryan4253 You're right - layering on top of #1492 keeps the dead guard and grows the drift. Redoing this PR with git revert + git cherry-pick as suggested. Force-pushing shortly.

…ackport ros2#947) (ros2#1492)" This reverts commit 8190ea0. Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

fix rclpy.shutdown() from hanging when triggered from callback Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

The cherry-pick of ros2#947 introduced an extra blank line between test_not_lose_callback (Humble-only) and the preceding method, because the conflict resolution kept both methods. Reduce to a single blank line to satisfy flake8 E303. Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

fujitatomoya

lgtm with green CI.

fujitatomoya · 2026-05-21T02:12:48Z

@dbwls99706 thanks for fixing this up. there has been tic-toc for commits to backport, after all i believe that this is the proper way to backport the fix from the upstream branch to keep the clear git history to track.

fujitatomoya · 2026-05-21T02:14:11Z

Pulls: #1671
Gist: https://gist.githubusercontent.com/fujitatomoya/8b695df63dc436cc029fa1bbc1ee73fb/raw/3c1bce5c620279e815189e2d03f9659f77ff3754/ros2.repos
BUILD args: --packages-above-and-dependencies rclpy
TEST args: --packages-above rclpy
ROS Distro: humble
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/19336

Linux
Linux-aarch64
Linux-rhel
Windows

Ryan4253 · 2026-05-22T18:19:11Z

Looks like all the unit tests failed with a deprecation warning seemingly unrelated to the PR

INTERNALERROR> DeprecationWarning: The 'asyncio_mode' default value will change to 'strict' in future, please explicitly use 'asyncio_mode=strict' or 'asyncio_mode=auto' in pytest configuration file.

I did some preliminary investigation and couldn't find anything related to asyncio_mode within the repo. The error points to a newer version of pytest-asyncio being used but more recent CIs seems to run just fine. I can't access any older CI ran on humble so I can't confirm if this is causing a regression.

I can investigate deeper in ~10 hours.

Ryan4253 · 2026-05-22T19:31:02Z

Some things I forgot to note:

The unit test name shutdown_executor_from_callback wouldn't get discovered by pytest since it does not start with test_. This is an issue from fix rclpy.shutdown() from hanging when triggered from callback #947 and was fixed upstream in Introduce EventsExecutor implementation #1391. We should probably fix it within this PR directly.
There is a contract violation with the original fix in fix rclpy.shutdown() from hanging when triggered from callback #947. Based on the docs of shutdown it will wait until all workers complete (excerpt from rolling)

    def shutdown(self, timeout_sec: Optional[float] = None) -> bool:
        """
        Stop executing callbacks and wait for their completion.

        :param timeout_sec: Seconds to wait. Block forever if ``None`` or negative.
            Don't wait if 0.
        :return: ``True`` if all outstanding callbacks finished executing, or ``False`` if the
            timeout expires before all outstanding work is done.
        """
        with self._shutdown_lock:
            if not self._is_shutdown:
                self._is_shutdown = True
                # Tell executor it's been shut down
                if self._guard:
                    self._guard.trigger()
        if not self._is_shutdown:
            if not self._work_tracker.wait(timeout_sec):
                return False

After we exit the with self._shutdown_lock: block self._is_shutdown is guaranteed to be true. This means that this block will never execute:

        if not self._is_shutdown:
            if not self._work_tracker.wait(timeout_sec):
                return False

So we never actually wait for all workers to complete which is a contract violation. This is also why this comment exists in shutdown_executor_from_callback on rolling:

    def test_shutdown_executor_from_callback(self) -> None:
        """https://github.com/ros2/rclpy/issues/944: allow for executor shutdown from callback."""
        self.assertIsNotNone(self.node.handle)
        timer_period = 0.1
        # TODO(bmartin427) This seems like an invalid test to me?  executor.shutdown() is
        # documented as blocking until all callbacks are complete, unless you pass a non-negative
        # timeout value which this doesn't.  I'm not sure how that's supposed to *not* deadlock if
        # you block on all callbacks from within a callback.

(the answer is that the documentation doesn't match the behavior and we never wait for all callbacks to complete)

A more concrete fix that respects the contracts would be

Detect if the current context calling shutdown() is in a callback and if yes wait until there are one worker remaining
Require the user to supply a reasonably long enough timeout when shutting down from a callback so all callbacks complete and we just let the timeout to resolve the deadlock

Option 1 is safer for users, but it would require a more substantial code change and still would not perfectly satisfy the contract of waiting for all work to complete. Option 2 would not require code changes beyond reverting #947, but it would diverge from rclcpp behavior.

Either way this is better left as a fix in rolling rather than being addressed in this PR.

The test name from ros2#947 lacked the 'test_' prefix, so pytest did not discover it. The rename was applied upstream as part of ros2#1391; cherry-picking that change here so the test actually runs on Humble. Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

dbwls99706 · 2026-05-30T09:05:51Z

@Ryan4253 Thanks for the careful read. On (1): agreed, added the test_ prefix in a new commit. On (2): agreed it's a rolling-level concern, leaving it out of this PR.

@fujitatomoya the linux / linux-aarch64 failures look unrelated to this PR - ci_linux-rhel passed rclpy 443/443 including the cherry-picked test, and the other runs aborted before producing results.

dbwls99706 mentioned this pull request May 20, 2026

rclpy.shutdown() from callback still deadlocks in Humble #1646

Open

dbwls99706 force-pushed the fix-callback-shutdown-humble-v2 branch from bf1e719 to 70a77be Compare May 20, 2026 23:27

dbwls99706 and others added 3 commits May 21, 2026 08:28

Revert "Fix: deadlock when calling rclpy.shutdown() from callbacks (b…

4aedac1

…ackport ros2#947) (ros2#1492)" This reverts commit 8190ea0. Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

Merge pull request ros2#947 from ros2/brianc/rclpy_callback_shutdown

6a217fa

fix rclpy.shutdown() from hanging when triggered from callback Signed-off-by: dbwls99706 <yujinhong3@gmail.com>

dbwls99706 force-pushed the fix-callback-shutdown-humble-v2 branch from 70a77be to 9657855 Compare May 20, 2026 23:29

fujitatomoya approved these changes May 21, 2026

View reviewed changes

christophebedard assigned fujitatomoya May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly backport #947 to fix callback shutdown deadlock (humble)#1671

Properly backport #947 to fix callback shutdown deadlock (humble)#1671
dbwls99706 wants to merge 4 commits into
ros2:humblefrom
dbwls99706:fix-callback-shutdown-humble-v2

dbwls99706 commented May 20, 2026 •

edited

Loading

Uh oh!

Ryan4253 commented May 20, 2026 •

edited

Loading

Uh oh!

dbwls99706 commented May 20, 2026 •

edited

Loading

Uh oh!

fujitatomoya left a comment

Uh oh!

fujitatomoya commented May 21, 2026

Uh oh!

fujitatomoya commented May 21, 2026

Uh oh!

Ryan4253 commented May 22, 2026 •

edited

Loading

Uh oh!

Ryan4253 commented May 22, 2026 •

edited

Loading

Uh oh!

dbwls99706 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

dbwls99706 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Background

Changes

Verification

Is this user-facing behavior change?

Did you use Generative AI?

Uh oh!

Ryan4253 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dbwls99706 commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fujitatomoya left a comment

Choose a reason for hiding this comment

Uh oh!

fujitatomoya commented May 21, 2026

Uh oh!

fujitatomoya commented May 21, 2026

Uh oh!

Ryan4253 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ryan4253 commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dbwls99706 commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

dbwls99706 commented May 20, 2026 •

edited

Loading

Ryan4253 commented May 20, 2026 •

edited

Loading

dbwls99706 commented May 20, 2026 •

edited

Loading

Ryan4253 commented May 22, 2026 •

edited

Loading

Ryan4253 commented May 22, 2026 •

edited

Loading