Skip to content

MultiThreadedExecutor: race in spin_until_future_complete() might defer a task exception to next spin #1659

Description

@nadavelkabets

Summary

MultiThreadedExecutor.spin_until_future_complete() can return without propagating the exception of a task that raised during the call. The exception is not lost and surfaces on the next spin call or logged at garbage collection.

First observed as an intermittent failure of test_coroutine_exception_after_await on nightly_linux-aarch64:
https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_release/3543/testReport/junit/rclpy.rclpy.test.test_executor/TestExecutor/test_coroutine_exception_after_await/

Reproduction

I managed to reproduce the bug locally using sys.setswitchinterval(1e-6), forcing the interpreter to perform many context switches. I got 7 failures out of 100 test cycles.

Failure log:

AssertionError: RuntimeError not raised
The following exception was never retrieved: Expected error after await

Root cause

rclpy/rclpy/rclpy/executors.py

Lines 1080 to 1106 in 8ca3134

def _spin_once_impl(
self,
timeout_sec: Optional[Union[float, TimeoutObject]] = None,
wait_condition: Callable[[], bool] = lambda: False
) -> None:
try:
handler, entity, node = self.wait_for_ready_callbacks(
timeout_sec, None, wait_condition)
except ExternalShutdownException:
pass
except ShutdownException:
pass
except TimeoutException:
pass
except ConditionReachedException:
pass
else:
self._executor.submit(handler)
self._futures.append(handler)
with self._futures_lock:
for future in self._futures:
if future.done():
self._futures.remove(future)
future.result() # raise any exceptions
# Yield GIL so executor threads have a chance to run.
os.sched_yield() if hasattr(os, 'sched_yield') else time.sleep(0)

Both SingleThreadedExecutor and MultiThreadedExecutor manage exception propagation via future.result().
While SingleThreadedExecutor checks for exceptions synchronously after executing a handler, MultiThreadedExecutor only checks after dispatching a new handler to the thread pool. The handler will only execute later, when a context switch occurs.

The race in test_coroutine_exception_after_await:

  1. second_fut.set_result(None) reschedules coro_that_raises into _ready_tasks.
  2. spin_until_future_complete(coro_that_raises, timeout_sec=5) enters its outer loop.
  3. _spin_once_impl yields coro_that_raises as a handler, submits it to the thread pool, sweeps _futures (handler not yet done), sched_yield, returns.
  4. Next iteration enters wait_set.wait.
  5. coro_that_raises executes and raises, waking the wait set.
  6. wait_for_ready_callbacks proceeds to its condition() check, sees that the exit condition (coro_that_raises.done) is True, raises ConditionReachedException.
  7. _spin_once_impl catches the ConditionReachedException and returns without checking for exceptions. The _futures sweep that calls future.result() is only in the else: branch and is skipped on this path.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions