Summary
MultiThreadedExecutor.spin_until_future_complete() can return without propagating the exception of a task that raised during the call. The exception is not lost and surfaces on the next spin call or logged at garbage collection.
First observed as an intermittent failure of test_coroutine_exception_after_await on nightly_linux-aarch64:
https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_release/3543/testReport/junit/rclpy.rclpy.test.test_executor/TestExecutor/test_coroutine_exception_after_await/
Reproduction
I managed to reproduce the bug locally using sys.setswitchinterval(1e-6), forcing the interpreter to perform many context switches. I got 7 failures out of 100 test cycles.
Failure log:
AssertionError: RuntimeError not raised
The following exception was never retrieved: Expected error after await
Root cause
|
def _spin_once_impl( |
|
self, |
|
timeout_sec: Optional[Union[float, TimeoutObject]] = None, |
|
wait_condition: Callable[[], bool] = lambda: False |
|
) -> None: |
|
try: |
|
handler, entity, node = self.wait_for_ready_callbacks( |
|
timeout_sec, None, wait_condition) |
|
except ExternalShutdownException: |
|
pass |
|
except ShutdownException: |
|
pass |
|
except TimeoutException: |
|
pass |
|
except ConditionReachedException: |
|
pass |
|
else: |
|
self._executor.submit(handler) |
|
self._futures.append(handler) |
|
with self._futures_lock: |
|
for future in self._futures: |
|
if future.done(): |
|
self._futures.remove(future) |
|
future.result() # raise any exceptions |
|
|
|
# Yield GIL so executor threads have a chance to run. |
|
os.sched_yield() if hasattr(os, 'sched_yield') else time.sleep(0) |
Both SingleThreadedExecutor and MultiThreadedExecutor manage exception propagation via future.result().
While SingleThreadedExecutor checks for exceptions synchronously after executing a handler, MultiThreadedExecutor only checks after dispatching a new handler to the thread pool. The handler will only execute later, when a context switch occurs.
The race in test_coroutine_exception_after_await:
second_fut.set_result(None) reschedules coro_that_raises into _ready_tasks.
spin_until_future_complete(coro_that_raises, timeout_sec=5) enters its outer loop.
_spin_once_impl yields coro_that_raises as a handler, submits it to the thread pool, sweeps _futures (handler not yet done), sched_yield, returns.
- Next iteration enters
wait_set.wait.
coro_that_raises executes and raises, waking the wait set.
wait_for_ready_callbacks proceeds to its condition() check, sees that the exit condition (coro_that_raises.done) is True, raises ConditionReachedException.
_spin_once_impl catches the ConditionReachedException and returns without checking for exceptions. The _futures sweep that calls future.result() is only in the else: branch and is skipped on this path.
Summary
MultiThreadedExecutor.spin_until_future_complete()can return without propagating the exception of a task that raised during the call. The exception is not lost and surfaces on the next spin call or logged at garbage collection.First observed as an intermittent failure of
test_coroutine_exception_after_awaitonnightly_linux-aarch64:https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_release/3543/testReport/junit/rclpy.rclpy.test.test_executor/TestExecutor/test_coroutine_exception_after_await/
Reproduction
I managed to reproduce the bug locally using
sys.setswitchinterval(1e-6), forcing the interpreter to perform many context switches. I got 7 failures out of 100 test cycles.Failure log:
Root cause
rclpy/rclpy/rclpy/executors.py
Lines 1080 to 1106 in 8ca3134
Both SingleThreadedExecutor and MultiThreadedExecutor manage exception propagation via
future.result().While SingleThreadedExecutor checks for exceptions synchronously after executing a handler, MultiThreadedExecutor only checks after dispatching a new handler to the thread pool. The handler will only execute later, when a context switch occurs.
The race in
test_coroutine_exception_after_await:second_fut.set_result(None)reschedulescoro_that_raisesinto_ready_tasks.spin_until_future_complete(coro_that_raises, timeout_sec=5)enters its outer loop._spin_once_implyieldscoro_that_raisesas a handler, submits it to the thread pool, sweeps_futures(handler not yet done),sched_yield, returns.wait_set.wait.coro_that_raisesexecutes and raises, waking the wait set.wait_for_ready_callbacksproceeds to itscondition()check, sees that the exit condition (coro_that_raises.done) is True, raisesConditionReachedException._spin_once_implcatches the ConditionReachedException and returns without checking for exceptions. The_futuressweep that calls future.result() is only in theelse:branch and is skipped on this path.