Bug report
Required Info:
- Operating System: Ubuntu 22.04 (inside
nvcr.io/nvidia/isaac-sim:5.0.0 container)
- Installation type: binaries (
apt install ros-humble-rclpy)
- Version or commit hash:
rclpy 3.3.21-1jammy.20260421.025639
- DDS implementation:
rmw_cyclonedds_cpp
- Client library: rclpy
- Python: 3.10.12
Steps to reproduce
rclpy/action/server.py:357 does self._result_futures[bytes(goal_uuid)].set_result(result_response) without checking whether the future is still in the dict. Under real-time load, the future can be removed by _execute_expire_goals before _execute_goal reaches that line, raising KeyError and crashing the executor.
Distinct trigger from #1236 (which was sim-time fast clock): wall-clock goal-timeout abort under sustained ActionServer load. Our NavigateToPose action server runs a 20 Hz control loop and aborts goals on elapsed > goal_timeout_s = 1200.0. Two out of six independent ActionServers crashed with this KeyError in a single 64-order multi-robot simulation run.
Minimal repro shape:
- ActionServer running on
MultiThreadedExecutor, ReentrantCallbackGroup
- Long-running
execute_callback that returns after goal_handle.abort() on timeout
- Sustained load — many sequential goals per server, several servers in the process
- Real time (
use_sim_time=False)
Expected behavior
Goal cleanup after abort should not race with the goal-result publish, or the publish should defensively check whether the future still exists before calling set_result.
Actual behavior
[pure_pursuit_driver-6] [WARN] [pure_pursuit_driver]: goal aborted: timeout after 1200.0s (>1200.0s)
[pure_pursuit_driver-6] Traceback (most recent call last):
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/executors.py", line 323, in spin
self.spin_once()
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/executors.py", line 863, in spin_once
self._spin_once_impl(timeout_sec)
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/executors.py", line 860, in _spin_once_impl
future.result()
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/task.py", line 109, in result
raise self.exception()
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/task.py", line 272, in _execute_coroutine_step
result = coro.send(None)
File "/opt/ros/humble/lib/python3.10/site-packages/rclpy/action/server.py", line 357, in _execute_goal
self._result_futures[bytes(goal_uuid)].set_result(result_response)
KeyError: b'S\xe2\xd7\xc6\xfd\xd7@\x02\x98^E\xc4\xe9\xcd4x'
Second concurrent crash in same run, same trace, different uuid (b'\x95\x86\xafj\x8e\xe9E\xb5\x8e\xb7\x0bd/C\xe5\xa5'). Full logs: docs/runs/m7_steady_state_gt/pure_pursuit.log.
Suggested fix
Defensive lookup at action/server.py:357:
# Before
self._result_futures[bytes(goal_uuid)].set_result(result_response)
# After
future = self._result_futures.get(bytes(goal_uuid))
if future is not None and not future.done():
future.set_result(result_response)
else:
self._logger.warn(
f"_execute_goal: result_future for goal "
f"{bytes(goal_uuid).hex()} unavailable (already expired or completed); "
f"skipping set_result"
)
Addresses the symptom. A proper fix likely needs _execute_expire_goals and _execute_goal to coordinate on future ownership (e.g. atomic pop-and-handle rather than independent lookups).
Prior art / related issues
Workaround we shipped (client side)
Wrapping executor.spin() in a KeyError retry loop keeps the node alive but masks the real bug. (Commit.)
while rclpy.ok():
try:
executor.spin()
break
except KeyError as exc:
node.get_logger().warn(
f"executor.spin raised KeyError on goal-result cleanup "
f"(rclpy ActionServer race, goal_uuid={exc!r}); resuming"
)
continue
Happy to put up a PR for the defensive-lookup change if there's interest.
Bug report
Required Info:
nvcr.io/nvidia/isaac-sim:5.0.0container)apt install ros-humble-rclpy)rclpy 3.3.21-1jammy.20260421.025639rmw_cyclonedds_cppSteps to reproduce
rclpy/action/server.py:357doesself._result_futures[bytes(goal_uuid)].set_result(result_response)without checking whether the future is still in the dict. Under real-time load, the future can be removed by_execute_expire_goalsbefore_execute_goalreaches that line, raisingKeyErrorand crashing the executor.Distinct trigger from #1236 (which was sim-time fast clock): wall-clock goal-timeout abort under sustained ActionServer load. Our
NavigateToPoseaction server runs a 20 Hz control loop and aborts goals onelapsed > goal_timeout_s = 1200.0. Two out of six independent ActionServers crashed with thisKeyErrorin a single 64-order multi-robot simulation run.Minimal repro shape:
MultiThreadedExecutor,ReentrantCallbackGroupexecute_callbackthat returns aftergoal_handle.abort()on timeoutuse_sim_time=False)Expected behavior
Goal cleanup after abort should not race with the goal-result publish, or the publish should defensively check whether the future still exists before calling
set_result.Actual behavior
Second concurrent crash in same run, same trace, different uuid (
b'\x95\x86\xafj\x8e\xe9E\xb5\x8e\xb7\x0bd/C\xe5\xa5'). Full logs: docs/runs/m7_steady_state_gt/pure_pursuit.log.Suggested fix
Defensive lookup at
action/server.py:357:Addresses the symptom. A proper fix likely needs
_execute_expire_goalsand_execute_goalto coordinate on future ownership (e.g. atomic pop-and-handle rather than independent lookups).Prior art / related issues
KeyErrorfrom a different trigger (use_sim_time=Truewith sim time accelerated 60x). It was closed as fixed by PR adjust debug message that requested goal is already expired. #1363.Workaround we shipped (client side)
Wrapping
executor.spin()in aKeyErrorretry loop keeps the node alive but masks the real bug. (Commit.)Happy to put up a PR for the defensive-lookup change if there's interest.