Skip to content

test_executor_immediate_shutdown flaky failure in CI #1648

Description

@nadavelkabets

Environment

ROS2 rolling CI for Linux, Linux-aarch64, Linux-rhel and Windows

Description

test_executor_immediate_shutdown fails on CI with:

19:36:08 14: ============================= test session starts ==============================
19:36:08 14: platform linux -- Python 3.12.3, pytest-7.4.4, pluggy-1.4.0
19:36:08 14: cachedir: /home/jenkins-agent/workspace/ci_linux/ws/build/rclpy/ament_cmake_pytest/test_executor/.cache
19:36:08 14: rootdir: /home/jenkins-agent/workspace/ci_linux/ws/src/nadavelkabets/rclpy
19:36:08 14: configfile: pytest.ini
19:36:08 14: plugins: launch-testing-3.9.7, ament-mypy-0.20.4, ament-xmllint-0.20.4, ament-copyright-0.20.4, ament-pep257-0.20.4, ament-flake8-0.20.4, ament-lint-0.20.4, mock-3.12.0, rerunfailures-12.0, repeat-0.9.3, timeout-2.2.0, colcon-core-0.20.1, cov-4.1.0
19:36:08 14: collected 36 items
19:36:08 14: 
19:36:08 14: ../../src/nadavelkabets/rclpy/rclpy/test/test_executor.py .............. [ 38%]
19:36:08 14: .F....................                                                   [100%]
19:36:08 14: 
19:36:08 14: =================================== FAILURES ===================================
19:36:08 14: ________________ TestExecutor.test_executor_immediate_shutdown _________________
19:36:08 14: 
19:36:08 14: self = <test.test_executor.TestExecutor testMethod=test_executor_immediate_shutdown>
19:36:08 14: 
19:36:08 14:     def test_executor_immediate_shutdown(self) -> None:
19:36:08 14:         self.assertIsNotNone(self.node.handle)
19:36:08 14:         for cls in [SingleThreadedExecutor, EventsExecutor]:
19:36:08 14:             with self.subTest(cls=cls):
19:36:08 14:                 executor = cls(context=self.context)
19:36:08 14:                 try:
19:36:08 14:                     got_callback = False
19:36:08 14:     
19:36:08 14:                     def timer_callback() -> None:
19:36:08 14:                         nonlocal got_callback
19:36:08 14:                         got_callback = True
19:36:08 14:     
19:36:08 14:                     timer_period = 1
19:36:08 14:                     tmr = self.node.create_timer(timer_period, timer_callback)
19:36:08 14:     
19:36:08 14:                     self.assertTrue(executor.add_node(self.node))
19:36:08 14:                     t = threading.Thread(target=executor.spin, daemon=True)
19:36:08 14:                     start_time = time.perf_counter()
19:36:08 14:                     t.start()
19:36:08 14:                     executor.shutdown()
19:36:08 14:                     t.join()
19:36:08 14:                     end_time = time.perf_counter()
19:36:08 14:     
19:36:08 14:                     self.node.destroy_timer(tmr)
19:36:08 14: >                   self.assertLess(end_time - start_time, timer_period / 2)
19:36:08 14: E                   AssertionError: 1.0002155330148526 not less than 0.5
19:36:08 14: 
19:36:08 14: ../../src/nadavelkabets/rclpy/rclpy/test/test_executor.py:107: AssertionError
19:36:08 14: - generated xml file: /home/jenkins-agent/workspace/ci_linux/ws/build/rclpy/test_results/rclpy/test_executor.xunit.xml -
19:36:08 14: =========================== short test summary info ============================
19:36:08 14: FAILED ../../src/nadavelkabets/rclpy/rclpy/test/test_executor.py::TestExecutor::test_executor_immediate_shutdown
19:36:08 14: ======================== 1 failed, 35 passed in 26.79s =========================
19:36:08 14: -- run_test.py: return code 1
19:36:08 14: -- run_test.py: verify result file '/home/jenkins-agent/workspace/ci_linux/ws/build/rclpy/test_results/rclpy/test_executor.xunit.xml'
19:36:08 14/59 Test #14: test_executor .....................***Failed   27.54 sec

Reproduction

I built a rolling source checkout and ran the executor test suite 30 times sequentially with 0 failures.
I managed to reproduce the bug with 2 different code alterations.
Probable race condition - calling executor.shutdown() after entering _wait_for_ready_callbacks() but before adding the executor's guard condition to the wait_set.

Forced context switch

I added a time.sleep() call in _wait_for_ready_callbacks() just after entering the main while loop:

def _wait_for_ready_callbacks():
    ...
    while not yielded_work and not self._is_shutdown and not condition():
        time.sleep(0.001) # added line

With this change, the test failed every time.

Inducing more context switches

To simulate a fully loaded CI runner, I decreased the python context switch interval using sys.setswitchinterval(1e-6).
With this change, the test failed 5 times out of 30 sequential runs.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions