fix: add 30s timeout to WebRTC connection_ready.wait() to prevent indefinite hang#1446
fix: add 30s timeout to WebRTC connection_ready.wait() to prevent indefinite hang#1446hculap wants to merge 2 commits intodimensionalOS:devfrom
Conversation
…efinite hang Blueprint.build() and direct UnitreeWebRTCConnection() hang forever when the WebRTC SDP exchange fails (port closed, another client connected, robot unreachable). connection_ready.wait() has no timeout, blocking the subprocess/main thread indefinitely with no error message. Add a 30-second timeout with thread cleanup and a descriptive TimeoutError listing common causes and troubleshooting steps. Common trigger: Go2 only allows one WebRTC client at a time. If the Unitree mobile app is connected, the SDP exchange returns 'reject' and the event never fires. Error now surfaces immediately instead of hanging forever. Tested: Go2 Pro, stock firmware, STA mode (home WiFi), DimOS v0.0.10.post2
Greptile SummaryThis PR fixes a real and well-documented hang in Key findings:
Confidence Score: 3/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant connect()
participant BackgroundThread
participant asyncio Loop
participant Robot (WebRTC)
Caller->>connect(): call connect()
connect()->>asyncio Loop: new_event_loop()
connect()->>BackgroundThread: thread.start()
BackgroundThread->>asyncio Loop: create_task(async_connect())
BackgroundThread->>asyncio Loop: loop.run_forever()
asyncio Loop->>Robot (WebRTC): conn.connect() (SDP exchange)
alt Connection succeeds (happy path)
Robot (WebRTC)-->>asyncio Loop: SDP accepted
asyncio Loop->>connect(): connection_ready.set()
connect()->>Caller: returns normally
else Timeout after 30 s (e.g. SDP rejected / unreachable)
connect()->>asyncio Loop: call_soon_threadsafe(loop.stop)
Note over connect(): ⚠ task NOT cancelled<br/>⚠ conn.disconnect() NOT called
connect()->>BackgroundThread: thread.join(timeout=2.0)
Note over connect(): thread may still be alive if join times out
connect()->>Caller: raises TimeoutError
Note over Robot (WebRTC): WebRTC session may still be<br/>open on the robot side
end
Last reviewed commit: 18f76d5 |
| if not self.connection_ready.wait(timeout=CONNECT_TIMEOUT): | ||
| if self.loop.is_running(): | ||
| self.loop.call_soon_threadsafe(self.loop.stop) | ||
| if self.thread.is_alive(): | ||
| self.thread.join(timeout=2.0) | ||
| raise TimeoutError( | ||
| f"WebRTC connection to {self.ip} timed out after {CONNECT_TIMEOUT}s. " | ||
| "Common causes:\n" | ||
| " - Another WebRTC client is connected (close the Unitree mobile app)\n" | ||
| " - Robot is unreachable on the network\n" | ||
| " - Port 9991 (encrypted SDP) is not responding\n" | ||
| "Tip: only one WebRTC client can connect to the Go2 at a time." | ||
| ) |
There was a problem hiding this comment.
Missing conn.disconnect() and task.cancel() in timeout cleanup
When the timeout fires, the cleanup stops the event loop but never cancels the pending async_connect() task or calls self.conn.disconnect(). Compare this with the existing stop() and disconnect() methods, which always cancel the task and schedule await self.conn.disconnect() before stopping the loop.
Without calling self.conn.disconnect(), the robot-side WebRTC peer connection may remain open or in a "connecting" state, potentially blocking the very retry attempt the user will make after seeing the TimeoutError — the same root cause as the original bug. It can also emit a Python runtime warning: Task was destroyed but it is pending!.
The cleanup block should mirror the pattern used in stop():
if not self.connection_ready.wait(timeout=CONNECT_TIMEOUT):
if self.loop.is_running():
if self.task is not None:
self.loop.call_soon_threadsafe(self.task.cancel)
async def async_disconnect() -> None:
try:
await self.conn.disconnect()
except Exception:
pass
asyncio.run_coroutine_threadsafe(async_disconnect(), self.loop)
self.loop.call_soon_threadsafe(self.loop.stop)
if self.thread.is_alive():
self.thread.join(timeout=2.0)
raise TimeoutError(...)| self.task = self.loop.create_task(async_connect()) | ||
| self.loop.run_forever() | ||
|
|
||
| CONNECT_TIMEOUT = 30 # seconds |
There was a problem hiding this comment.
CONNECT_TIMEOUT should be a class-level attribute
Defining the timeout as a method-local constant makes it invisible at the class API level and impossible for callers to customise without patching the source. Users on slower or congested networks might need a longer window; users in CI want a shorter one.
A class attribute is both more discoverable and overridable:
| CONNECT_TIMEOUT = 30 # seconds | |
| CONNECT_TIMEOUT = 30 # seconds — override on the class or instance if needed |
Or better, promote it to the class body:
class UnitreeWebRTCConnection(Resource):
CONNECT_TIMEOUT: int = 30 # seconds
...and reference it as self.CONNECT_TIMEOUT inside connect().
| if self.thread.is_alive(): | ||
| self.thread.join(timeout=2.0) |
There was a problem hiding this comment.
Background thread may still be alive when TimeoutError is raised
thread.join(timeout=2.0) is non-blocking past its deadline — if the thread has not stopped within 2 s (e.g. a blocking C-extension call inside conn.disconnect()), execution falls through silently and TimeoutError is raised while the daemon thread is still running.
If a caller catches the TimeoutError and immediately tries to reconnect, the still-alive thread sharing self.loop / self.conn could cause a race condition. Consider logging a warning when the join times out:
if self.thread.is_alive():
self.thread.join(timeout=2.0)
if self.thread.is_alive():
import warnings
warnings.warn(
"Background WebRTC thread did not stop within 2s after timeout cleanup.",
RuntimeWarning,
stacklevel=2,
)Address review feedback: the timeout cleanup path was missing task.cancel() and conn.disconnect(), which could leave a half-open WebRTC session on the robot side. A subsequent connect() call would then time out again for the same reason, recreating the original infinite-hang problem. Cleanup sequence on timeout: 1. Cancel the async_connect task (CancelledError handled) 2. Call conn.disconnect() if conn was partially initialised 3. Stop the event loop 4. Join the background thread (2s timeout) 5. Raise TimeoutError with diagnostic message
|
Thanks for the review! Good catch — pushed a fixup commit (2aefe0c) that adds |
|
Hey @lesh-dimensionalOS 👋 — gentle ping on this one. You confirmed the fix resolves the robot motion issue back in early March. Any chance this can get reviewed and merged into |
Problem
Blueprint.build()and directUnitreeWebRTCConnection(ip)hang indefinitely when the WebRTC SDP exchange fails. The root cause isconnection_ready.wait()inconnect()has no timeout — if the connection never establishes, the event is never set and the thread blocks forever with no error.Common trigger: Go2 only allows one WebRTC client at a time. If the Unitree mobile app is open, the SDP exchange returns
"sdp": "reject"and the event never fires. The process appears to hang with no indication of what went wrong.Fix
Add a 30-second timeout with thread cleanup and a descriptive
TimeoutErrorlisting common causes:Testing
Tested on: Go2 Pro, stock firmware, STA mode (home WiFi), DimOS v0.0.10.post2, macOS Apple Silicon.
TimeoutErrorafter 30s with helpful messageTimeoutErrorafter 30sCloses #1438 (previous PR targeting main — same fix, retargeted to dev as requested).