Skip to content

Fix goroutine leak: cancel socket context after close handlers#3

Open
ertakbuga wants to merge 2 commits intoRobertWHurst:masterfrom
ertakbuga:fix/goroutine-leak-context-canceled
Open

Fix goroutine leak: cancel socket context after close handlers#3
ertakbuga wants to merge 2 commits intoRobertWHurst:masterfrom
ertakbuga:fix/goroutine-leak-context-canceled

Conversation

@ertakbuga
Copy link

Summary

  • Fix goroutine leak when connections close via context.Canceled (the Eurus close path). Handler goroutines blocked in ReceiveInto/ReceiveIntoWithContext wait on socket.Done() which never fires because cancelCtx() is never called after close handlers run.
  • Primary fix (router.go): Call socket.cancelCtx() after HandleClose returns in both HandleConnection and handleWebsocketConnection. This ensures all in-flight handler goroutines are unblocked after close handlers complete.
  • Defense-in-depth (socket.go): Call s.Close() on context.Canceled in HandleNextMessageWithNode for consistency with other error paths (websocket close status already calls Close).
  • Regression test (router_goroutine_leak_test.go): Reproduces the exact Eurus close scenario — a handler blocks on ReceiveInto, connection returns context.Canceled, and verifies the handler goroutine exits (socket context is cancelled).

Context

In production (Applications-Service), the SubscribeEntry WebSocket handler spawns goroutines that block on ReceiveInto waiting for follow-up messages. When the connection closes via Eurus (which returns context.Canceled), HandleNextMessageWithNode returns false without calling socket.Close(). The router then calls socket.HandleClose() which runs close handler callbacks but never calls socket.cancelCtx(). Handler goroutines blocked in ReceiveInto wait on socket.Done() which never fires → permanent goroutine leak (~3,000 goroutines/hour, causing OOMKills).

Changes

File Change
router.go Add socket.cancelCtx() after HandleClose in both HandleConnection and handleWebsocketConnection
socket.go Add s.Close() in HandleNextMessageWithNode for the context.Canceled error path
router_goroutine_leak_test.go New regression test proving handler goroutines are cleaned up on context.Canceled close path

Test plan

  • New test TestHandlerGoroutineCleanupOnContextCanceled passes
  • All existing tests pass (go test ./...)
  • Deploy Applications-Service with updated Velaros to QA, monitor goroutine count (should stay flat instead of climbing)

🤖 Generated with Claude Code

ertakbuga and others added 2 commits March 5, 2026 11:37
When a connection closes via context.Canceled (the Eurus close path),
handler goroutines blocked in ReceiveInto/ReceiveIntoWithContext wait on
socket.Done() which never fires because cancelCtx() is never called.
This causes permanent goroutine leaks (~3k/hour in production).

Two fixes:
1. router.go: call socket.cancelCtx() after HandleClose returns in both
   HandleConnection and handleWebsocketConnection. This is the primary
   fix — it ensures all in-flight handler goroutines are unblocked after
   close handlers complete.

2. socket.go: call s.Close() on context.Canceled in
   HandleNextMessageWithNode for consistency with other error paths.
   This is defense-in-depth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… sync

Two issues that cause data races detected by `go test -race`:

1. Router.Handle/HandleOpen/HandleClose read `subCtx.currentHandlerNode`
   AFTER calling `subCtx.free()`, which returns the context to sync.Pool.
   Another goroutine can Get the same context and write to it concurrently.
   Fix: capture the field before calling free().

2. HandleNextMessageWithNode spawns a goroutine for message handling but
   handleWebsocketConnection proceeds to HandleClose without waiting for
   it to complete. The goroutine's ctx.free() can race with HandleClose's
   contextFromPool(). Fix: track goroutines with sync.WaitGroup on Socket
   and wait before HandleClose.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant