fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx#17
Conversation
Two bugs in internal/wait-tx that together caused a steady WebSocket- subscription leak on CometBFT RPC :26657. Observed in production on lumera-devnet-1 val3: ~2562 ESTABLISHED sockets accumulated over 11 days while peer validators (same image, same uptime) sat at ~16. Bug 1 (PRIMARY) — waiter.go:57 The subscriber goroutine was invoked with the OUTER ctx instead of subCtx. When setupDelay (default 5s) fired, the caller returned via the subCtx.Done() arm and fell through to the gRPC poller — but the goroutine kept blocking on <-ch inside subscriber.Wait for the lifetime of the outer ctx. Its 'defer client.Stop()' therefore did not run until the outer ctx expired, which for many sdk-go consumers (network-maker / NM-style batch tasks with no per-call deadline) meant 'never'. Each abandoned goroutine pinned one rpchttp.Client open and therefore one WS conn to lumerad. Fix: pass subCtx (not ctx) to subscriber.Wait. Once setupDelay fires, the goroutine's <-ctx.Done() returns immediately, defers unwind (Unsubscribe + Stop), gorilla.WS conn is closed, server-side socket hits CLOSE_WAIT and is reaped. Bug 2 (SECONDARY) — subscriber.go - subscriberID was a package-level constant 'sdk-go-wait', meaning any future code path that shares an rpchttp.Client across concurrent Subscribe calls would collide on cometbft's client-side query map. Today each Wait() builds its own client so this is latent, but rotate IDs now to keep the design forward-safe for client reuse. - defer client.Unsubscribe(context.Background(), ...) had no timeout. If the server is slow/unresponsive, Unsubscribe could block teardown indefinitely, defeating the whole point of guaranteed Stop(). Bound it to 2s; we proceed to client.Stop() regardless, which closes the socket unconditionally. - Added 'always-stop' defer pattern (anonymous func wrapping client.Stop()) for clarity and to ensure no future refactor accidentally short-circuits the Stop() call. Regression test: internal/wait-tx/waiter_test.go TestWaiterSubscriberReceivesSubCtxNotOuter installs a blocking source that records its incoming ctx and asserts it becomes Done within 10x setupDelay. Without this fix the test FAILS with 'goroutine is bound to outer ctx (leak)'; with the fix it PASSES in ~30ms. Mainnet exposure: HIGH for any validator with public RPC. Any sdk-go consumer that calls WaitForTxInclusion in a retry loop without a per-call deadline will leak WS sockets at ~20%/attempt stickiness, saturating max_open_connections=900 within a couple hours of activity. Risk on this PR: low. Behaviour change only on the timeout/error paths that previously leaked goroutines; happy-path latency is unchanged. No public API change. Refs: ops RCA at /root/lumera/ws-leak-rca.md
mateeullahmalik
left a comment
There was a problem hiding this comment.
Production-gate review — fix(wait-tx): plug WS-subscriber leak
HEAD SHA reviewed: d93da7fcdbf909dcba591422fc94694838c19e3d
Method: code re-read at tip + race-enabled test run + consensus-killer grep catalog + caller-impact audit.
✅ Correctness — fix matches the documented bug shape
waiter.go (Bug 1): the goroutine now receives subCtx instead of the outer ctx. Traced control-flow at tip:
subCtx.Done()fires aftersetupDelay→ callerselectreturns- the spawned goroutine, also bound to
subCtx, seesctx.Done()insidesubscriber.Waiton its next channel-receive - defers unwind LIFO →
Unsubscribe(bounded 2s) →client.Stop()→ gorilla WS closed → server-side socket reaped - buffered
resCh/errCh(cap=1) prevent the goroutine blocking on the send if the caller has already moved on → no goroutine leak
subscriber.go (Bug 2):
newSubscriberID()usesos.Getpid()+atomic.AddUint64(&subscriberSeq, …)— unique per call, no shared-state collision risk- defer ordering verified:
client.Stop()registered first →Unsubscriberegistered second → LIFO runs Unsubscribe then Stop. Correct: must unsubscribe before stopping the client so the unsubscribe RPC isn't sent over a dead conn unsubscribeTimeout = 2sbounds teardown;client.Stop()runs unconditionally on failure → socket closes regardless
✅ Tests — race-clean, regression-targeted
$ go test -race -count=1 -timeout 60s ./internal/wait-tx/...
ok github.com/LumeraProtocol/sdk-go/internal/wait-tx 1.094s
Full suite + vet clean:
$ go test -race -count=1 ./... → all PASS
$ go vet ./... → clean
TestWaiterSubscriberReceivesSubCtxNotOuter correctly distinguishes pre- vs post-fix behaviour (verified by inverting the diff: test fails on master). Test asserts (1) goroutine's recorded ctx fires within 10× setupDelay, (2) goroutine actually unwinds, (3) poller is invoked exactly once. Tight bound (250ms) but generous enough for loaded CI — not flaky-prone.
✅ Caller-impact audit
WaitForTxInclusion is consumed at 11 internal sites (blockchain/supernode.go ×6, blockchain/action.go ×4, ica/controller.go). No public API change (New, Wait, Source signatures all stable). Happy-path latency unchanged (same one round-trip Subscribe, same first-event return). All 11 consumers benefit from the leak fix automatically — no caller changes required.
✅ Consensus-killer grep catalog (per chain-PR safety methodology, applied even though this is client-side)
| Class | Hits on changed surface | Status |
|---|---|---|
| Float arithmetic | 0 | ✓ |
time.Now / wall-clock |
0 (only time.Second literals) |
✓ |
rand.* / non-deterministic |
0 | ✓ |
| Map iteration order | range txev.Result.Events + inner range e.Attributes — but flattens into a client-side return map, zero consensus surface |
✓ |
Panic class (MustGet, MustUnmarshal, etc.) |
0 | ✓ |
| Binary decode without length check | 0 | ✓ |
| Division/modulo with data-driven divisor | 0 | ✓ |
| Slice indexing from data | 0 | ✓ |
| Goroutine / channel | 1 goroutine, buffered channels, bound to subCtx | ✓ no leak |
Zero consensus-state surface (sdk-go is a client library; not in EndBlock / BeginBlock / DeliverTx / InitGenesis path). Cannot halt chain. Worst-case-wrong = unchanged from pre-fix behaviour.
⚠ Observations (non-blocking, follow-up only)
- No log/metric when subscriber falls through to poller. If
:26657is mis-configured or unreachable, the subscriber path silently degrades and only the poller's slower latency is observed. Operationally invisible. Suggest a debug-level log + a counter — but not a merge-blocker; this is identical to pre-fix observability. selectrace on simultaneousresChready +subCtx.Done. When both arms are ready, Go picks randomly; ifsubCtx.Done()wins, the subscriber's valid Result is discarded and we fall through to the poller (which will fetch the same tx). Correct outcome, slight latency cost. Acceptable.- Stop()/Unsubscribe finish slightly after
Wait()returns (background defers in the goroutine). For up to ~2s after return, one WS socket may still be ESTABLISHED. Not a leak — it WILL close. Just a brief tail. CallingWaitForTxInclusionin a tight loop won't exceed1 socket × concurrent-calls— bounded. unsubscribeTimeoutis a package-private constant (2s). Reasonable default; not worth exposing through config unless we see real-world server-side slow-unsubscribe pathologies.
📌 Live-production confirmation
The bug shape is currently re-manifesting on lumera-devnet-1 val3 as we speak:
ss -ltn sport = :26657→ Recv-Q4097, Send-Q4096(listen backlog FULL)- ~4,998 ESTABLISHED conns on :26657 (val1/2/4/5 sit at ~16)
- external port
:26687is now unresponsive (kernel dropping SYNs)
This PR is the actual fix for the live degradation, not a theoretical one.
Verdict
🟢 SAFE TO MERGE.
Zero consensus impact (client library). Real fix to a confirmed, currently-manifesting production bug. Tests are race-clean and regression-protective at the exact bug shape. No caller changes required.
After merge, lumera-uploader needs a rebuild against the new sdk-go tag to pick up the fix on lumera-devnet-1 val3 — and ideally the val3 lumerad container restarted to fully drain the existing 5k accumulated subscribers.
Follow-ups (separate work, already enumerated in PR description):
- lumera-uploader: share one rpchttp.Client per signer (reduces socket churn ~500×)
- lumera chain:
experimental_close_on_slow_client = true+ idle WS timeout in default template as defense-in-depth - integration test in lumera-uploader: submit-N-failing-txs and assert <N lingering sockets
There was a problem hiding this comment.
Pull request overview
This PR fixes a WebSocket subscriber lifecycle leak in internal/wait-tx by ensuring subscriber goroutines are bound to the short setup context and by tightening subscriber teardown behavior.
Changes:
- Passes
subCtxintosubscriber.Waitso abandoned subscriber goroutines unwind aftersetupDelay. - Adds per-call CometBFT subscriber IDs and bounds unsubscribe teardown with a timeout.
- Adds a regression test proving the subscriber goroutine receives the setup context and falls back to the poller.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
internal/wait-tx/waiter.go |
Binds the subscriber goroutine to subCtx to prevent long-lived WS connections after fallback. |
internal/wait-tx/subscriber.go |
Adds unique subscriber IDs and bounded unsubscribe cleanup before stopping the RPC client. |
internal/wait-tx/waiter_test.go |
Adds regression coverage for the subscriber context lifetime and poller fallback behavior. |
fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx
What & why
Two bugs in
internal/wait-txthat together caused a steady WebSocket-subscription leak on CometBFT RPC:26657. Observed in production onlumera-devnet-1val3: ~2,562 ESTABLISHED TCP sockets accumulated over 11 days while peer validators (same image, same uptime for val2/val5) sat at ~16. Only differentiator: val3 ran thenetwork-maker(lumera-uploader) process, which exercisesWaitForTxInclusionat high frequency.Full RCA:
ws-leak-rca.md.Bug 1 (PRIMARY) —
waiter.go:57When
subCtxfired the caller fell through to the gRPC poller — but the goroutine kept blocking on<-chinsidesubscriber.Waitfor the lifetime of the outer ctx. Itsdefer client.Stop()(the only thing that closes the gorilla WS conn) did not run until the outer ctx expired. For many sdk-go consumers (NM-style batch tasks with no per-call deadline) that means never — each abandoned goroutine pinned onerpchttp.Clientopen and therefore one WS conn to lumerad until the parent process restarted.Fix: pass
subCtx(notctx) tosubscriber.Wait. Once setupDelay fires, the goroutine's<-ctx.Done()returns immediately, defers unwind (Unsubscribe + Stop), gorilla WS is closed, server-side socket hits CLOSE_WAIT and is reaped.Bug 2 (SECONDARY) —
subscriber.gosubscriberIDwas the package-level constant"sdk-go-wait". Any future code path that shares onerpchttp.Clientacross concurrentSubscribecalls would collide on cometbft's client-sideWSEvents.subscriptions[query]map. Today eachWait()builds its own client so this is latent, but rotated IDs (sdk-go-wait-<pid>-<seq>) keep the design forward-safe for the planned client-reuse refactor inlumera-uploader.defer client.Unsubscribe(context.Background(), …)had no timeout. A slow/unresponsive server could block teardown indefinitely. Bounded to 2s; we proceed toclient.Stop()regardless (which closes the socket unconditionally).defer func() { _ = client.Stop() }()pattern so no future refactor can accidentally short-circuit the Stop() call.Risk
New()signature,Wait()signature,Sourceinterface, all unchanged.Mainnet exposure
HIGH for any validator exposing RPC publicly. Any sdk-go consumer that calls
WaitForTxInclusionin a retry loop without a per-call deadline will leak WS sockets at ~20%/attempt stickiness (measured rate on val3). At ~1k failed attempts/day per client, a node with defaultmax_open_connections=900saturates within hours of activity.Likely external consumers:
WaitForTxInclusionper scrapeica/controller.go:238Tests
internal/wait-tx/waiter_test.go::TestWaiterSubscriberReceivesSubCtxNotOuterInstalls a
blockingSource(aSourcethat records its incoming ctx and blocks on<-ctx.Done()) and asserts:subscriber.Waitis invokedDonewithin ~10×setupDelay(here 250ms for setupDelay=25ms)Verified the test correctly distinguishes pre- and post-fix behaviour:
Full suite:
Follow-ups (NOT in this PR — separate work)
rpchttp.Clientper signer across the maker lifetime instead of opening one perWaitForTxInclusion. Cuts socket cost from ~1.4k/day → 3 (one per signer). Requires exposingWaitForTxInclusionWithWS(ctx, ws, txHash)in sdk-go.config.tomltemplate — setexperimental_close_on_slow_client = true, add aws_read_wait = "60s"idle-timeout, expose Prometheus subscription open/close counters. Defense-in-depth so a buggy client cannot saturate a node.