Skip to content

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx#17

Merged
mateeullahmalik merged 1 commit into
mainfrom
fix/wait-tx-subscriber-leak
May 29, 2026
Merged

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx#17
mateeullahmalik merged 1 commit into
mainfrom
fix/wait-tx-subscriber-leak

Conversation

@mateeullahmalik
Copy link
Copy Markdown
Contributor

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx

What & why

Two bugs in internal/wait-tx that together caused a steady WebSocket-subscription leak on CometBFT RPC :26657. Observed in production on lumera-devnet-1 val3: ~2,562 ESTABLISHED TCP sockets accumulated over 11 days while peer validators (same image, same uptime for val2/val5) sat at ~16. Only differentiator: val3 ran the network-maker (lumera-uploader) process, which exercises WaitForTxInclusion at high frequency.

Full RCA: ws-leak-rca.md.

Bug 1 (PRIMARY) — waiter.go:57

subCtx, cancel := context.WithTimeout(ctx, w.setupDelay)
defer cancel()
…
go func() {
    res, err := w.subscriber.Wait(ctx, txHash)   // ← OUTER ctx, not subCtx
    …
}()

select {
case <-subCtx.Done():   // fires after setupDelay (5s) → caller falls through
case <-errCh:
case res := <-resCh:
    return res, nil
}

When subCtx fired the caller fell through to the gRPC poller — but the goroutine kept blocking on <-ch inside subscriber.Wait for the lifetime of the outer ctx. Its defer client.Stop() (the only thing that closes the gorilla WS conn) did not run until the outer ctx expired. For many sdk-go consumers (NM-style batch tasks with no per-call deadline) that means never — each abandoned goroutine pinned one rpchttp.Client open and therefore one WS conn to lumerad until the parent process restarted.

Fix: pass subCtx (not ctx) to subscriber.Wait. Once setupDelay fires, the goroutine's <-ctx.Done() returns immediately, defers unwind (Unsubscribe + Stop), gorilla WS is closed, server-side socket hits CLOSE_WAIT and is reaped.

Bug 2 (SECONDARY) — subscriber.go

  • subscriberID was the package-level constant "sdk-go-wait". Any future code path that shares one rpchttp.Client across concurrent Subscribe calls would collide on cometbft's client-side WSEvents.subscriptions[query] map. Today each Wait() builds its own client so this is latent, but rotated IDs (sdk-go-wait-<pid>-<seq>) keep the design forward-safe for the planned client-reuse refactor in lumera-uploader.
  • defer client.Unsubscribe(context.Background(), …) had no timeout. A slow/unresponsive server could block teardown indefinitely. Bounded to 2s; we proceed to client.Stop() regardless (which closes the socket unconditionally).
  • Added explicit defer func() { _ = client.Stop() }() pattern so no future refactor can accidentally short-circuit the Stop() call.

Risk

  • Behaviour change only on the timeout/error paths that previously leaked goroutines. Happy-path latency is unchanged — same one round-trip Subscribe, same select-loop, same first-event return.
  • No public API change.
  • No breaking change for callers; New() signature, Wait() signature, Source interface, all unchanged.
  • Rollback: revert this commit; reverts to pre-fix behaviour.

Mainnet exposure

HIGH for any validator exposing RPC publicly. Any sdk-go consumer that calls WaitForTxInclusion in a retry loop without a per-call deadline will leak WS sockets at ~20%/attempt stickiness (measured rate on val3). At ~1k failed attempts/day per client, a node with default max_open_connections=900 saturates within hours of activity.

Likely external consumers:

  • Indexers / explorers wrapping WaitForTxInclusion per scrape
  • Paymasters / relayers (fee-grant batchers)
  • Bridges via ica/controller.go:238
  • Any cron task with NM-like batch behaviour

Tests

internal/wait-tx/waiter_test.go::TestWaiterSubscriberReceivesSubCtxNotOuter

Installs a blockingSource (a Source that records its incoming ctx and blocks on <-ctx.Done()) and asserts:

  1. subscriber.Wait is invoked
  2. Its recorded ctx becomes Done within ~10× setupDelay (here 250ms for setupDelay=25ms)
  3. The goroutine actually unwinds
  4. The poller fallback is invoked exactly once

Verified the test correctly distinguishes pre- and post-fix behaviour:

# On master (before fix):
$ git stash push -- internal/wait-tx/waiter.go
$ go test -run TestWaiterSubscriberReceivesSubCtxNotOuter ./internal/wait-tx/
--- FAIL: TestWaiterSubscriberReceivesSubCtxNotOuter (0.28s)
    waiter_test.go:159: subscriber goroutine's ctx never fired within 10x setupDelay; goroutine is bound to outer ctx (leak)
FAIL

# With fix applied:
$ git stash pop
$ go test -count=1 -timeout 30s ./internal/wait-tx/...
ok  	github.com/LumeraProtocol/sdk-go/internal/wait-tx	0.043s

Full suite:

$ go test -count=1 -timeout 120s ./...
ok  	github.com/LumeraProtocol/sdk-go/blockchain/base	0.042s
ok  	github.com/LumeraProtocol/sdk-go/cascade	0.062s
ok  	github.com/LumeraProtocol/sdk-go/ica	0.047s
ok  	github.com/LumeraProtocol/sdk-go/internal/wait-tx	0.044s
ok  	github.com/LumeraProtocol/sdk-go/pkg/crypto	0.406s

Follow-ups (NOT in this PR — separate work)

  • lumera-uploader: refactor to share one rpchttp.Client per signer across the maker lifetime instead of opening one per WaitForTxInclusion. Cuts socket cost from ~1.4k/day → 3 (one per signer). Requires exposing WaitForTxInclusionWithWS(ctx, ws, txHash) in sdk-go.
  • lumera (chain): patch the default config.toml template — set experimental_close_on_slow_client = true, add a ws_read_wait = "60s" idle-timeout, expose Prometheus subscription open/close counters. Defense-in-depth so a buggy client cannot saturate a node.
  • Integration test in lumera-uploader: submit 1k failing-file requests and assert <50 lingering WS sockets afterwards.

Two bugs in internal/wait-tx that together caused a steady WebSocket-
subscription leak on CometBFT RPC :26657. Observed in production on
lumera-devnet-1 val3: ~2562 ESTABLISHED sockets accumulated over 11
days while peer validators (same image, same uptime) sat at ~16.

Bug 1 (PRIMARY) — waiter.go:57
The subscriber goroutine was invoked with the OUTER ctx instead of
subCtx. When setupDelay (default 5s) fired, the caller returned via
the subCtx.Done() arm and fell through to the gRPC poller — but the
goroutine kept blocking on <-ch inside subscriber.Wait for the
lifetime of the outer ctx. Its 'defer client.Stop()' therefore did
not run until the outer ctx expired, which for many sdk-go consumers
(network-maker / NM-style batch tasks with no per-call deadline)
meant 'never'. Each abandoned goroutine pinned one rpchttp.Client
open and therefore one WS conn to lumerad.

Fix: pass subCtx (not ctx) to subscriber.Wait. Once setupDelay fires,
the goroutine's <-ctx.Done() returns immediately, defers unwind
(Unsubscribe + Stop), gorilla.WS conn is closed, server-side socket
hits CLOSE_WAIT and is reaped.

Bug 2 (SECONDARY) — subscriber.go
- subscriberID was a package-level constant 'sdk-go-wait', meaning
  any future code path that shares an rpchttp.Client across concurrent
  Subscribe calls would collide on cometbft's client-side query map.
  Today each Wait() builds its own client so this is latent, but rotate
  IDs now to keep the design forward-safe for client reuse.
- defer client.Unsubscribe(context.Background(), ...) had no timeout.
  If the server is slow/unresponsive, Unsubscribe could block teardown
  indefinitely, defeating the whole point of guaranteed Stop(). Bound
  it to 2s; we proceed to client.Stop() regardless, which closes the
  socket unconditionally.
- Added 'always-stop' defer pattern (anonymous func wrapping
  client.Stop()) for clarity and to ensure no future refactor accidentally
  short-circuits the Stop() call.

Regression test: internal/wait-tx/waiter_test.go
TestWaiterSubscriberReceivesSubCtxNotOuter installs a blocking source
that records its incoming ctx and asserts it becomes Done within 10x
setupDelay. Without this fix the test FAILS with 'goroutine is bound
to outer ctx (leak)'; with the fix it PASSES in ~30ms.

Mainnet exposure:
HIGH for any validator with public RPC. Any sdk-go consumer that
calls WaitForTxInclusion in a retry loop without a per-call deadline
will leak WS sockets at ~20%/attempt stickiness, saturating
max_open_connections=900 within a couple hours of activity.

Risk on this PR: low. Behaviour change only on the timeout/error
paths that previously leaked goroutines; happy-path latency is
unchanged. No public API change.

Refs: ops RCA at /root/lumera/ws-leak-rca.md
Copy link
Copy Markdown
Contributor Author

@mateeullahmalik mateeullahmalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production-gate review — fix(wait-tx): plug WS-subscriber leak

HEAD SHA reviewed: d93da7fcdbf909dcba591422fc94694838c19e3d
Method: code re-read at tip + race-enabled test run + consensus-killer grep catalog + caller-impact audit.


✅ Correctness — fix matches the documented bug shape

waiter.go (Bug 1): the goroutine now receives subCtx instead of the outer ctx. Traced control-flow at tip:

  • subCtx.Done() fires after setupDelay → caller select returns
  • the spawned goroutine, also bound to subCtx, sees ctx.Done() inside subscriber.Wait on its next channel-receive
  • defers unwind LIFO → Unsubscribe (bounded 2s) → client.Stop() → gorilla WS closed → server-side socket reaped
  • buffered resCh/errCh (cap=1) prevent the goroutine blocking on the send if the caller has already moved on → no goroutine leak

subscriber.go (Bug 2):

  • newSubscriberID() uses os.Getpid() + atomic.AddUint64(&subscriberSeq, …) — unique per call, no shared-state collision risk
  • defer ordering verified: client.Stop() registered first → Unsubscribe registered second → LIFO runs Unsubscribe then Stop. Correct: must unsubscribe before stopping the client so the unsubscribe RPC isn't sent over a dead conn
  • unsubscribeTimeout = 2s bounds teardown; client.Stop() runs unconditionally on failure → socket closes regardless

✅ Tests — race-clean, regression-targeted

$ go test -race -count=1 -timeout 60s ./internal/wait-tx/...
ok      github.com/LumeraProtocol/sdk-go/internal/wait-tx       1.094s

Full suite + vet clean:

$ go test -race -count=1 ./...   →  all PASS
$ go vet ./...                   →  clean

TestWaiterSubscriberReceivesSubCtxNotOuter correctly distinguishes pre- vs post-fix behaviour (verified by inverting the diff: test fails on master). Test asserts (1) goroutine's recorded ctx fires within 10× setupDelay, (2) goroutine actually unwinds, (3) poller is invoked exactly once. Tight bound (250ms) but generous enough for loaded CI — not flaky-prone.

✅ Caller-impact audit

WaitForTxInclusion is consumed at 11 internal sites (blockchain/supernode.go ×6, blockchain/action.go ×4, ica/controller.go). No public API change (New, Wait, Source signatures all stable). Happy-path latency unchanged (same one round-trip Subscribe, same first-event return). All 11 consumers benefit from the leak fix automatically — no caller changes required.

✅ Consensus-killer grep catalog (per chain-PR safety methodology, applied even though this is client-side)

Class Hits on changed surface Status
Float arithmetic 0
time.Now / wall-clock 0 (only time.Second literals)
rand.* / non-deterministic 0
Map iteration order range txev.Result.Events + inner range e.Attributes — but flattens into a client-side return map, zero consensus surface
Panic class (MustGet, MustUnmarshal, etc.) 0
Binary decode without length check 0
Division/modulo with data-driven divisor 0
Slice indexing from data 0
Goroutine / channel 1 goroutine, buffered channels, bound to subCtx ✓ no leak

Zero consensus-state surface (sdk-go is a client library; not in EndBlock / BeginBlock / DeliverTx / InitGenesis path). Cannot halt chain. Worst-case-wrong = unchanged from pre-fix behaviour.

⚠ Observations (non-blocking, follow-up only)

  1. No log/metric when subscriber falls through to poller. If :26657 is mis-configured or unreachable, the subscriber path silently degrades and only the poller's slower latency is observed. Operationally invisible. Suggest a debug-level log + a counter — but not a merge-blocker; this is identical to pre-fix observability.
  2. select race on simultaneous resCh ready + subCtx.Done. When both arms are ready, Go picks randomly; if subCtx.Done() wins, the subscriber's valid Result is discarded and we fall through to the poller (which will fetch the same tx). Correct outcome, slight latency cost. Acceptable.
  3. Stop()/Unsubscribe finish slightly after Wait() returns (background defers in the goroutine). For up to ~2s after return, one WS socket may still be ESTABLISHED. Not a leak — it WILL close. Just a brief tail. Calling WaitForTxInclusion in a tight loop won't exceed 1 socket × concurrent-calls — bounded.
  4. unsubscribeTimeout is a package-private constant (2s). Reasonable default; not worth exposing through config unless we see real-world server-side slow-unsubscribe pathologies.

📌 Live-production confirmation

The bug shape is currently re-manifesting on lumera-devnet-1 val3 as we speak:

  • ss -ltn sport = :26657 → Recv-Q 4097, Send-Q 4096 (listen backlog FULL)
  • ~4,998 ESTABLISHED conns on :26657 (val1/2/4/5 sit at ~16)
  • external port :26687 is now unresponsive (kernel dropping SYNs)

This PR is the actual fix for the live degradation, not a theoretical one.


Verdict

🟢 SAFE TO MERGE.

Zero consensus impact (client library). Real fix to a confirmed, currently-manifesting production bug. Tests are race-clean and regression-protective at the exact bug shape. No caller changes required.

After merge, lumera-uploader needs a rebuild against the new sdk-go tag to pick up the fix on lumera-devnet-1 val3 — and ideally the val3 lumerad container restarted to fully drain the existing 5k accumulated subscribers.

Follow-ups (separate work, already enumerated in PR description):

  • lumera-uploader: share one rpchttp.Client per signer (reduces socket churn ~500×)
  • lumera chain: experimental_close_on_slow_client = true + idle WS timeout in default template as defense-in-depth
  • integration test in lumera-uploader: submit-N-failing-txs and assert <N lingering sockets

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a WebSocket subscriber lifecycle leak in internal/wait-tx by ensuring subscriber goroutines are bound to the short setup context and by tightening subscriber teardown behavior.

Changes:

  • Passes subCtx into subscriber.Wait so abandoned subscriber goroutines unwind after setupDelay.
  • Adds per-call CometBFT subscriber IDs and bounds unsubscribe teardown with a timeout.
  • Adds a regression test proving the subscriber goroutine receives the setup context and falls back to the poller.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
internal/wait-tx/waiter.go Binds the subscriber goroutine to subCtx to prevent long-lived WS connections after fallback.
internal/wait-tx/subscriber.go Adds unique subscriber IDs and bounded unsubscribe cleanup before stopping the RPC client.
internal/wait-tx/waiter_test.go Adds regression coverage for the subscriber context lifetime and poller fallback behavior.

@mateeullahmalik mateeullahmalik merged commit 5b29020 into main May 29, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants