fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx by mateeullahmalik · Pull Request #17 · LumeraProtocol/sdk-go

mateeullahmalik · 2026-05-26T11:56:51Z

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx

What & why

Two bugs in internal/wait-tx that together caused a steady WebSocket-subscription leak on CometBFT RPC :26657. Observed in production on lumera-devnet-1 val3: ~2,562 ESTABLISHED TCP sockets accumulated over 11 days while peer validators (same image, same uptime for val2/val5) sat at ~16. Only differentiator: val3 ran the network-maker (lumera-uploader) process, which exercises WaitForTxInclusion at high frequency.

Full RCA: ws-leak-rca.md.

Bug 1 (PRIMARY) — `waiter.go:57`

subCtx, cancel := context.WithTimeout(ctx, w.setupDelay)
defer cancel()
…
go func() {
    res, err := w.subscriber.Wait(ctx, txHash)   // ← OUTER ctx, not subCtx
    …
}()

select {
case <-subCtx.Done():   // fires after setupDelay (5s) → caller falls through
case <-errCh:
case res := <-resCh:
    return res, nil
}

When subCtx fired the caller fell through to the gRPC poller — but the goroutine kept blocking on <-ch inside subscriber.Wait for the lifetime of the outer ctx. Its defer client.Stop() (the only thing that closes the gorilla WS conn) did not run until the outer ctx expired. For many sdk-go consumers (NM-style batch tasks with no per-call deadline) that means never — each abandoned goroutine pinned one rpchttp.Client open and therefore one WS conn to lumerad until the parent process restarted.

Fix: pass subCtx (not ctx) to subscriber.Wait. Once setupDelay fires, the goroutine's <-ctx.Done() returns immediately, defers unwind (Unsubscribe + Stop), gorilla WS is closed, server-side socket hits CLOSE_WAIT and is reaped.

Bug 2 (SECONDARY) — `subscriber.go`

subscriberID was the package-level constant "sdk-go-wait". Any future code path that shares one rpchttp.Client across concurrent Subscribe calls would collide on cometbft's client-side WSEvents.subscriptions[query] map. Today each Wait() builds its own client so this is latent, but rotated IDs (sdk-go-wait-<pid>-<seq>) keep the design forward-safe for the planned client-reuse refactor in lumera-uploader.
defer client.Unsubscribe(context.Background(), …) had no timeout. A slow/unresponsive server could block teardown indefinitely. Bounded to 2s; we proceed to client.Stop() regardless (which closes the socket unconditionally).
Added explicit defer func() { _ = client.Stop() }() pattern so no future refactor can accidentally short-circuit the Stop() call.

Risk

Behaviour change only on the timeout/error paths that previously leaked goroutines. Happy-path latency is unchanged — same one round-trip Subscribe, same select-loop, same first-event return.
No public API change.
No breaking change for callers; New() signature, Wait() signature, Source interface, all unchanged.
Rollback: revert this commit; reverts to pre-fix behaviour.

Mainnet exposure

HIGH for any validator exposing RPC publicly. Any sdk-go consumer that calls WaitForTxInclusion in a retry loop without a per-call deadline will leak WS sockets at ~20%/attempt stickiness (measured rate on val3). At ~1k failed attempts/day per client, a node with default max_open_connections=900 saturates within hours of activity.

Likely external consumers:

Indexers / explorers wrapping WaitForTxInclusion per scrape
Paymasters / relayers (fee-grant batchers)
Bridges via ica/controller.go:238
Any cron task with NM-like batch behaviour

Tests

internal/wait-tx/waiter_test.go::TestWaiterSubscriberReceivesSubCtxNotOuter

Installs a blockingSource (a Source that records its incoming ctx and blocks on <-ctx.Done()) and asserts:

subscriber.Wait is invoked
Its recorded ctx becomes Done within ~10× setupDelay (here 250ms for setupDelay=25ms)
The goroutine actually unwinds
The poller fallback is invoked exactly once

Verified the test correctly distinguishes pre- and post-fix behaviour:

# On master (before fix):
$ git stash push -- internal/wait-tx/waiter.go
$ go test -run TestWaiterSubscriberReceivesSubCtxNotOuter ./internal/wait-tx/
--- FAIL: TestWaiterSubscriberReceivesSubCtxNotOuter (0.28s)
    waiter_test.go:159: subscriber goroutine's ctx never fired within 10x setupDelay; goroutine is bound to outer ctx (leak)
FAIL

# With fix applied:
$ git stash pop
$ go test -count=1 -timeout 30s ./internal/wait-tx/...
ok  	github.com/LumeraProtocol/sdk-go/internal/wait-tx	0.043s

Full suite:

$ go test -count=1 -timeout 120s ./...
ok  	github.com/LumeraProtocol/sdk-go/blockchain/base	0.042s
ok  	github.com/LumeraProtocol/sdk-go/cascade	0.062s
ok  	github.com/LumeraProtocol/sdk-go/ica	0.047s
ok  	github.com/LumeraProtocol/sdk-go/internal/wait-tx	0.044s
ok  	github.com/LumeraProtocol/sdk-go/pkg/crypto	0.406s

Follow-ups (NOT in this PR — separate work)

lumera-uploader: refactor to share one rpchttp.Client per signer across the maker lifetime instead of opening one per WaitForTxInclusion. Cuts socket cost from ~1.4k/day → 3 (one per signer). Requires exposing WaitForTxInclusionWithWS(ctx, ws, txHash) in sdk-go.
lumera (chain): patch the default config.toml template — set experimental_close_on_slow_client = true, add a ws_read_wait = "60s" idle-timeout, expose Prometheus subscription open/close counters. Defense-in-depth so a buggy client cannot saturate a node.
Integration test in lumera-uploader: submit 1k failing-file requests and assert <50 lingering WS sockets afterwards.

Two bugs in internal/wait-tx that together caused a steady WebSocket- subscription leak on CometBFT RPC :26657. Observed in production on lumera-devnet-1 val3: ~2562 ESTABLISHED sockets accumulated over 11 days while peer validators (same image, same uptime) sat at ~16. Bug 1 (PRIMARY) — waiter.go:57 The subscriber goroutine was invoked with the OUTER ctx instead of subCtx. When setupDelay (default 5s) fired, the caller returned via the subCtx.Done() arm and fell through to the gRPC poller — but the goroutine kept blocking on <-ch inside subscriber.Wait for the lifetime of the outer ctx. Its 'defer client.Stop()' therefore did not run until the outer ctx expired, which for many sdk-go consumers (network-maker / NM-style batch tasks with no per-call deadline) meant 'never'. Each abandoned goroutine pinned one rpchttp.Client open and therefore one WS conn to lumerad. Fix: pass subCtx (not ctx) to subscriber.Wait. Once setupDelay fires, the goroutine's <-ctx.Done() returns immediately, defers unwind (Unsubscribe + Stop), gorilla.WS conn is closed, server-side socket hits CLOSE_WAIT and is reaped. Bug 2 (SECONDARY) — subscriber.go - subscriberID was a package-level constant 'sdk-go-wait', meaning any future code path that shares an rpchttp.Client across concurrent Subscribe calls would collide on cometbft's client-side query map. Today each Wait() builds its own client so this is latent, but rotate IDs now to keep the design forward-safe for client reuse. - defer client.Unsubscribe(context.Background(), ...) had no timeout. If the server is slow/unresponsive, Unsubscribe could block teardown indefinitely, defeating the whole point of guaranteed Stop(). Bound it to 2s; we proceed to client.Stop() regardless, which closes the socket unconditionally. - Added 'always-stop' defer pattern (anonymous func wrapping client.Stop()) for clarity and to ensure no future refactor accidentally short-circuits the Stop() call. Regression test: internal/wait-tx/waiter_test.go TestWaiterSubscriberReceivesSubCtxNotOuter installs a blocking source that records its incoming ctx and asserts it becomes Done within 10x setupDelay. Without this fix the test FAILS with 'goroutine is bound to outer ctx (leak)'; with the fix it PASSES in ~30ms. Mainnet exposure: HIGH for any validator with public RPC. Any sdk-go consumer that calls WaitForTxInclusion in a retry loop without a per-call deadline will leak WS sockets at ~20%/attempt stickiness, saturating max_open_connections=900 within a couple hours of activity. Risk on this PR: low. Behaviour change only on the timeout/error paths that previously leaked goroutines; happy-path latency is unchanged. No public API change. Refs: ops RCA at /root/lumera/ws-leak-rca.md

mateeullahmalik

Production-gate review — `fix(wait-tx): plug WS-subscriber leak`

HEAD SHA reviewed: d93da7fcdbf909dcba591422fc94694838c19e3d
Method: code re-read at tip + race-enabled test run + consensus-killer grep catalog + caller-impact audit.

✅ Correctness — fix matches the documented bug shape

waiter.go (Bug 1): the goroutine now receives subCtx instead of the outer ctx. Traced control-flow at tip:

subCtx.Done() fires after setupDelay → caller select returns
the spawned goroutine, also bound to subCtx, sees ctx.Done() inside subscriber.Wait on its next channel-receive
defers unwind LIFO → Unsubscribe (bounded 2s) → client.Stop() → gorilla WS closed → server-side socket reaped
buffered resCh/errCh (cap=1) prevent the goroutine blocking on the send if the caller has already moved on → no goroutine leak

subscriber.go (Bug 2):

newSubscriberID() uses os.Getpid() + atomic.AddUint64(&subscriberSeq, …) — unique per call, no shared-state collision risk
defer ordering verified: client.Stop() registered first → Unsubscribe registered second → LIFO runs Unsubscribe then Stop. Correct: must unsubscribe before stopping the client so the unsubscribe RPC isn't sent over a dead conn
unsubscribeTimeout = 2s bounds teardown; client.Stop() runs unconditionally on failure → socket closes regardless

✅ Tests — race-clean, regression-targeted

$ go test -race -count=1 -timeout 60s ./internal/wait-tx/...
ok      github.com/LumeraProtocol/sdk-go/internal/wait-tx       1.094s

Full suite + vet clean:

$ go test -race -count=1 ./...   →  all PASS
$ go vet ./...                   →  clean

TestWaiterSubscriberReceivesSubCtxNotOuter correctly distinguishes pre- vs post-fix behaviour (verified by inverting the diff: test fails on master). Test asserts (1) goroutine's recorded ctx fires within 10× setupDelay, (2) goroutine actually unwinds, (3) poller is invoked exactly once. Tight bound (250ms) but generous enough for loaded CI — not flaky-prone.

✅ Caller-impact audit

WaitForTxInclusion is consumed at 11 internal sites (blockchain/supernode.go ×6, blockchain/action.go ×4, ica/controller.go). No public API change (New, Wait, Source signatures all stable). Happy-path latency unchanged (same one round-trip Subscribe, same first-event return). All 11 consumers benefit from the leak fix automatically — no caller changes required.

✅ Consensus-killer grep catalog (per chain-PR safety methodology, applied even though this is client-side)

Class	Hits on changed surface	Status
Float arithmetic	0	✓
`time.Now` / wall-clock	0 (only `time.Second` literals)	✓
`rand.*` / non-deterministic	0	✓
Map iteration order	`range txev.Result.Events` + inner `range e.Attributes` — but flattens into a client-side return map, zero consensus surface	✓
Panic class (`MustGet`, `MustUnmarshal`, etc.)	0	✓
Binary decode without length check	0	✓
Division/modulo with data-driven divisor	0	✓
Slice indexing from data	0	✓
Goroutine / channel	1 goroutine, buffered channels, bound to subCtx	✓ no leak

Zero consensus-state surface (sdk-go is a client library; not in EndBlock / BeginBlock / DeliverTx / InitGenesis path). Cannot halt chain. Worst-case-wrong = unchanged from pre-fix behaviour.

⚠ Observations (non-blocking, follow-up only)

No log/metric when subscriber falls through to poller. If :26657 is mis-configured or unreachable, the subscriber path silently degrades and only the poller's slower latency is observed. Operationally invisible. Suggest a debug-level log + a counter — but not a merge-blocker; this is identical to pre-fix observability.
select race on simultaneous resCh ready + subCtx.Done. When both arms are ready, Go picks randomly; if subCtx.Done() wins, the subscriber's valid Result is discarded and we fall through to the poller (which will fetch the same tx). Correct outcome, slight latency cost. Acceptable.
Stop()/Unsubscribe finish slightly after Wait() returns (background defers in the goroutine). For up to ~2s after return, one WS socket may still be ESTABLISHED. Not a leak — it WILL close. Just a brief tail. Calling WaitForTxInclusion in a tight loop won't exceed 1 socket × concurrent-calls — bounded.
unsubscribeTimeout is a package-private constant (2s). Reasonable default; not worth exposing through config unless we see real-world server-side slow-unsubscribe pathologies.

📌 Live-production confirmation

The bug shape is currently re-manifesting on lumera-devnet-1 val3 as we speak:

ss -ltn sport = :26657 → Recv-Q 4097, Send-Q 4096 (listen backlog FULL)
~4,998 ESTABLISHED conns on :26657 (val1/2/4/5 sit at ~16)
external port :26687 is now unresponsive (kernel dropping SYNs)

This PR is the actual fix for the live degradation, not a theoretical one.

Verdict

🟢 SAFE TO MERGE.

Zero consensus impact (client library). Real fix to a confirmed, currently-manifesting production bug. Tests are race-clean and regression-protective at the exact bug shape. No caller changes required.

After merge, lumera-uploader needs a rebuild against the new sdk-go tag to pick up the fix on lumera-devnet-1 val3 — and ideally the val3 lumerad container restarted to fully drain the existing 5k accumulated subscribers.

Follow-ups (separate work, already enumerated in PR description):

lumera-uploader: share one rpchttp.Client per signer (reduces socket churn ~500×)
lumera chain: experimental_close_on_slow_client = true + idle WS timeout in default template as defense-in-depth
integration test in lumera-uploader: submit-N-failing-txs and assert <N lingering sockets

Copilot

Pull request overview

This PR fixes a WebSocket subscriber lifecycle leak in internal/wait-tx by ensuring subscriber goroutines are bound to the short setup context and by tightening subscriber teardown behavior.

Changes:

Passes subCtx into subscriber.Wait so abandoned subscriber goroutines unwind after setupDelay.
Adds per-call CometBFT subscriber IDs and bounds unsubscribe teardown with a timeout.
Adds a regression test proving the subscriber goroutine receives the setup context and falls back to the poller.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
`internal/wait-tx/waiter.go`	Binds the subscriber goroutine to `subCtx` to prevent long-lived WS connections after fallback.
`internal/wait-tx/subscriber.go`	Adds unique subscriber IDs and bounded unsubscribe cleanup before stopping the RPC client.
`internal/wait-tx/waiter_test.go`	Adds regression coverage for the subscriber context lifetime and poller fallback behavior.

mateeullahmalik commented May 28, 2026

View reviewed changes

mateeullahmalik requested a review from akobrin1 May 28, 2026 19:14

mateeullahmalik self-assigned this May 28, 2026

mateeullahmalik requested a review from j-rafique May 28, 2026 19:14

a-ok123 requested a review from Copilot May 29, 2026 01:33

Copilot started reviewing on behalf of a-ok123 May 29, 2026 01:33 View session

Copilot AI reviewed May 29, 2026

View reviewed changes

j-rafique approved these changes May 29, 2026

View reviewed changes

mateeullahmalik merged commit 5b29020 into main May 29, 2026
1 check passed

mateeullahmalik mentioned this pull request May 29, 2026

chore(rpc): close slow WebSocket clients by default (defense-in-depth) LumeraProtocol/lumera#147

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx#17

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx#17
mateeullahmalik merged 1 commit into
mainfrom
fix/wait-tx-subscriber-leak

mateeullahmalik commented May 26, 2026

Uh oh!

mateeullahmalik left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mateeullahmalik commented May 26, 2026

fix(wait-tx): plug WS-subscriber leak; bind goroutine to subCtx

What & why

Bug 1 (PRIMARY) — waiter.go:57

Bug 2 (SECONDARY) — subscriber.go

Risk

Mainnet exposure

Tests

Follow-ups (NOT in this PR — separate work)

Uh oh!

mateeullahmalik left a comment

Choose a reason for hiding this comment

Production-gate review — fix(wait-tx): plug WS-subscriber leak

✅ Correctness — fix matches the documented bug shape

✅ Tests — race-clean, regression-targeted

✅ Caller-impact audit

✅ Consensus-killer grep catalog (per chain-PR safety methodology, applied even though this is client-side)

⚠ Observations (non-blocking, follow-up only)

📌 Live-production confirmation

Verdict

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Bug 1 (PRIMARY) — `waiter.go:57`

Bug 2 (SECONDARY) — `subscriber.go`

Production-gate review — `fix(wait-tx): plug WS-subscriber leak`