Skip to content

chore(rpc): close slow WebSocket clients by default (defense-in-depth)#147

Open
mateeullahmalik wants to merge 1 commit into
masterfrom
chore/rpc-ws-hardening
Open

chore(rpc): close slow WebSocket clients by default (defense-in-depth)#147
mateeullahmalik wants to merge 1 commit into
masterfrom
chore/rpc-ws-hardening

Conversation

@mateeullahmalik
Copy link
Copy Markdown
Contributor

Summary

Sets CometBFT's experimental_close_on_slow_client = true via initCometBFTConfig so that any WebSocket subscriber that cannot keep up with its subscription buffer is forcibly disconnected by the server. Pure config override; no code-path change on the happy path.

Why

Defense-in-depth against the failure mode currently observed on lumera-devnet-1 val3:

  • ~5,000 ESTABLISHED WS connections on :26657
  • RPC listen queue full (Recv-Q=4097, Send-Q=4096)
  • External RPC port :26687 unresponsive — kernel silently dropping SYNs
  • Chain itself stayed healthy (P2P :26656 OK, blocks producing) — only the RPC face went dark

Root cause is a client-side socket leak in sdk-go's wait-tx subscriber, fixed in LumeraProtocol/sdk-go#17. But that fix only protects validators against our own sdk-go-based clients. A mainnet validator's :26657 is publicly exposed to arbitrary clients (third-party indexers, custom relayers, bridges, monitoring tools), and any of them can re-trigger the same saturation by leaking subscriptions.

CloseOnSlowClient = true is CometBFT's native server-side circuit breaker for exactly this pattern. The server proactively closes any WS conn that backs up beyond experimental_subscription_buffer_size events instead of waiting hours for the OS / peer to time out.

What this PR does NOT do

  • Does not add Prometheus / observability counters (separate PR — devnet is currently 100% Datadog, no Prometheus pipeline)
  • Does not change max_subscription_clients (100) or max_subscriptions_per_client (5) — both kept at upstream defaults. A new test guards against accidental relaxation.
  • Does not change anything client-side. sdk-go#17 is the matching client-side fix and ships independently.

Changes

  • cmd/lumera/cmd/config.go — set cfg.RPC.CloseOnSlowClient = true with an inline comment block referencing the val3 incident and sdk-go#17 so future maintainers understand the rationale
  • cmd/lumera/cmd/config_test.goTestInitCometBFTConfigRPCHardening asserts the override is applied and that the two subscription caps stay at the upstream defaults

Risk

  • Behaviour change is server-side only and only on the failure path. Happy-path subscribers (drain events promptly) are unaffected.
  • Worst-case for well-behaved clients: a client on a high-volume subscription that briefly stalls longer than experimental_subscription_buffer_size (default 200) events will be disconnected and must reconnect. Reconnect handling is already required for normal operations (validator restart, network blips), so this is not a new client-side requirement.
  • No protocol / state-machine surface. cmtcfg.Config flag only; consensus, app state, IBC all untouched.
  • No migration / upgrade-handler implications.

Rollback

Revert the commit — restores pre-PR behaviour exactly.

Verification

$ go vet ./cmd/lumera/cmd/...
$ go test -count=1 -timeout 90s ./cmd/lumera/cmd/...
ok  	github.com/LumeraProtocol/lumera/cmd/lumera/cmd	0.261s
$ go build ./cmd/...

Mainnet exposure

MEDIUM-HIGH — without this, any mainnet validator with public RPC remains exposed to the saturation pattern via any buggy external WS client, not just sdk-go-based ones. Should land before the next mainnet upgrade window. Not an emergency.

Follow-ups (NOT in this PR)

  • Observability: enable CometBFT's existing Prometheus endpoint (prometheus = true in config.toml) and have the Datadog agent scrape :26660/metrics via OpenMetrics, or add custom dogstatsd metrics from lumerad for WS subscriber open/close counters. Separate PR.
  • Idle WS timeout (ws_read_wait) — not exposed in CometBFT v0.38.x. Re-evaluate after a CometBFT bump.
  • lumera-uploader rebuild against the merged sdk-go (PR Fix proto import mapping #17) tag, plus client-side refactor to share rpchttp.Client per signer.

Related

Sets CometBFT's experimental_close_on_slow_client=true via initCometBFTConfig
so any WebSocket subscriber that cannot keep up with its subscription buffer
is forcibly disconnected by the server. This is defense-in-depth against the
failure mode observed on lumera-devnet-1 val3, where a client-side socket
leak in sdk-go (fixed in LumeraProtocol/sdk-go#17) accumulated ~5,000
ESTABLISHED WS connections, saturated the RPC listen backlog
(Recv-Q=4097/4096), and made external port :26687 unresponsive — while the
chain itself kept producing blocks.

The sdk-go fix stops our own client from leaking. This change protects
mainnet validators from any third-party client (indexer, relayer, custom
tooling) exhibiting the same pattern.

Subscription caps (max_subscription_clients=100, max_subscriptions_per_client=5)
are left at CometBFT defaults; the new test locks them in as a guard against
accidental relaxation that would re-open the saturation surface.

Trade-off: well-behaved clients on high-volume subscriptions that briefly
stall longer than experimental_subscription_buffer_size events will be
disconnected and must reconnect. Reconnect handling is already required for
operational reasons (validator restart, network blips); this is not a new
client-side requirement.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Sets CometBFT's CloseOnSlowClient = true as a default override in initCometBFTConfig() to forcibly disconnect slow WebSocket subscribers, providing server-side defense-in-depth against the RPC saturation pattern observed on devnet val3. Adds a test that pins this override along with upstream subscription caps.

Changes:

  • Enable cfg.RPC.CloseOnSlowClient with an extensive rationale comment referencing the val3 incident and sdk-go#17.
  • Add TestInitCometBFTConfigRPCHardening to guard the override and subscription-cap defaults.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
cmd/lumera/cmd/config.go Override CometBFT RPC CloseOnSlowClient to true with detailed in-code rationale.
cmd/lumera/cmd/config_test.go New test asserting the override and the two subscription-cap upstream defaults.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants