chore(rpc): close slow WebSocket clients by default (defense-in-depth) #147
Open
mateeullahmalik wants to merge 1 commit into
Sets CometBFT's experimental_close_on_slow_client=true via initCometBFTConfig so any WebSocket subscriber that cannot keep up with its subscription buffer is forcibly disconnected by the server.
This is defense-in-depth against the failure mode observed on lumera-devnet-1 val3, where a client-side socket leak in sdk-go (fixed in LumeraProtocol/sdk-go#17) accumulated ~5,000 ESTABLISHED WS connections, saturated the RPC listen backlog (Recv-Q=4097/4096), and made external port :26687 unresponsive, while the chain itself kept producing blocks.
The sdk-go fix stops our own client from leaking. This change protects mainnet validators from any third-party client (indexer, relayer, custom tooling) exhibiting the same pattern.
Subscription caps (max_subscription_clients=100, max_subscriptions_per_client=5) are left at CometBFT defaults; the new test locks them in as a guard against accidental relaxation that would re-open the saturation surface.
Trade-off: well-behaved clients on high-volume subscriptions that briefly stall longer than experimental_subscription_buffer_size events will be disconnected and must reconnect. Reconnect handling is already required for operational reasons (validator restart, network blips); this is not a new client-side requirement.
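To make the trade-off concrete, here is a minimal, stdlib-only model of a bounded subscription buffer that closes the subscriber once it falls behind. It is a simplified sketch, not CometBFT's implementation: the `subscription` type, `newSubscription`, and `publish` are hypothetical names, and with the flag off this model simply drops the event, which may differ from what the real server does.

```go
package main

import "fmt"

// subscription is a hypothetical stand-in for a server-side WebSocket
// subscription: a bounded event buffer plus a closed flag.
type subscription struct {
	events chan string
	closed bool
}

// newSubscription creates a subscription whose buffer holds bufferSize events.
func newSubscription(bufferSize int) *subscription {
	return &subscription{events: make(chan string, bufferSize)}
}

// publish tries to enqueue an event without blocking. If the buffer is full
// and closeOnSlowClient is set, the subscription is closed instead of letting
// the backlog grow. With the flag off, this model just drops the event.
// It reports whether the event was enqueued.
func (s *subscription) publish(ev string, closeOnSlowClient bool) bool {
	if s.closed {
		return false
	}
	select {
	case s.events <- ev:
		return true
	default:
		if closeOnSlowClient {
			s.closed = true
			close(s.events)
		}
		return false
	}
}

func main() {
	// A client that never reads, with a buffer of 2 events.
	sub := newSubscription(2)
	for i := 1; i <= 4; i++ {
		ok := sub.publish(fmt.Sprintf("block-%d", i), true)
		fmt.Printf("event %d enqueued=%v closed=%v\n", i, ok, sub.closed)
	}
}
```

With a buffer of 2 and a client that never drains it, the third event finds the buffer full and triggers the disconnect. A client that stalls only briefly hits the same path, which is why reconnect handling matters.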
Pull request overview
Sets CometBFT's CloseOnSlowClient = true as a default override in initCometBFTConfig() to forcibly disconnect slow WebSocket subscribers, providing server-side defense-in-depth against the RPC saturation pattern observed on devnet val3. Adds a test that pins this override along with upstream subscription caps.
Changes:
- Enable cfg.RPC.CloseOnSlowClient with an extensive rationale comment referencing the val3 incident and sdk-go#17.
- Add TestInitCometBFTConfigRPCHardening to guard the override and subscription-cap defaults.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| cmd/lumera/cmd/config.go | Override CometBFT RPC CloseOnSlowClient to true with detailed in-code rationale. |
| cmd/lumera/cmd/config_test.go | New test asserting the override and the two subscription-cap upstream defaults. |
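As an illustration of what a guard like the one described for config_test.go might check, the sketch below validates the three settings against the values quoted in this PR. It does not reproduce the actual test: `rpcSettings` is a hypothetical stand-in for the CometBFT RPC config, and `hardeningViolations` is an invented helper, so the example runs without the CometBFT module.

```go
package main

import "fmt"

// rpcSettings is a hypothetical stand-in for the RPC config fields named in
// this PR; the real test reads them from the config built by initCometBFTConfig.
type rpcSettings struct {
	CloseOnSlowClient         bool
	MaxSubscriptionClients    int
	MaxSubscriptionsPerClient int
}

// Expected values, taken from the PR description.
const (
	wantMaxSubscriptionClients    = 100
	wantMaxSubscriptionsPerClient = 5
)

// hardeningViolations returns one message per setting that deviates from the
// hardened values; an empty result means the config passes the guard.
func hardeningViolations(s rpcSettings) []string {
	var out []string
	if !s.CloseOnSlowClient {
		out = append(out, "CloseOnSlowClient must be true")
	}
	if s.MaxSubscriptionClients != wantMaxSubscriptionClients {
		out = append(out, fmt.Sprintf("MaxSubscriptionClients = %d, want %d",
			s.MaxSubscriptionClients, wantMaxSubscriptionClients))
	}
	if s.MaxSubscriptionsPerClient != wantMaxSubscriptionsPerClient {
		out = append(out, fmt.Sprintf("MaxSubscriptionsPerClient = %d, want %d",
			s.MaxSubscriptionsPerClient, wantMaxSubscriptionsPerClient))
	}
	return out
}

func main() {
	// A config where someone relaxed the client cap by accident.
	relaxed := rpcSettings{
		CloseOnSlowClient:         true,
		MaxSubscriptionClients:    1000,
		MaxSubscriptionsPerClient: 5,
	}
	for _, v := range hardeningViolations(relaxed) {
		fmt.Println("violation:", v)
	}
}
```

Pinning the upstream defaults in a check like this means a later change to either cap has to be made deliberately, by editing the expected values too.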
Summary
Sets CometBFT's experimental_close_on_slow_client = true via initCometBFTConfig so that any WebSocket subscriber that cannot keep up with its subscription buffer is forcibly disconnected by the server. Pure config override; no code-path change on the happy path.
Why
Defense-in-depth against the failure mode currently observed on lumera-devnet-1 val3:
- ~5,000 ESTABLISHED WS connections on :26657
- RPC listen backlog saturated (Recv-Q=4097, Send-Q=4096)
- External port :26687 unresponsive, with the kernel silently dropping SYNs
Root cause is a client-side socket leak in sdk-go's wait-tx subscriber, fixed in LumeraProtocol/sdk-go#17. But that fix only protects validators against our own sdk-go-based clients. A mainnet validator's :26657 is publicly exposed to arbitrary clients (third-party indexers, custom relayers, bridges, monitoring tools), and any of them can re-trigger the same saturation by leaking subscriptions.
CloseOnSlowClient = true is CometBFT's native server-side circuit breaker for exactly this pattern. The server proactively closes any WS connection that backs up beyond experimental_subscription_buffer_size events instead of waiting hours for the OS or the peer to time out.
What this PR does NOT do
- Does not change max_subscription_clients (100) or max_subscriptions_per_client (5); both are kept at upstream defaults. A new test guards against accidental relaxation.
Changes
- cmd/lumera/cmd/config.go: set cfg.RPC.CloseOnSlowClient = true with an inline comment block referencing the val3 incident and sdk-go#17 so future maintainers understand the rationale
- cmd/lumera/cmd/config_test.go: TestInitCometBFTConfigRPCHardening asserts the override is applied and that the two subscription caps stay at the upstream defaults
Risk
- Well-behaved clients on high-volume subscriptions that briefly stall longer than experimental_subscription_buffer_size (default 200) events will be disconnected and must reconnect. Reconnect handling is already required for normal operations (validator restart, network blips), so this is not a new client-side requirement.
- cmtcfg.Config flag only; consensus, app state, IBC all untouched.
Rollback
Revert the commit — restores pre-PR behaviour exactly.
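The Risk section relies on clients already handling reconnects. For client authors who want a reference point, here is a minimal sketch of such a loop with capped exponential backoff. It is not taken from sdk-go or CometBFT; `session`, `errStop`, `errDisconnected`, and `runWithReconnect` are hypothetical names, and a real client would dial the WebSocket and resubscribe inside the session function.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// session is a hypothetical abstraction over one WebSocket subscription
// session: it blocks until the connection ends and returns the reason.
type session func() error

var (
	errStop         = errors.New("stop requested")       // caller asked to shut down
	errDisconnected = errors.New("server closed socket") // e.g. closed as a slow client
)

// runWithReconnect re-runs the session until it returns errStop or
// maxAttempts is exhausted. The delay doubles after each failure, capped at
// maxDelay. It returns the number of sessions that were run.
func runWithReconnect(run session, maxAttempts int, baseDelay, maxDelay time.Duration) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	delay := baseDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = run()
		if errors.Is(err, errStop) {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a server that disconnects the client twice before a clean stop.
	calls := 0
	run := func() error {
		calls++
		if calls <= 2 {
			return errDisconnected
		}
		return errStop
	}
	attempts, err := runWithReconnect(run, 5, 10*time.Millisecond, time.Second)
	fmt.Println("sessions run:", attempts, "error:", err)
}
```

Capping the backoff keeps a client from hammering a validator that is already under load while still recovering quickly from a single forced disconnect.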
Verification
Mainnet exposure
MEDIUM-HIGH — without this, any mainnet validator with public RPC remains exposed to the saturation pattern via any buggy external WS client, not just sdk-go-based ones. Should land before the next mainnet upgrade window. Not an emergency.
Follow-ups (NOT in this PR)
- [...] (prometheus = true in config.toml) and have the Datadog agent scrape :26660/metrics via OpenMetrics, or add custom dogstatsd metrics from lumerad for WS subscriber open/close counters. Separate PR.
- [...] (ws_read_wait): not exposed in CometBFT v0.38.x. Re-evaluate after a CometBFT bump.
- [...] rpchttp.Client per signer.
Related