feat(redis): add configurable MaxConnAge to bound failover error window#720
Merged
Conversation
During an ElastiCache switchover (failover, Redis->Valkey engine upgrade, scaling) AWS swaps the primary/replica role of the cluster's nodes. Pooled TCP conns stay attached to the node they originally dialed -- now a replica -- and subsequent writes return 'READONLY You can't write against a read only replica' until the conn is recycled. MaxConnAge is the right knob: it's checked at every conn checkout, so it guarantees turnover regardless of traffic (unlike IdleTimeout, which never fires on a hot pool). After the bound elapses the next dial re-resolves DNS and lands on the new primary. Adds a shared 'adapters.redis.maxConnAge' config knob applied in createRedisClient. Unset/0 keeps the go-redis default (no limit), so behavior is unchanged unless operators set MAESTRO_ADAPTERS_REDIS_MAXCONNAGE (e.g. 5m) ahead of a maintenance window. This bounds the post-failover error window; it does not eliminate it. A request landing on a stale conn during the swap still sees one READONLY before recycling. A retry-on-READONLY wrapper around the room-allocation Lua script would be a complementary follow-up. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #720 +/- ##
==========================================
- Coverage 60.93% 60.92% -0.01%
==========================================
Files 144 144
Lines 12986 12992 +6
==========================================
+ Hits 7913 7916 +3
- Misses 4719 4722 +3
Partials 354 354 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this implement/fix? Explain your changes.
Adds a configurable
MaxConnAgeto the go-redis client used by every maestro Redis adapter, to bound the error window during an ElastiCache failover/switchover.Why. During a switchover (failover, Redis→Valkey engine upgrade, scaling), AWS swaps the primary/replica role of the cluster's nodes. Pooled TCP conns stay attached to the node they originally dialed, which is now a replica, and any subsequent write returns:
go-redis does not treat
READONLYas a bad-conn signal, so the offending conn is returned to the pool and reused, perpetuating the failure until it's recycled by some other means.IdleTimeoutdoesn't help on a hot pool (conns never sit idle long enough for the reaper).MaxConnAgeis checked at every conn checkout, so it guarantees turnover regardless of traffic; after the bound elapses the next dial re-resolves DNS and lands on the new primary.What.
createRedisClientnow readsadapters.redis.maxConnAgeand setsopt.MaxConnAgewhen > 0. Unset/0 keeps the go-redis default (no limit), so behavior is unchanged unless configured.adapters.redis.poolSize. Env var:MAESTRO_ADAPTERS_REDIS_MAXCONNAGE(Go duration, e.g.5m).Operational note. Set
MAESTRO_ADAPTERS_REDIS_MAXCONNAGE=5m(or shorter) on the deployment before the planned Redis→Valkey maintenance window. Safe to leave on permanently; overhead is negligible on production-sized pools and it future-proofs against unplanned failovers.What this does not do. This bounds the post-failover error window; it does not eliminate it. A request landing on a stale conn during the swap still sees one
READONLYbefore that conn is recycled. A retry-on-READONLY wrapper around the room-allocation Lua script would be a complementary follow-up, out of scope here.This mirrors the matchmaker change in MR !137 (matchmaker is on go-redis v6; maestro stays on v8, where the field is still named
MaxConnAge— it becomesConnMaxLifetimein v9).Any other comments?
//go:build unittest covers both the set case (5mreachesclient.Options().MaxConnAge) and the unset-keeps-default case.GetDurationexpectation.go build,go vet -tags=integration,goimports, license check, and the new unit tests all pass locally. Integration suite (k3d) not run locally.