Skip to content

feat(redis): add configurable MaxConnAge to bound failover error window#720

Merged
lucasmonje merged 1 commit into
mainfrom
feat/redis-max-conn-age
May 28, 2026
Merged

feat(redis): add configurable MaxConnAge to bound failover error window#720
lucasmonje merged 1 commit into
mainfrom
feat/redis-max-conn-age

Conversation

@lucasmonje

Copy link
Copy Markdown
Collaborator

What does this implement/fix? Explain your changes.

Adds a configurable MaxConnAge to the go-redis client used by every maestro Redis adapter, to bound the error window during an ElastiCache failover/switchover.

Why. During a switchover (failover, Redis→Valkey engine upgrade, scaling), AWS swaps the primary/replica role of the cluster's nodes. Pooled TCP conns stay attached to the node they originally dialed, which is now a replica, and any subsequent write returns:

READONLY You can't write against a read only replica.

go-redis does not treat READONLY as a bad-conn signal, so the offending conn is returned to the pool and reused, perpetuating the failure until it's recycled by some other means. IdleTimeout doesn't help on a hot pool (conns never sit idle long enough for the reaper). MaxConnAge is checked at every conn checkout, so it guarantees turnover regardless of traffic; after the bound elapses the next dial re-resolves DNS and lands on the new primary.

What.

  • createRedisClient now reads adapters.redis.maxConnAge and sets opt.MaxConnAge when > 0. Unset/0 keeps the go-redis default (no limit), so behavior is unchanged unless configured.
  • Shared knob across all six Redis adapters, mirroring the existing adapters.redis.poolSize. Env var: MAESTRO_ADAPTERS_REDIS_MAXCONNAGE (Go duration, e.g. 5m).

Operational note. Set MAESTRO_ADAPTERS_REDIS_MAXCONNAGE=5m (or shorter) on the deployment before the planned Redis→Valkey maintenance window. Safe to leave on permanently; overhead is negligible on production-sized pools and it future-proofs against unplanned failovers.

What this does not do. This bounds the post-failover error window; it does not eliminate it. A request landing on a stale conn during the swap still sees one READONLY before that conn is recycled. A retry-on-READONLY wrapper around the room-allocation Lua script would be a complementary follow-up, out of scope here.

This mirrors the matchmaker change in MR !137 (matchmaker is on go-redis v6; maestro stays on v8, where the field is still named MaxConnAge — it becomes ConnMaxLifetime in v9).

Any other comments?

  • New //go:build unit test covers both the set case (5m reaches client.Options().MaxConnAge) and the unset-keeps-default case.
  • Existing integration tests updated with the new GetDuration expectation.
  • go build, go vet -tags=integration, goimports, license check, and the new unit tests all pass locally. Integration suite (k3d) not run locally.

During an ElastiCache switchover (failover, Redis->Valkey engine upgrade,
scaling) AWS swaps the primary/replica role of the cluster's nodes. Pooled
TCP conns stay attached to the node they originally dialed -- now a replica --
and subsequent writes return 'READONLY You can't write against a read only
replica' until the conn is recycled.

MaxConnAge is the right knob: it's checked at every conn checkout, so it
guarantees turnover regardless of traffic (unlike IdleTimeout, which never
fires on a hot pool). After the bound elapses the next dial re-resolves DNS
and lands on the new primary.

Adds a shared 'adapters.redis.maxConnAge' config knob applied in
createRedisClient. Unset/0 keeps the go-redis default (no limit), so behavior
is unchanged unless operators set MAESTRO_ADAPTERS_REDIS_MAXCONNAGE (e.g. 5m)
ahead of a maintenance window.

This bounds the post-failover error window; it does not eliminate it. A
request landing on a stale conn during the swap still sees one READONLY before
recycling. A retry-on-READONLY wrapper around the room-allocation Lua script
would be a complementary follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.92%. Comparing base (7cb6e86) to head (49ea0d4).
⚠️ Report is 2 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #720      +/-   ##
==========================================
- Coverage   60.93%   60.92%   -0.01%     
==========================================
  Files         144      144              
  Lines       12986    12992       +6     
==========================================
+ Hits         7913     7916       +3     
- Misses       4719     4722       +3     
  Partials      354      354              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fbgava-wildlife fbgava-wildlife left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lucasmonje lucasmonje merged commit 51a7be0 into main May 28, 2026
6 checks passed
@lucasmonje lucasmonje deleted the feat/redis-max-conn-age branch May 28, 2026 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants