fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune#721
Merged
Merged
Conversation
…cy set GetRoom did an unconditional ZAddNX into scheduler:<name>:occupancy (added in #659 to backfill rooms predating the occupancy feature). But GetRoom is also called for rooms being/already deleted (late pings, in-flight operations), so it re-created an occupancy-only entry that no cleanup removes: GetAllRoomIDs only iterates :status, and ping-expiry/pod-gone deletion only act on status-tracked rooms. Those orphans inflate GetRunningMatchesCount forever, which pins the roomOccupancy autoscaler's CanDownscale gate and stalls major-version rollouts. A read must be side-effect free. CreateRoom already establishes the occupancy entry on creation, and UpdateRoom uses ZAddXXCh precisely to avoid resurrecting deleted rooms; reads now behave the same. Adds a regression test asserting GetRoom on a missing room does not create an occupancy member. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ries GetRunningMatchesCount summed the whole :occupancy zset, so any orphan there (a member with no matching :status entry) inflated the count, which feeds the roomOccupancy autoscaler's CanDownscale gate and the running_matches metric. The GetRoom fix stops creating orphans, but it doesn't help schedulers that already accumulated them, and operators shouldn't have to ZREM them by hand. Count only occupancy scores of rooms that still exist in :status (a running match only lives in a tracked room), via a read-only Lua script. This is self-correcting: stale occupancy members are ignored automatically, so the autoscaler and metric stay correct even before any cleanup. Updates the storage tests to register rooms in :status and adds a regression test asserting an orphan occupancy entry is not counted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The first version called ZSCORE once per occupancy member to test status membership, so the Lua script held Redis's single thread for O(N) redis.call invocations. Load the status members into a Lua set with one ZRANGE, then test membership in-memory — same result, 2 redis.call total regardless of N. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lower-bound the occupancy range at (0 so Redis returns only members with a running match, instead of walking every member (ready rooms and the zero-score graveyard contribute nothing to the sum). The walk is now proportional to the number of rooms actually running matches, independent of the occupancy set size. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #721 +/- ##
==========================================
- Coverage 60.93% 60.93% -0.01%
==========================================
Files 144 144
Lines 12986 12985 -1
==========================================
- Hits 7913 7912 -1
Misses 4719 4719
Partials 354 354 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this implement/fix? Explain your changes.
The
roomOccupancyautoscaler inputrunning_matchesis the score-sum ofscheduler:<name>:occupancy. That set accumulates orphan members that are never removed, sorunning_matchesdrifts upward and never returns to reality. Once it sits above the downscale threshold,CanDownscaleis permanently false: the scheduler stops downscaling and rolling updates can't drain old-version rooms. Observed in prod::occupancyhad ~2.3k members vs ~5 in:status,running_matchespinned at 410 while the API reported 0 occupied; killing pods and switching versions didn't move it (the state is in Redis, not the pods).Two commits, prevention + immunity:
1. Stop
GetRoomfrom resurrecting deleted rooms into:occupancy.GetRoom(a read) did an unconditionalZAddNXinto:occupancy(added in #659 to backfill rooms predating the occupancy feature). ButGetRoomis also called for rooms being/already deleted (late pings, in-flight operations), so it re-created an occupancy-only entry (score 0) that no path removes —GetAllRoomIDsonly iterates:status, and ping-expiry/pod-gone cleanup only act on status-tracked rooms. Removed the write: a read must be side-effect free.CreateRoomalready establishes the occupancy entry, andUpdateRoomusesZAddXXChprecisely to avoid resurrecting deleted rooms; reads now match. After this, the only occupancy writers areCreateRoom(adds all three sets together) andUpdateRoom(XXCh, update-only) — neither can create an occupancy-only orphan.2. Make
GetRunningMatchesCountimmune to any orphan that already exists.Prevention alone doesn't help schedulers that already leaked, and operators shouldn't have to
ZREMorphans by hand.GetRunningMatchesCountnow sums occupancy scores only for rooms still present in:status(a running match only lives in a tracked room), via a read-only Lua script. This is self-correcting: stale occupancy members are ignored automatically, so the autoscaler and therunning_matchesmetric stay correct even before/without any cleanup.Net: #1 stops the leak at the source; #2 makes the autoscaler correct regardless of existing orphans — no manual reconciliation required.
Tests: added a regression test that
GetRoomon a missing room creates no occupancy entry; updated the running-matches storage tests to register rooms in:statusand added a case asserting an orphan occupancy entry is not counted. Allinternal/adapters/storage/redis/room(integration) andinternal/core/services/autoscaler/...(unit) tests pass.Does this close any currently open issues?
No tracked issue; full root-cause analysis with
file:linereferences is in the description and commits.Any relevant logs, error output, etc?
While stuck:
scheduler <x> can't downscale, occupation is above the threshold(emitted every cycle with 0 occupied rooms).Any other comments?
The
GetRoomZAddNXregression was introduced in #659 (commit7590c7fa, first released v10.11.4, 2025-02-28); present through currentmain. With #2 in place, the historical orphan entries become harmless on deploy; cleaning them from Redis is optional hygiene, no longer required for correctness.