Skip to content

fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune#721

Merged
lucasmonje merged 4 commits into
mainfrom
fix/getroom-occupancy-resurrection
Jun 16, 2026
Merged

fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune#721
lucasmonje merged 4 commits into
mainfrom
fix/getroom-occupancy-resurrection

Conversation

@lucasmonje

@lucasmonje lucasmonje commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

What does this implement/fix? Explain your changes.

The roomOccupancy autoscaler input running_matches is the score-sum of scheduler:<name>:occupancy. That set accumulates orphan members that are never removed, so running_matches drifts upward and never returns to reality. Once it sits above the downscale threshold, CanDownscale is permanently false: the scheduler stops downscaling and rolling updates can't drain old-version rooms. Observed in prod: :occupancy had ~2.3k members vs ~5 in :status, running_matches pinned at 410 while the API reported 0 occupied; killing pods and switching versions didn't move it (the state is in Redis, not the pods).

Two commits, prevention + immunity:

1. Stop GetRoom from resurrecting deleted rooms into :occupancy.
GetRoom (a read) did an unconditional ZAddNX into :occupancy (added in #659 to backfill rooms predating the occupancy feature). But GetRoom is also called for rooms being/already deleted (late pings, in-flight operations), so it re-created an occupancy-only entry (score 0) that no path removes — GetAllRoomIDs only iterates :status, and ping-expiry/pod-gone cleanup only act on status-tracked rooms. Removed the write: a read must be side-effect free. CreateRoom already establishes the occupancy entry, and UpdateRoom uses ZAddXXCh precisely to avoid resurrecting deleted rooms; reads now match. After this, the only occupancy writers are CreateRoom (adds all three sets together) and UpdateRoom (XXCh, update-only) — neither can create an occupancy-only orphan.

2. Make GetRunningMatchesCount immune to any orphan that already exists.
Prevention alone doesn't help schedulers that already leaked, and operators shouldn't have to ZREM orphans by hand. GetRunningMatchesCount now sums occupancy scores only for rooms still present in :status (a running match only lives in a tracked room), via a read-only Lua script. This is self-correcting: stale occupancy members are ignored automatically, so the autoscaler and the running_matches metric stay correct even before/without any cleanup.

Net: #1 stops the leak at the source; #2 makes the autoscaler correct regardless of existing orphans — no manual reconciliation required.

Tests: added a regression test that GetRoom on a missing room creates no occupancy entry; updated the running-matches storage tests to register rooms in :status and added a case asserting an orphan occupancy entry is not counted. All internal/adapters/storage/redis/room (integration) and internal/core/services/autoscaler/... (unit) tests pass.

Does this close any currently open issues?

No tracked issue; full root-cause analysis with file:line references is in the description and commits.

Any relevant logs, error output, etc?

While stuck: scheduler <x> can't downscale, occupation is above the threshold (emitted every cycle with 0 occupied rooms).

Any other comments?

The GetRoom ZAddNX regression was introduced in #659 (commit 7590c7fa, first released v10.11.4, 2025-02-28); present through current main. With #2 in place, the historical orphan entries become harmless on deploy; cleaning them from Redis is optional hygiene, no longer required for correctness.

lucasmonjewildlife and others added 2 commits June 16, 2026 15:33
…cy set

GetRoom did an unconditional ZAddNX into scheduler:<name>:occupancy (added in
#659 to backfill rooms predating the occupancy feature). But GetRoom is also
called for rooms being/already deleted (late pings, in-flight operations), so it
re-created an occupancy-only entry that no cleanup removes: GetAllRoomIDs only
iterates :status, and ping-expiry/pod-gone deletion only act on status-tracked
rooms. Those orphans inflate GetRunningMatchesCount forever, which pins the
roomOccupancy autoscaler's CanDownscale gate and stalls major-version rollouts.

A read must be side-effect free. CreateRoom already establishes the occupancy
entry on creation, and UpdateRoom uses ZAddXXCh precisely to avoid resurrecting
deleted rooms; reads now behave the same. Adds a regression test asserting
GetRoom on a missing room does not create an occupancy member.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ries

GetRunningMatchesCount summed the whole :occupancy zset, so any orphan there
(a member with no matching :status entry) inflated the count, which feeds the
roomOccupancy autoscaler's CanDownscale gate and the running_matches metric. The
GetRoom fix stops creating orphans, but it doesn't help schedulers that already
accumulated them, and operators shouldn't have to ZREM them by hand.

Count only occupancy scores of rooms that still exist in :status (a running
match only lives in a tracked room), via a read-only Lua script. This is
self-correcting: stale occupancy members are ignored automatically, so the
autoscaler and metric stay correct even before any cleanup. Updates the storage
tests to register rooms in :status and adds a regression test asserting an
orphan occupancy entry is not counted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@lucasmonje lucasmonje changed the title fix(redis): stop GetRoom from resurrecting deleted rooms into the occupancy set fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune Jun 16, 2026
lucasmonjewildlife and others added 2 commits June 16, 2026 16:07
The first version called ZSCORE once per occupancy member to test status
membership, so the Lua script held Redis's single thread for O(N) redis.call
invocations. Load the status members into a Lua set with one ZRANGE, then test
membership in-memory — same result, 2 redis.call total regardless of N.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lower-bound the occupancy range at (0 so Redis returns only members with a
running match, instead of walking every member (ready rooms and the zero-score
graveyard contribute nothing to the sum). The walk is now proportional to the
number of rooms actually running matches, independent of the occupancy set size.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Jun 16, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.93%. Comparing base (7cb6e86) to head (cf46a2e).
⚠️ Report is 3 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #721      +/-   ##
==========================================
- Coverage   60.93%   60.93%   -0.01%     
==========================================
  Files         144      144              
  Lines       12986    12985       -1     
==========================================
- Hits         7913     7912       -1     
  Misses       4719     4719              
  Partials      354      354              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@fbgava-wildlife fbgava-wildlife left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lucasmonje lucasmonje merged commit fa972c6 into main Jun 16, 2026
6 checks passed
@lucasmonje lucasmonje deleted the fix/getroom-occupancy-resurrection branch June 16, 2026 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants