fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune by lucasmonje · Pull Request #721 · topfreegames/maestro

lucasmonje · 2026-06-16T18:34:35Z

What does this implement/fix? Explain your changes.

The roomOccupancy autoscaler input running_matches is the score-sum of scheduler:<name>:occupancy. That set accumulates orphan members that are never removed, so running_matches drifts upward and never returns to reality. Once it sits above the downscale threshold, CanDownscale is permanently false: the scheduler stops downscaling and rolling updates can't drain old-version rooms. Observed in prod: :occupancy had ~2.3k members vs ~5 in :status, running_matches pinned at 410 while the API reported 0 occupied; killing pods and switching versions didn't move it (the state is in Redis, not the pods).

Two commits, prevention + immunity:

1. Stop GetRoom from resurrecting deleted rooms into :occupancy.
GetRoom (a read) did an unconditional ZAddNX into :occupancy (added in #659 to backfill rooms predating the occupancy feature). But GetRoom is also called for rooms being/already deleted (late pings, in-flight operations), so it re-created an occupancy-only entry (score 0) that no path removes — GetAllRoomIDs only iterates :status, and ping-expiry/pod-gone cleanup only act on status-tracked rooms. Removed the write: a read must be side-effect free. CreateRoom already establishes the occupancy entry, and UpdateRoom uses ZAddXXCh precisely to avoid resurrecting deleted rooms; reads now match. After this, the only occupancy writers are CreateRoom (adds all three sets together) and UpdateRoom (XXCh, update-only) — neither can create an occupancy-only orphan.

2. Make GetRunningMatchesCount immune to any orphan that already exists.
Prevention alone doesn't help schedulers that already leaked, and operators shouldn't have to ZREM orphans by hand. GetRunningMatchesCount now sums occupancy scores only for rooms still present in :status (a running match only lives in a tracked room), via a read-only Lua script. This is self-correcting: stale occupancy members are ignored automatically, so the autoscaler and the running_matches metric stay correct even before/without any cleanup.

Net: #1 stops the leak at the source; #2 makes the autoscaler correct regardless of existing orphans — no manual reconciliation required.

Tests: added a regression test that GetRoom on a missing room creates no occupancy entry; updated the running-matches storage tests to register rooms in :status and added a case asserting an orphan occupancy entry is not counted. All internal/adapters/storage/redis/room (integration) and internal/core/services/autoscaler/... (unit) tests pass.

Does this close any currently open issues?

No tracked issue; full root-cause analysis with file:line references is in the description and commits.

Any relevant logs, error output, etc?

While stuck: scheduler <x> can't downscale, occupation is above the threshold (emitted every cycle with 0 occupied rooms).

Any other comments?

The GetRoom ZAddNX regression was introduced in #659 (commit 7590c7fa, first released v10.11.4, 2025-02-28); present through current main. With #2 in place, the historical orphan entries become harmless on deploy; cleaning them from Redis is optional hygiene, no longer required for correctness.

…cy set GetRoom did an unconditional ZAddNX into scheduler:<name>:occupancy (added in #659 to backfill rooms predating the occupancy feature). But GetRoom is also called for rooms being/already deleted (late pings, in-flight operations), so it re-created an occupancy-only entry that no cleanup removes: GetAllRoomIDs only iterates :status, and ping-expiry/pod-gone deletion only act on status-tracked rooms. Those orphans inflate GetRunningMatchesCount forever, which pins the roomOccupancy autoscaler's CanDownscale gate and stalls major-version rollouts. A read must be side-effect free. CreateRoom already establishes the occupancy entry on creation, and UpdateRoom uses ZAddXXCh precisely to avoid resurrecting deleted rooms; reads now behave the same. Adds a regression test asserting GetRoom on a missing room does not create an occupancy member. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ries GetRunningMatchesCount summed the whole :occupancy zset, so any orphan there (a member with no matching :status entry) inflated the count, which feeds the roomOccupancy autoscaler's CanDownscale gate and the running_matches metric. The GetRoom fix stops creating orphans, but it doesn't help schedulers that already accumulated them, and operators shouldn't have to ZREM them by hand. Count only occupancy scores of rooms that still exist in :status (a running match only lives in a tracked room), via a read-only Lua script. This is self-correcting: stale occupancy members are ignored automatically, so the autoscaler and metric stay correct even before any cleanup. Updates the storage tests to register rooms in :status and adds a regression test asserting an orphan occupancy entry is not counted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The first version called ZSCORE once per occupancy member to test status membership, so the Lua script held Redis's single thread for O(N) redis.call invocations. Load the status members into a Lua set with one ZRANGE, then test membership in-memory — same result, 2 redis.call total regardless of N. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lower-bound the occupancy range at (0 so Redis returns only members with a running match, instead of walking every member (ready rooms and the zero-score graveyard contribute nothing to the sum). The walk is now proportional to the number of rooms actually running matches, independent of the occupancy set size. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-06-16T19:15:24Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.93%. Comparing base (7cb6e86) to head (cf46a2e).
⚠️ Report is 3 commits behind head on main.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #721      +/-   ##
==========================================
- Coverage   60.93%   60.93%   -0.01%     
==========================================
  Files         144      144              
  Lines       12986    12985       -1     
==========================================
- Hits         7913     7912       -1     
  Misses       4719     4719              
  Partials      354      354

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fbgava-wildlife

LGTM

lucasmonjewildlife and others added 2 commits June 16, 2026 15:33

lucasmonje changed the title ~~fix(redis): stop GetRoom from resurrecting deleted rooms into the occupancy set~~ fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune Jun 16, 2026

lucasmonjewildlife and others added 2 commits June 16, 2026 16:07

fbgava-wildlife approved these changes Jun 16, 2026

View reviewed changes

lucasmonje merged commit fa972c6 into main Jun 16, 2026
6 checks passed

lucasmonje deleted the fix/getroom-occupancy-resurrection branch June 16, 2026 20:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune#721

fix(redis): occupancy leak — stop GetRoom resurrecting rooms, and make running-matches orphan-immune#721
lucasmonje merged 4 commits into
mainfrom
fix/getroom-occupancy-resurrection

lucasmonje commented Jun 16, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 16, 2026 •

edited

Loading

Uh oh!

fbgava-wildlife left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

lucasmonje commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this implement/fix? Explain your changes.

Does this close any currently open issues?

Any relevant logs, error output, etc?

Any other comments?

Uh oh!

codecov-commenter commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

fbgava-wildlife left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lucasmonje commented Jun 16, 2026 •

edited

Loading

codecov-commenter commented Jun 16, 2026 •

edited

Loading