fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain by cchristous · Pull Request #1040 · semaphoreio/semaphore

cchristous · 2026-05-26T18:13:52Z

Summary

Fixes a long-standing GitHub-API rate-limit drain caused by POST /repos/:owner/:repo/statuses/:sha retrying forever against SHA+context combinations that have already reached GitHub's documented 1000-statuses-per-SHA limit.

In one self-hosted deployment we observed this single bug burning ~1,400 requests / 2 days on a single stuck SHA — ≈26% of the staging GitHub App's hourly budget. GitHub audit-log analysis confirmed 1,397 of 1,400 status posts were all to the same :repository_id/statuses/:sha, all returning 422.

What was broken

Two issues compounded:

is_max_statuses_response? matcher mismatched the real response shape. GitHub returns the validation error with errors as a string (see @real_422_body fixture):
```
{
  "errors": "Validation failed: This SHA and context has reached the maximum number of statuses.",
  "message": "Validation Failed",
  "status": "422"
}
```
fetch_errors/1 expected errors to be a list of objects, so it returned [] and is_max_statuses_response? always answered false. Every max-statuses 422 was therefore treated as a fatal :precondition error and propagated upstream — where the caller (in this case github_notifier) retries with a fresh ppl_id, so the dedup cache key changes and the next pipeline run does the same 422 dance.
No caller-side prevention. Even with the matcher working, every new pipeline run still hits GitHub. The 422 response costs a rate-limit unit just like a 201, so the matcher recognising it after the fact didn't save any budget.

The matcher has been broken since the very first OSS commit — git log -L:'is_max_statuses_response?':repository_hub/lib/repository_hub/clients/github_client.ex confirms it has never been modified.

What this PR does

repository_hub/:

Rewrites is_max_statuses_response? to gate on message == "Validation Failed" and substring-match "maximum number of statuses" against either a string or a list of error objects (errors_as_text/1).
Adds RepositoryHub.MaxStatusesCache — a per-pod ETS-backed GenServer keyed by {owner, repo, sha, context} with a 24h TTL. Before every create_build_status call we consult the cache; on a hit we short-circuit and return wrap(%{}) — no POST, no rate-limit pre-flight, no rate-limit unit consumed. After detecting a true max-statuses 422 we populate the cache.
Moves the 422 branch into its own handle_422_create_build_status/2 clause so the max-statuses matcher cannot inadvertently fire on 307/404/403 responses.
Adds a canary log_warn when a 422 carries message: "Validation Failed" but the canonical substring isn't found — this will surface in logs the moment GitHub rewords the error, so we can update the matcher before the rate-limit drain bug silently regresses.
Adds observability:
- github_api.create_build_status.max_statuses_cache_hit Watchman counter
- github_api.max_statuses_cache.table_missing Watchman counter + Logger.warning (the ETS table being absent is treated as a soft failure that callers should still survive, but it's now alertable instead of silent)
- Includes repo_owner/repo_name@commit_sha ctx=… in the existing log_error lines.

Why this is safe

Behaviour on success (HTTP 201) is unchanged.
Behaviour on non-max-statuses 422 / 403 / 404 / 307 is unchanged — those still log and fail_with(:precondition, …). Only the specific "maximum number of statuses" 422 is treated as success and cached. New tests cover 403/404 with a misleading body to lock this down.
wrap(%{}) short-circuit shape matches the existing max-statuses success path — the sole caller (adapters/github/create_build_status_action.ex) only pattern-matches {:ok, _} and discards the payload.
The cache is in-process ETS — no new external dependency, no new failure modes. Each pod owns its own cache; worst case after a pod restart is one wasted request per maxed SHA+context before the cache is repopulated.
24h TTL — stale entries naturally fall out; nothing pins memory for inactive SHAs forever.
:ets.whereis/1 is checked before every read/write — if the GenServer is down, callers degrade to "no cache" (the original behaviour, with observability) instead of crashing.

Tests

12 new tests across two files. All pre-existing github_client_test.exs (8 tests) and create_build_status_action_test.exs (4 tests) continue to pass.

New tests cover:

The current string-form errors shape (the real bug repro).
The legacy list-of-objects errors shape (defensive — module supports both).
A 422 with message: "Validation Failed" but an unrelated errors body is NOT cached and still surfaces as :precondition.
403 and 404 in the same case-branch as 422 are NEVER put through the matcher and never populate the cache, even with a body that contains the canonical max-statuses substring.
After the first 422, the second call short-circuits before both Tentacat.Repositories.Statuses.create/5 and the Tentacat.get("rate_limit", _) pre-flight — the whole point of the fix is to spare every rate-limit unit.
Cache hit/miss for all four key components (owner, repo, sha, context).
Expired TTL entries are treated as misses.

Test plan

mix test test/repository_hub/clients/{max_statuses_cache,github_client_max_statuses,github_client}_test.exs test/repository_hub/adapters/github/create_build_status_action_test.exs — 30 tests, 0 failures.
mix format --check-formatted
mix compile --warnings-as-errors
(vendor CI) full repository_hub suite once this PR opens. Note: the Policy::MixAudit check flags pre-existing CVEs in cowboy, hackney, and decimal in mix.lock — unrelated to this change.

…ate limit When a SHA+context combination reaches GitHub's 1000-statuses-per-SHA limit, every additional `POST /repos/:owner/:repo/statuses/:sha` is rejected with 422 and the request still costs a rate-limit unit. The previous code tried to detect this and silently succeed, but had two issues that combined to burn ~1,400 wasted requests per 2 days on a single stuck SHA at one site: 1. `is_max_statuses_response?` matcher mismatched the real response shape. GitHub returns the `errors` field as a *string*: "errors": "Validation failed: This SHA and context has reached the maximum number of statuses." The matcher expected a list of error objects, so `fetch_errors/1` returned `[]` and the check always failed. Every 422 was propagated as a fatal `:precondition` error to callers instead of being treated as already-done. 2. Even with the matcher working, no caller-side cache existed. Each new pipeline run for the same SHA still made the POST, since cache layers upstream (e.g. github_notifier's Cachex) key by pipeline id and so miss for every new run. This change introduces a small per-pod ETS-backed cache, `RepositoryHub.MaxStatusesCache`, keyed by `(owner, repo, sha, context)` with a 24h TTL. Before every `create_build_status` call we consult the cache and short-circuit if the combo is already known to be maxed. After detecting max-statuses we populate the cache. The matcher is rewritten to handle both the string and list error formats by substring-matching the canonical phrase "maximum number of statuses". Why this is safe: - Behaviour on success (HTTP 201) is unchanged. - Behaviour on non-max-statuses 422 / 403 / 404 / 307 is unchanged — those still log and `fail_with(:precondition, ...)`. Only the specific "maximum number of statuses" case is treated as success and cached. - The cache is in-process ETS — no new external dependency. Each pod owns its own cache; worst case after a pod restart is one wasted request per maxed SHA+context before the cache is repopulated. - 24h TTL means stale entries naturally fall out; nothing pins memory for inactive SHAs forever. - The cache module degrades silently (returns `false` from `maxed?/4`) if the ETS table is absent, so tests that don't start the application supervisor are not broken. Tests: - `MaxStatusesCacheTest` covers hit / miss / distinct-key behaviour. - `GithubClientMaxStatusesTest` mocks `Tentacat.Repositories.Statuses.create` to return the real GitHub 422 response shape and verifies: - the matcher accepts the response and `create_build_status` returns `{:ok, _}`; - a second call short-circuits and never invokes Tentacat; - a 422 for an unrelated reason still fails and is NOT cached. All 22 pre-existing GithubClient + adapter tests continue to pass.

Addresses the consensus findings from review: Production code: - `MaxStatusesCache.maxed?/4` and `mark_maxed/4` now check `:ets.whereis(@table)` explicitly instead of relying on a broad `rescue ArgumentError`. When the table is absent (GenServer down or restarting), we emit `Watchman.increment` and a `Logger.warning` so the degraded state is alertable instead of silent. - Gate `is_max_statuses_response?` on the actual 422 status code via a dedicated `handle_422_create_build_status/2` clause. Previously the matcher ran for 307/404/403 too, which made it theoretically possible to cache an unrelated response (extremely unlikely in practice, but the intent was muddled). - Emit a canary `log_warn` when a 422 carries `message: "Validation Failed"` but the canonical max-statuses substring is not found — i.e. when GitHub appears to have reworded the error. This converts a silent regression (the matcher stops working, the rate-limit drain returns) into a visible log line. - Include `repo_owner/repo_name@commit_sha ctx=...` in the `log_error` output so the lines are diagnosable months from now. - Add a `Watchman` counter on cache-hit short-circuits so the effectiveness of the cache is measurable. Tests: - Recognise the legacy list-of-objects `errors` shape. - Assert that the cache short-circuit also skips the `Tentacat.get ("rate_limit", _)` pre-flight, not just the create POST — that is the entire point of the fix. - Assert that a 403 / 404 in the same caller branch is NEVER cached, even with a body that would otherwise look like max-statuses. - Assert that expired entries are treated as cache misses. - Cover all four key components (owner, repo, sha, context) for cache-key isolation. - Replace the brittle `status: 9` magic-number assertion with a message-substring assertion. - Filter `:meck.history` by `:create` (and the new rate-limit count helper by `:get` with `"rate_limit"`) so the counts are precise. All 30 tests in the touched modules pass.

Replace the production-captured `confluentinc/semaphore-test` repo identifiers and the matching SHA with generic placeholders. The fixture was originally lifted from a real audit-log entry; only the body shape and `errors` string matter for the regression test.

cchristous added 2 commits May 26, 2026 13:32

github-project-automation Bot added this to Roadmap May 26, 2026

github-project-automation Bot moved this to Backlog in Roadmap May 26, 2026

This was referenced May 26, 2026

fix(github_hooks): skip git/refs lookup in BranchPayload when caller supplied a SHA #1041

Open

fix(github_hooks): make create_ref idempotent, drop probe-then-create in ensure_ref #1042

Open

cchristous marked this pull request as ready for review May 26, 2026 20:27

cchristous requested review from DamjanBecirovic, dexyk, loadez and skipi as code owners May 26, 2026 20:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040

fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040
cchristous wants to merge 3 commits into
semaphoreio:mainfrom
cchristous:cchristous/fix-status-422-rate-limit-drain

cchristous commented May 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cchristous commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What was broken

What this PR does

Why this is safe

Tests

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cchristous commented May 26, 2026 •

edited

Loading