Skip to content

fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040

Open
cchristous wants to merge 3 commits into
semaphoreio:mainfrom
cchristous:cchristous/fix-status-422-rate-limit-drain
Open

fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040
cchristous wants to merge 3 commits into
semaphoreio:mainfrom
cchristous:cchristous/fix-status-422-rate-limit-drain

Conversation

@cchristous
Copy link
Copy Markdown
Contributor

@cchristous cchristous commented May 26, 2026

Summary

Fixes a long-standing GitHub-API rate-limit drain caused by POST /repos/:owner/:repo/statuses/:sha retrying forever against SHA+context combinations that have already reached GitHub's documented 1000-statuses-per-SHA limit.

In one self-hosted deployment we observed this single bug burning ~1,400 requests / 2 days on a single stuck SHA — ≈26% of the staging GitHub App's hourly budget. GitHub audit-log analysis confirmed 1,397 of 1,400 status posts were all to the same :repository_id/statuses/:sha, all returning 422.

What was broken

Two issues compounded:

  1. is_max_statuses_response? matcher mismatched the real response shape. GitHub returns the validation error with errors as a string (see @real_422_body fixture):

    {
      "errors": "Validation failed: This SHA and context has reached the maximum number of statuses.",
      "message": "Validation Failed",
      "status": "422"
    }

    fetch_errors/1 expected errors to be a list of objects, so it returned [] and is_max_statuses_response? always answered false. Every max-statuses 422 was therefore treated as a fatal :precondition error and propagated upstream — where the caller (in this case github_notifier) retries with a fresh ppl_id, so the dedup cache key changes and the next pipeline run does the same 422 dance.

  2. No caller-side prevention. Even with the matcher working, every new pipeline run still hits GitHub. The 422 response costs a rate-limit unit just like a 201, so the matcher recognising it after the fact didn't save any budget.

The matcher has been broken since the very first OSS commitgit log -L:'is_max_statuses_response?':repository_hub/lib/repository_hub/clients/github_client.ex confirms it has never been modified.

What this PR does

repository_hub/:

  • Rewrites is_max_statuses_response? to gate on message == "Validation Failed" and substring-match "maximum number of statuses" against either a string or a list of error objects (errors_as_text/1).
  • Adds RepositoryHub.MaxStatusesCache — a per-pod ETS-backed GenServer keyed by {owner, repo, sha, context} with a 24h TTL. Before every create_build_status call we consult the cache; on a hit we short-circuit and return wrap(%{}) — no POST, no rate-limit pre-flight, no rate-limit unit consumed. After detecting a true max-statuses 422 we populate the cache.
  • Moves the 422 branch into its own handle_422_create_build_status/2 clause so the max-statuses matcher cannot inadvertently fire on 307/404/403 responses.
  • Adds a canary log_warn when a 422 carries message: "Validation Failed" but the canonical substring isn't found — this will surface in logs the moment GitHub rewords the error, so we can update the matcher before the rate-limit drain bug silently regresses.
  • Adds observability:
    • github_api.create_build_status.max_statuses_cache_hit Watchman counter
    • github_api.max_statuses_cache.table_missing Watchman counter + Logger.warning (the ETS table being absent is treated as a soft failure that callers should still survive, but it's now alertable instead of silent)
    • Includes repo_owner/repo_name@commit_sha ctx=… in the existing log_error lines.

Why this is safe

  • Behaviour on success (HTTP 201) is unchanged.
  • Behaviour on non-max-statuses 422 / 403 / 404 / 307 is unchanged — those still log and fail_with(:precondition, …). Only the specific "maximum number of statuses" 422 is treated as success and cached. New tests cover 403/404 with a misleading body to lock this down.
  • wrap(%{}) short-circuit shape matches the existing max-statuses success path — the sole caller (adapters/github/create_build_status_action.ex) only pattern-matches {:ok, _} and discards the payload.
  • The cache is in-process ETS — no new external dependency, no new failure modes. Each pod owns its own cache; worst case after a pod restart is one wasted request per maxed SHA+context before the cache is repopulated.
  • 24h TTL — stale entries naturally fall out; nothing pins memory for inactive SHAs forever.
  • :ets.whereis/1 is checked before every read/write — if the GenServer is down, callers degrade to "no cache" (the original behaviour, with observability) instead of crashing.

Tests

12 new tests across two files. All pre-existing github_client_test.exs (8 tests) and create_build_status_action_test.exs (4 tests) continue to pass.

New tests cover:

  • The current string-form errors shape (the real bug repro).
  • The legacy list-of-objects errors shape (defensive — module supports both).
  • A 422 with message: "Validation Failed" but an unrelated errors body is NOT cached and still surfaces as :precondition.
  • 403 and 404 in the same case-branch as 422 are NEVER put through the matcher and never populate the cache, even with a body that contains the canonical max-statuses substring.
  • After the first 422, the second call short-circuits before both Tentacat.Repositories.Statuses.create/5 and the Tentacat.get("rate_limit", _) pre-flight — the whole point of the fix is to spare every rate-limit unit.
  • Cache hit/miss for all four key components (owner, repo, sha, context).
  • Expired TTL entries are treated as misses.

Test plan

  • mix test test/repository_hub/clients/{max_statuses_cache,github_client_max_statuses,github_client}_test.exs test/repository_hub/adapters/github/create_build_status_action_test.exs — 30 tests, 0 failures.
  • mix format --check-formatted
  • mix compile --warnings-as-errors
  • (vendor CI) full repository_hub suite once this PR opens. Note: the Policy::MixAudit check flags pre-existing CVEs in cowboy, hackney, and decimal in mix.lock — unrelated to this change.

…ate limit

When a SHA+context combination reaches GitHub's 1000-statuses-per-SHA limit,
every additional `POST /repos/:owner/:repo/statuses/:sha` is rejected with
422 and the request still costs a rate-limit unit. The previous code tried
to detect this and silently succeed, but had two issues that combined to
burn ~1,400 wasted requests per 2 days on a single stuck SHA at one site:

1. `is_max_statuses_response?` matcher mismatched the real response shape.
   GitHub returns the `errors` field as a *string*:

       "errors": "Validation failed: This SHA and context has reached the
                  maximum number of statuses."

   The matcher expected a list of error objects, so `fetch_errors/1` returned
   `[]` and the check always failed. Every 422 was propagated as a fatal
   `:precondition` error to callers instead of being treated as already-done.

2. Even with the matcher working, no caller-side cache existed. Each new
   pipeline run for the same SHA still made the POST, since cache layers
   upstream (e.g. github_notifier's Cachex) key by pipeline id and so miss
   for every new run.

This change introduces a small per-pod ETS-backed cache,
`RepositoryHub.MaxStatusesCache`, keyed by `(owner, repo, sha, context)`
with a 24h TTL. Before every `create_build_status` call we consult the
cache and short-circuit if the combo is already known to be maxed. After
detecting max-statuses we populate the cache. The matcher is rewritten
to handle both the string and list error formats by substring-matching
the canonical phrase "maximum number of statuses".

Why this is safe:

- Behaviour on success (HTTP 201) is unchanged.
- Behaviour on non-max-statuses 422 / 403 / 404 / 307 is unchanged — those
  still log and `fail_with(:precondition, ...)`. Only the specific
  "maximum number of statuses" case is treated as success and cached.
- The cache is in-process ETS — no new external dependency. Each pod owns
  its own cache; worst case after a pod restart is one wasted request per
  maxed SHA+context before the cache is repopulated.
- 24h TTL means stale entries naturally fall out; nothing pins memory
  for inactive SHAs forever.
- The cache module degrades silently (returns `false` from `maxed?/4`)
  if the ETS table is absent, so tests that don't start the application
  supervisor are not broken.

Tests:

- `MaxStatusesCacheTest` covers hit / miss / distinct-key behaviour.
- `GithubClientMaxStatusesTest` mocks `Tentacat.Repositories.Statuses.create`
  to return the real GitHub 422 response shape and verifies:
  - the matcher accepts the response and `create_build_status` returns
    `{:ok, _}`;
  - a second call short-circuits and never invokes Tentacat;
  - a 422 for an unrelated reason still fails and is NOT cached.

All 22 pre-existing GithubClient + adapter tests continue to pass.
Addresses the consensus findings from review:

Production code:
- `MaxStatusesCache.maxed?/4` and `mark_maxed/4` now check
  `:ets.whereis(@table)` explicitly instead of relying on a broad
  `rescue ArgumentError`. When the table is absent (GenServer down or
  restarting), we emit `Watchman.increment` and a `Logger.warning` so
  the degraded state is alertable instead of silent.
- Gate `is_max_statuses_response?` on the actual 422 status code via a
  dedicated `handle_422_create_build_status/2` clause. Previously the
  matcher ran for 307/404/403 too, which made it theoretically possible
  to cache an unrelated response (extremely unlikely in practice, but
  the intent was muddled).
- Emit a canary `log_warn` when a 422 carries `message: "Validation
  Failed"` but the canonical max-statuses substring is not found —
  i.e. when GitHub appears to have reworded the error. This converts a
  silent regression (the matcher stops working, the rate-limit drain
  returns) into a visible log line.
- Include `repo_owner/repo_name@commit_sha ctx=...` in the `log_error`
  output so the lines are diagnosable months from now.
- Add a `Watchman` counter on cache-hit short-circuits so the
  effectiveness of the cache is measurable.

Tests:
- Recognise the legacy list-of-objects `errors` shape.
- Assert that the cache short-circuit also skips the `Tentacat.get
  ("rate_limit", _)` pre-flight, not just the create POST — that is
  the entire point of the fix.
- Assert that a 403 / 404 in the same caller branch is NEVER cached,
  even with a body that would otherwise look like max-statuses.
- Assert that expired entries are treated as cache misses.
- Cover all four key components (owner, repo, sha, context) for
  cache-key isolation.
- Replace the brittle `status: 9` magic-number assertion with a
  message-substring assertion.
- Filter `:meck.history` by `:create` (and the new rate-limit count
  helper by `:get` with `"rate_limit"`) so the counts are precise.

All 30 tests in the touched modules pass.
Replace the production-captured `confluentinc/semaphore-test` repo
identifiers and the matching SHA with generic placeholders. The fixture
was originally lifted from a real audit-log entry; only the body shape
and `errors` string matter for the regression test.
@cchristous cchristous marked this pull request as ready for review May 26, 2026 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

1 participant