fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040
Open
cchristous wants to merge 3 commits into
Open
fix(repository_hub): cache GitHub max-statuses 422 to stop rate-limit drain#1040cchristous wants to merge 3 commits into
cchristous wants to merge 3 commits into
Conversation
…ate limit
When a SHA+context combination reaches GitHub's 1000-statuses-per-SHA limit,
every additional `POST /repos/:owner/:repo/statuses/:sha` is rejected with
422 and the request still costs a rate-limit unit. The previous code tried
to detect this and silently succeed, but had two issues that combined to
burn ~1,400 wasted requests per 2 days on a single stuck SHA at one site:
1. `is_max_statuses_response?` matcher mismatched the real response shape.
GitHub returns the `errors` field as a *string*:
"errors": "Validation failed: This SHA and context has reached the
maximum number of statuses."
The matcher expected a list of error objects, so `fetch_errors/1` returned
`[]` and the check always failed. Every 422 was propagated as a fatal
`:precondition` error to callers instead of being treated as already-done.
2. Even with the matcher working, no caller-side cache existed. Each new
pipeline run for the same SHA still made the POST, since cache layers
upstream (e.g. github_notifier's Cachex) key by pipeline id and so miss
for every new run.
This change introduces a small per-pod ETS-backed cache,
`RepositoryHub.MaxStatusesCache`, keyed by `(owner, repo, sha, context)`
with a 24h TTL. Before every `create_build_status` call we consult the
cache and short-circuit if the combo is already known to be maxed. After
detecting max-statuses we populate the cache. The matcher is rewritten
to handle both the string and list error formats by substring-matching
the canonical phrase "maximum number of statuses".
Why this is safe:
- Behaviour on success (HTTP 201) is unchanged.
- Behaviour on non-max-statuses 422 / 403 / 404 / 307 is unchanged — those
still log and `fail_with(:precondition, ...)`. Only the specific
"maximum number of statuses" case is treated as success and cached.
- The cache is in-process ETS — no new external dependency. Each pod owns
its own cache; worst case after a pod restart is one wasted request per
maxed SHA+context before the cache is repopulated.
- 24h TTL means stale entries naturally fall out; nothing pins memory
for inactive SHAs forever.
- The cache module degrades silently (returns `false` from `maxed?/4`)
if the ETS table is absent, so tests that don't start the application
supervisor are not broken.
Tests:
- `MaxStatusesCacheTest` covers hit / miss / distinct-key behaviour.
- `GithubClientMaxStatusesTest` mocks `Tentacat.Repositories.Statuses.create`
to return the real GitHub 422 response shape and verifies:
- the matcher accepts the response and `create_build_status` returns
`{:ok, _}`;
- a second call short-circuits and never invokes Tentacat;
- a 422 for an unrelated reason still fails and is NOT cached.
All 22 pre-existing GithubClient + adapter tests continue to pass.
Addresses the consensus findings from review: Production code: - `MaxStatusesCache.maxed?/4` and `mark_maxed/4` now check `:ets.whereis(@table)` explicitly instead of relying on a broad `rescue ArgumentError`. When the table is absent (GenServer down or restarting), we emit `Watchman.increment` and a `Logger.warning` so the degraded state is alertable instead of silent. - Gate `is_max_statuses_response?` on the actual 422 status code via a dedicated `handle_422_create_build_status/2` clause. Previously the matcher ran for 307/404/403 too, which made it theoretically possible to cache an unrelated response (extremely unlikely in practice, but the intent was muddled). - Emit a canary `log_warn` when a 422 carries `message: "Validation Failed"` but the canonical max-statuses substring is not found — i.e. when GitHub appears to have reworded the error. This converts a silent regression (the matcher stops working, the rate-limit drain returns) into a visible log line. - Include `repo_owner/repo_name@commit_sha ctx=...` in the `log_error` output so the lines are diagnosable months from now. - Add a `Watchman` counter on cache-hit short-circuits so the effectiveness of the cache is measurable. Tests: - Recognise the legacy list-of-objects `errors` shape. - Assert that the cache short-circuit also skips the `Tentacat.get ("rate_limit", _)` pre-flight, not just the create POST — that is the entire point of the fix. - Assert that a 403 / 404 in the same caller branch is NEVER cached, even with a body that would otherwise look like max-statuses. - Assert that expired entries are treated as cache misses. - Cover all four key components (owner, repo, sha, context) for cache-key isolation. - Replace the brittle `status: 9` magic-number assertion with a message-substring assertion. - Filter `:meck.history` by `:create` (and the new rate-limit count helper by `:get` with `"rate_limit"`) so the counts are precise. All 30 tests in the touched modules pass.
Replace the production-captured `confluentinc/semaphore-test` repo identifiers and the matching SHA with generic placeholders. The fixture was originally lifted from a real audit-log entry; only the body shape and `errors` string matter for the regression test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a long-standing GitHub-API rate-limit drain caused by
POST /repos/:owner/:repo/statuses/:sharetrying forever against SHA+context combinations that have already reached GitHub's documented 1000-statuses-per-SHA limit.In one self-hosted deployment we observed this single bug burning ~1,400 requests / 2 days on a single stuck SHA — ≈26% of the staging GitHub App's hourly budget. GitHub audit-log analysis confirmed 1,397 of 1,400 status posts were all to the same
:repository_id/statuses/:sha, all returning 422.What was broken
Two issues compounded:
is_max_statuses_response?matcher mismatched the real response shape. GitHub returns the validation error witherrorsas a string (see@real_422_bodyfixture):{ "errors": "Validation failed: This SHA and context has reached the maximum number of statuses.", "message": "Validation Failed", "status": "422" }fetch_errors/1expectederrorsto be a list of objects, so it returned[]andis_max_statuses_response?always answeredfalse. Every max-statuses 422 was therefore treated as a fatal:preconditionerror and propagated upstream — where the caller (in this casegithub_notifier) retries with a freshppl_id, so the dedup cache key changes and the next pipeline run does the same 422 dance.No caller-side prevention. Even with the matcher working, every new pipeline run still hits GitHub. The 422 response costs a rate-limit unit just like a 201, so the matcher recognising it after the fact didn't save any budget.
The matcher has been broken since the very first OSS commit —
git log -L:'is_max_statuses_response?':repository_hub/lib/repository_hub/clients/github_client.exconfirms it has never been modified.What this PR does
repository_hub/:is_max_statuses_response?to gate onmessage == "Validation Failed"and substring-match"maximum number of statuses"against either a string or a list of error objects (errors_as_text/1).RepositoryHub.MaxStatusesCache— a per-pod ETS-backed GenServer keyed by{owner, repo, sha, context}with a 24h TTL. Before everycreate_build_statuscall we consult the cache; on a hit we short-circuit and returnwrap(%{})— no POST, no rate-limit pre-flight, no rate-limit unit consumed. After detecting a true max-statuses 422 we populate the cache.handle_422_create_build_status/2clause so the max-statuses matcher cannot inadvertently fire on 307/404/403 responses.log_warnwhen a 422 carriesmessage: "Validation Failed"but the canonical substring isn't found — this will surface in logs the moment GitHub rewords the error, so we can update the matcher before the rate-limit drain bug silently regresses.github_api.create_build_status.max_statuses_cache_hitWatchman countergithub_api.max_statuses_cache.table_missingWatchman counter +Logger.warning(the ETS table being absent is treated as a soft failure that callers should still survive, but it's now alertable instead of silent)repo_owner/repo_name@commit_sha ctx=…in the existinglog_errorlines.Why this is safe
fail_with(:precondition, …). Only the specific "maximum number of statuses" 422 is treated as success and cached. New tests cover 403/404 with a misleading body to lock this down.wrap(%{})short-circuit shape matches the existing max-statuses success path — the sole caller (adapters/github/create_build_status_action.ex) only pattern-matches{:ok, _}and discards the payload.:ets.whereis/1is checked before every read/write — if the GenServer is down, callers degrade to "no cache" (the original behaviour, with observability) instead of crashing.Tests
12 new tests across two files. All pre-existing
github_client_test.exs(8 tests) andcreate_build_status_action_test.exs(4 tests) continue to pass.New tests cover:
errorsshape (the real bug repro).errorsshape (defensive — module supports both).message: "Validation Failed"but an unrelatederrorsbody is NOT cached and still surfaces as:precondition.Tentacat.Repositories.Statuses.create/5and theTentacat.get("rate_limit", _)pre-flight — the whole point of the fix is to spare every rate-limit unit.Test plan
mix test test/repository_hub/clients/{max_statuses_cache,github_client_max_statuses,github_client}_test.exs test/repository_hub/adapters/github/create_build_status_action_test.exs— 30 tests, 0 failures.mix format --check-formattedmix compile --warnings-as-errorsPolicy::MixAuditcheck flags pre-existing CVEs incowboy,hackney, anddecimalinmix.lock— unrelated to this change.