test(rules,storage): pin CPM alert + recording-rule race semantics#320
Open
test(rules,storage): pin CPM alert + recording-rule race semantics#320
Conversation
Add a set of pinning tests motivated by an investigation into a stuck D8ControlPlaneManagerPodNotRunning alert in production. The tests fix the expected PromQL semantics under several stress patterns relevant to that alert and to similar cross-group recording-rule → alert dependencies in general: - pp-pkg/storage/batch_storage_test.go: TestEphemeralHead_IsSterileBetween Evaluations — each Adapter.BatchStorage() call returns a brand-new TransactionHead and never leaks series from a previous Eval iteration or to the main head before Commit. - pp-pkg/rules/alerting_stuck_test.go: AlertingRule.Eval state-machine invariants — empty result transitions Firing → Inactive in one step, fingerprinted result keeps the alert Firing, and after resolvedRetention inactive alerts are dropped from r.active. - pp-pkg/rules/control_plane_expr_test.go: the actual CPM expression against upstream promqltest.LoadedStorage — happy path, single-master pod gone, recovery, normal pod replacement. - pp-pkg/storage/adapter_promql_test.go: same scenarios against the real prompp Adapter (cppbridge head + querier) plus head-rotation, recording- rule lag, recovery, missed-iterations, and most importantly TestCPM_ OffsetCollisionRace_FixedByQueryOffset — reproduces the production race where two groups (alert + recording rule) share the same hash-derived eval offset and the alert eval queries storage before the recording-rule sample becomes visible. Verifies that adding query_offset > the ingest jitter window deterministically clears the alert. No production code is changed. These are pure regression / semantics guards. Co-authored-by: Cursor <cursoragent@cursor.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pin a set of regression / semantics tests motivated by an investigation into a stuck
D8ControlPlaneManagerPodNotRunningalert in production (promppv0.7.6). The expression for that alert is:The investigation showed:
kube_controller_podis a recording rule generated in a different group than the alert.hash(name) % intervaloffset as the alert group (≈6.3 s after every minute boundary in the affected cluster), so bothGroup.Rungoroutines start theirEvalwithin ~19 ms of each other on every tick.--query.lookback-delta=1mand both groups haveinterval: 1m. With ingest of the recording-rule sample taking >19 ms (1215 series in that group), the alert eval queries storage before the new sample becomes visible. Latest visible sample is then ~60.021 s old → out of the 60 s lookback → INNER becomes empty →unlesscancels nothing → alert fires for ALL three masters and stays firing until offsets drift apart./api/v1/query(which runs later) sees the new sample already committed and returns empty, hence the user-visible mismatch betweenALERTS{...}=firingand the API result.The PR contains no production-code changes. It only adds tests that pin down the expected behaviour:
pp-pkg/storage/batch_storage_test.go::TestEphemeralHead_IsSterileBetweenEvaluations—Adapter.BatchStorage()returns a freshTransactionHeadand never leaks series across Eval iterations or to the main head beforeCommit.pp-pkg/rules/alerting_stuck_test.go—AlertingRule.Evalstate-machine invariants:r.activeafterresolvedRetention.pp-pkg/rules/control_plane_expr_test.go— the actual CPM expression against upstreampromqltest.LoadedStorage: happy path, single-master pod gone, recovery, normal pod replacement.pp-pkg/storage/adapter_promql_test.go— same scenarios against the real promppAdapter(cppbridge head + querier), plus:TestCPM_OffsetCollisionRace_FixedByQueryOffset— reproduces the production race exactly: same offset for both groups, alert eval on the tick where the recording-rule sample is "not yet committed". The test asserts:query_offset: alert fires for all 3 masters (race reproduces),query_offset = 30s: alert is silent (race window is closed by looking 30 s back),query_offset = 1s: race already does not fire (any offset > the eval-jitter is enough).Test plan
go test -tags stringlabels ./pp-pkg/rules/... ./pp-pkg/storage/...inside the ARM devcontainer — all new tests pass onorigin/ppHEAD.Notes for the reader
Group.Eval,BatchStorage, the engine, or any production logic.global.rule_query_offset: 30s(or per-groupquery_offset) — single-line config change, deterministic;last_over_time(...[5m])in the alert expression (Deckhouse-rules side);pp-pkg/rules/Managerso that an alert group never starts its Eval before the recording-rule groups it depends on have committed (larger architectural change, separate PR).Made with Cursor