Skip to content

test(rules,storage): pin CPM alert + recording-rule race semantics#320

Open
vporoshok wants to merge 1 commit intoppfrom
tests/cpm-recording-rule-race
Open

test(rules,storage): pin CPM alert + recording-rule race semantics#320
vporoshok wants to merge 1 commit intoppfrom
tests/cpm-recording-rule-race

Conversation

@vporoshok
Copy link
Copy Markdown
Collaborator

Summary

Pin a set of regression / semantics tests motivated by an investigation into a stuck D8ControlPlaneManagerPodNotRunning alert in production (prompp v0.7.6). The expression for that alert is:

max by (node) (
  kube_node_role{role="master"}
  unless
  kube_node_role{role="master"} * on (node) group_left ()
    ((kube_pod_status_ready{condition="true"} == 1) * on (pod, namespace) group_right ()
      kube_controller_pod{controller_name="d8-control-plane-manager",
                          controller_type="DaemonSet", namespace="kube-system"})
)

The investigation showed:

  • kube_controller_pod is a recording rule generated in a different group than the alert.
  • That group has the same hash(name) % interval offset as the alert group (≈6.3 s after every minute boundary in the affected cluster), so both Group.Run goroutines start their Eval within ~19 ms of each other on every tick.
  • prompp's PromQL engine is configured with --query.lookback-delta=1m and both groups have interval: 1m. With ingest of the recording-rule sample taking >19 ms (1215 series in that group), the alert eval queries storage before the new sample becomes visible. Latest visible sample is then ~60.021 s old → out of the 60 s lookback → INNER becomes empty → unless cancels nothing → alert fires for ALL three masters and stays firing until offsets drift apart.
  • The same expression queried via /api/v1/query (which runs later) sees the new sample already committed and returns empty, hence the user-visible mismatch between ALERTS{...}=firing and the API result.

The PR contains no production-code changes. It only adds tests that pin down the expected behaviour:

  • pp-pkg/storage/batch_storage_test.go::TestEphemeralHead_IsSterileBetweenEvaluationsAdapter.BatchStorage() returns a fresh TransactionHead and never leaks series across Eval iterations or to the main head before Commit.

  • pp-pkg/rules/alerting_stuck_test.goAlertingRule.Eval state-machine invariants:

    • empty result moves Firing → Inactive in a single iteration,
    • matching result keeps Firing,
    • inactive alerts are dropped from r.active after resolvedRetention.
  • pp-pkg/rules/control_plane_expr_test.go — the actual CPM expression against upstream promqltest.LoadedStorage: happy path, single-master pod gone, recovery, normal pod replacement.

  • pp-pkg/storage/adapter_promql_test.go — same scenarios against the real prompp Adapter (cppbridge head + querier), plus:

    • head-rotation pod-replacement,
    • recording-rule lag,
    • recording-rule recovery (no sticky storage state),
    • missed-iterations every 2nd cycle,
    • TestCPM_OffsetCollisionRace_FixedByQueryOffset — reproduces the production race exactly: same offset for both groups, alert eval on the tick where the recording-rule sample is "not yet committed". The test asserts:
      • without query_offset: alert fires for all 3 masters (race reproduces),
      • with query_offset = 30s: alert is silent (race window is closed by looking 30 s back),
      • with query_offset = 1s: race already does not fire (any offset > the eval-jitter is enough).

Test plan

  • go test -tags stringlabels ./pp-pkg/rules/... ./pp-pkg/storage/... inside the ARM devcontainer — all new tests pass on origin/pp HEAD.
  • CI green.

Notes for the reader

  • These tests are pure pinning. They do not change Group.Eval, BatchStorage, the engine, or any production logic.
  • The actual prod fix is one of:
    1. set global.rule_query_offset: 30s (or per-group query_offset) — single-line config change, deterministic;
    2. wrap the recording-rule reference in last_over_time(...[5m]) in the alert expression (Deckhouse-rules side);
    3. introduce inter-group dependency-aware scheduling in pp-pkg/rules/Manager so that an alert group never starts its Eval before the recording-rule groups it depends on have committed (larger architectural change, separate PR).

Made with Cursor

Add a set of pinning tests motivated by an investigation into a stuck
D8ControlPlaneManagerPodNotRunning alert in production. The tests fix the
expected PromQL semantics under several stress patterns relevant to that
alert and to similar cross-group recording-rule → alert dependencies in
general:

- pp-pkg/storage/batch_storage_test.go: TestEphemeralHead_IsSterileBetween
  Evaluations — each Adapter.BatchStorage() call returns a brand-new
  TransactionHead and never leaks series from a previous Eval iteration or
  to the main head before Commit.

- pp-pkg/rules/alerting_stuck_test.go: AlertingRule.Eval state-machine
  invariants — empty result transitions Firing → Inactive in one step,
  fingerprinted result keeps the alert Firing, and after resolvedRetention
  inactive alerts are dropped from r.active.

- pp-pkg/rules/control_plane_expr_test.go: the actual CPM expression
  against upstream promqltest.LoadedStorage — happy path, single-master
  pod gone, recovery, normal pod replacement.

- pp-pkg/storage/adapter_promql_test.go: same scenarios against the real
  prompp Adapter (cppbridge head + querier) plus head-rotation, recording-
  rule lag, recovery, missed-iterations, and most importantly TestCPM_
  OffsetCollisionRace_FixedByQueryOffset — reproduces the production race
  where two groups (alert + recording rule) share the same hash-derived
  eval offset and the alert eval queries storage before the recording-rule
  sample becomes visible. Verifies that adding query_offset > the ingest
  jitter window deterministically clears the alert.

No production code is changed. These are pure regression / semantics
guards.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant