Commit bcc3edd

ychan and claude committed
test(auto-eval): determinism gate also asserts on ef_cqs_strict
Address review comment #2 (Important) on PR #764 Sub-project A.

The original determinism gate only tracked lenient ef_cqs deltas between back-to-back runs. Since Sub-project A's whole purpose is to make ef_cqs_strict the future decision gate, the determinism gate should cover it now, not after the cut-over PR, when hidden nondeterminism in the strict denominator path would surface for the first time.

Changes:
- Track both lenient and strict deltas per ticker in parallel.
- Compute max_delta as max(max_lenient, max_strict) and assert against the single DETERMINISM_THRESHOLD (both must be bit-identical).
- Log both columns separately so the CI run captures which field (if either) drifted, making root-causing faster.
- Failure message reports both maxes and per-ticker pairs.

Lenient and strict share an ef_pass_count numerator, so under current determinism they should co-move exactly. But pinning both now catches any future FP-reduction or iteration-order bug that affects only the strict path's wider denominator (total - disputed - unverified).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e977da4 commit bcc3edd
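
The "FP-reduction or iteration-order bug" named in the commit message can be illustrated in isolation. The sketch below is not code from this repository: the weights and the `naive_sum` helper are hypothetical, and nothing here reflects how `compute_cqs` actually builds its denominators. It only shows that accumulating the same floats in a different order can change the result, which is the kind of drift the gate is meant to catch. A plain accumulation loop is used instead of the built-in `sum`, whose float behaviour is not the same across Python versions.

```python
import math


def naive_sum(values):
    """Left-to-right float accumulation; the result depends on operand order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical per-fact weights; not taken from compute_cqs.
weights = [0.1, 0.2, 0.3]

# Same values, two iteration orders (e.g. what an unordered container might yield).
forward = naive_sum(weights)
backward = naive_sum(reversed(weights))
order_sensitive_delta = abs(forward - backward)

# math.fsum returns a correctly rounded sum, so operand order does not matter.
stable_delta = abs(math.fsum(weights) - math.fsum(reversed(weights)))

print(f"naive delta: {order_sensitive_delta:.20f}")
print(f"fsum delta:  {stable_delta:.20f}")
```

If a weighted reduction like this fed only one of the two denominators, only that field would drift between runs, which is why asserting on both deltas is useful.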

1 file changed: +35 -11 lines changed

tests/xbrl/standardization/test_determinism.py

Lines changed: 35 additions & 11 deletions
@@ -38,6 +38,14 @@ def test_extraction_is_deterministic():
     Ground-truth assertion: every ticker in DETERMINISM_TEST_COHORT must have a
     per-company EF-CQS delta below DETERMINISM_THRESHOLD across two back-to-back
     runs of compute_cqs with snapshot_mode=True (no network, no clock dependence).
+
+    Asserts on BOTH the lenient ``ef_cqs`` (current decision gate) and the
+    observation-only ``ef_cqs_strict`` (future decision gate after the Sub-
+    project A cut-over). Lenient and strict share an ef_pass_count numerator
+    and should co-move, but pinning both pre-cutover guarantees the strict
+    field is also bit-identical — otherwise the first strict-mode run after
+    the gate flip could surprise us with nondeterminism that was hidden
+    behind the laundering denominator.
     """
     result_a = compute_cqs(eval_cohort=DETERMINISM_TEST_COHORT, snapshot_mode=True)
     result_b = compute_cqs(eval_cohort=DETERMINISM_TEST_COHORT, snapshot_mode=True)
@@ -49,28 +57,44 @@ def test_extraction_is_deterministic():
         f"Run B: {sorted(result_b.company_scores.keys())}."
     )

+    # Track lenient and strict deltas in parallel. Either field going
+    # non-deterministic fails the gate.
     deltas = []
     for ticker in DETERMINISM_TEST_COHORT:
-        a = result_a.company_scores[ticker].ef_cqs
-        b = result_b.company_scores[ticker].ef_cqs
-        deltas.append((ticker, abs(a - b), a, b))
+        ca = result_a.company_scores[ticker]
+        cb = result_b.company_scores[ticker]
+        lenient_delta = abs(ca.ef_cqs - cb.ef_cqs)
+        strict_delta = abs(ca.ef_cqs_strict - cb.ef_cqs_strict)
+        deltas.append(
+            (ticker, lenient_delta, strict_delta, ca.ef_cqs, cb.ef_cqs,
+             ca.ef_cqs_strict, cb.ef_cqs_strict)
+        )

-    max_delta = max(d[1] for d in deltas)
+    max_lenient_delta = max(d[1] for d in deltas)
+    max_strict_delta = max(d[2] for d in deltas)
+    max_delta = max(max_lenient_delta, max_strict_delta)

     # Print observed deltas so the noise floor is captured in CI logs.
     # This output is what informs the DETERMINISM_THRESHOLD constant.
-    print(f"\n[determinism] max per-company EF-CQS delta: {max_delta:.10f}")
+    print(f"\n[determinism] max per-company ef_cqs delta: {max_lenient_delta:.10f}")
+    print(f"[determinism] max per-company ef_cqs_strict delta: {max_strict_delta:.10f}")
     print(f"[determinism] threshold: {DETERMINISM_THRESHOLD:.10f}")
-    for ticker, delta, a, b in sorted(deltas, key=lambda d: -d[1]):
-        marker = " <-- max" if delta == max_delta else ""
-        print(f"[determinism] {ticker:<6} delta={delta:.10f} a={a:.6f} b={b:.6f}{marker}")
+    for ticker, ld, sd, la, lb, sa, sb in sorted(deltas, key=lambda d: -(max(d[1], d[2]))):
+        marker = " <-- max" if max(ld, sd) == max_delta else ""
+        print(
+            f"[determinism] {ticker:<6} "
+            f"ef_cqs Δ={ld:.10f} (a={la:.6f} b={lb:.6f}) "
+            f"strict Δ={sd:.10f} (a={sa:.6f} b={sb:.6f}){marker}"
+        )

     assert max_delta < DETERMINISM_THRESHOLD, (
-        f"Determinism check failed: max per-company EF-CQS delta {max_delta:.6f} "
-        f"exceeds threshold {DETERMINISM_THRESHOLD:.6f}. "
+        f"Determinism check failed: max per-company delta {max_delta:.6f} "
+        f"exceeds threshold {DETERMINISM_THRESHOLD:.6f} "
+        f"(lenient max={max_lenient_delta:.6f}, strict max={max_strict_delta:.6f}). "
         f"Back-to-back runs with identical config produced different scores. "
         f"Either fix the determinism bug (FactQuery ordering, dict iteration, "
         f"FP reduction order) OR set EDGAR_DETERMINISM_DEGRADED=1 to widen the "
         f"chokepoint decision threshold from 0.005 to 0.01. "
-        f"Per-ticker deltas: {[(t, round(d, 10)) for t, d, _, _ in deltas]}"
+        f"Per-ticker deltas (ticker, lenient, strict): "
+        f"{[(t, round(ld, 10), round(sd, 10)) for t, ld, sd, *_ in deltas]}"
     )
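
The dual-field gate in the diff above can be tried without the real pipeline. The following is a minimal sketch under stated assumptions, not the repository's code: `Score`, `Delta`, `collect_deltas`, the tickers, the score values, and the `THRESHOLD` value are all hypothetical stand-ins (the actual `DETERMINISM_THRESHOLD` is not shown in this commit). It uses a `NamedTuple` in place of positional tuple indices such as `d[1]` and `d[2]`, which is one possible way to make the per-ticker rows easier to read.

```python
from dataclasses import dataclass
from typing import NamedTuple

# Hypothetical value; the real DETERMINISM_THRESHOLD is not shown in the diff.
THRESHOLD = 1e-9


@dataclass(frozen=True)
class Score:
    """Stand-in for one entry of result.company_scores."""
    ef_cqs: float
    ef_cqs_strict: float


class Delta(NamedTuple):
    """Per-ticker absolute differences between two runs."""
    ticker: str
    lenient: float
    strict: float

    @property
    def worst(self) -> float:
        return max(self.lenient, self.strict)


def collect_deltas(scores_a, scores_b, cohort):
    """Compare two runs field by field for every ticker in the cohort."""
    return [
        Delta(
            ticker=t,
            lenient=abs(scores_a[t].ef_cqs - scores_b[t].ef_cqs),
            strict=abs(scores_a[t].ef_cqs_strict - scores_b[t].ef_cqs_strict),
        )
        for t in cohort
    ]


# Made-up scores: only the strict field drifts, and only for "BBB" in run B.
run_a = {"AAA": Score(0.90, 0.80), "BBB": Score(0.75, 0.60)}
run_b = {"AAA": Score(0.90, 0.80), "BBB": Score(0.75, 0.60 + 1e-6)}

deltas = collect_deltas(run_a, run_b, ["AAA", "BBB"])
max_lenient = max(d.lenient for d in deltas)
max_strict = max(d.strict for d in deltas)
max_delta = max(max_lenient, max_strict)
gate_passes = max_delta < THRESHOLD

# Worst offenders first, mirroring the sort order used in the test's log output.
for d in sorted(deltas, key=lambda d: -d.worst):
    print(f"{d.ticker:<6} ef_cqs Δ={d.lenient:.10f} strict Δ={d.strict:.10f}")
print(f"gate passes: {gate_passes}")
```

In this made-up scenario a lenient-only gate would pass, since `max_lenient` is zero, while the combined gate fails on the strict drift. That is the case the commit is guarding against.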
