regression-canary: break PascalCase FHIR citation parsing #6
Open
tylerxia8 wants to merge 1 commit into
Deliberate regression to demonstrate that the W2 eval gate catches real breaks, per the PRD's hard-gate requirement: "graders will introduce a small regression and confirm your CI gate fails."

Change: the structural verifier's CITATION_RE was [A-Za-z_]+ (matches both lowercase OpenEMR tables and PascalCase FHIR resource types). This PR drops the uppercase class, leaving only [a-z_]+, so PascalCase resources stop parsing as citations.

Expected eval-gate effect:
- extraction_lab citations are DocumentReference#... (PascalCase) -> fail
- citation_validity_meds wants MedicationRequest#... -> fail
- evidence wants Guideline#... -> fail
- golden UC-1 briefing wants Condition#, Encounter#, etc. -> fail
- All ~10pp+ drops -> well past the 5pp regression-delta -> CI fails

DO NOT MERGE. This PR is the standing demonstration that the W2 eval gate has teeth: it sits open on GitHub as concrete proof of the regression-detection property the PRD requires.
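To make the effect of the character-class change concrete, here is a minimal sketch. Only the two character classes, [A-Za-z_]+ and [a-z_]+, come from this PR. The surrounding `Name#id` pattern, the `parse_citation` helper, and the sample IDs are assumptions for illustration, not the verifier's actual code.

```python
import re

# Hypothetical full patterns: only the character classes are taken from the PR.
# The "#<id>" tail is an assumed shape for a citation token.
BEFORE = re.compile(r"([A-Za-z_]+)#(\S+)")  # original class: both cases
AFTER = re.compile(r"([a-z_]+)#(\S+)")      # canary class: lowercase only


def parse_citation(pattern, token):
    """Return (source, id) if the whole token parses as a citation, else None."""
    m = pattern.fullmatch(token)
    return (m.group(1), m.group(2)) if m else None


# Sample tokens with made-up IDs: three PascalCase FHIR-style names
# and one lowercase table-style name.
SAMPLES = [
    "DocumentReference#doc-1",
    "MedicationRequest#rx-9",
    "Condition#c-3",
    "prescriptions#42",
]

for token in SAMPLES:
    print(token, parse_citation(BEFORE, token), parse_citation(AFTER, token))
```

Under these assumptions, every PascalCase token parses with the original class and returns None with the lowercase-only class, while the lowercase table-style token parses either way. That is the break the eval cases listed above are expected to surface.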
tylerxia8 added a commit that referenced this pull request on May 8, 2026
- .env.example: add VOYAGE_API_KEY, COHERE_API_KEY, REDIS_URL.
Used by the hybrid-RAG layer + per-patient context cache;
previously only documented in agent-service/README.md and
W2_ARCHITECTURE.md, not in the root .env.example a setup
reader sees first.
- .gitignore: ignore .oauth-creds-*.txt files generated by
the documented OAuth recovery procedure.
- AUDIT.md §1.5: drop the stale "ARCHITECTURE §11 (Sunday)
tracks completion" line; document the active volume
mitigation (mounted at sites/default/documents/, seeded
from /opt/openemr-documents-template/ on first boot) and
add a 4-step recovery procedure for the key-mismatch
failure mode encountered 2026-05-08 — re-register OAuth
client, enable, swap env vars, verify.
- .github/workflows/eval-gate.yml: pre-flight ping
${AGENT_URL}/healthz before the 19-min eval run so a
Railway hiccup fails fast with "Staging agent unreachable"
instead of looking like a real regression catch. Also add
a baseline-drift guard that emits a ::warning:: annotation
whenever a PR modifies baseline.json, so silent re-locks
against a regressed agent are surfaced for reviewer ack.
- agent-service/evals/w2/baseline.json: self-documenting
_meta block: purpose, lock timestamp, and re-lock checklist
referencing the regression-canary PR #6 and the unit tests
in tests/test_eval_runner.py. The runner only reads
category_rates/rubric_rates, so the new key is ignored;
gate logic unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
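The gate behaviour described in this commit message and in the PR description (a 5pp regression delta, with the runner reading only category_rates and rubric_rates from baseline.json) could be sketched roughly as follows. This is not the code in the eval runner. The `find_regressions` function, the storage of rates as fractions between 0 and 1, the strict "greater than" comparison, and the sample numbers are all assumptions.

```python
REGRESSION_DELTA_PP = 5.0  # threshold named in the PR description

# Only these two keys are read; anything else (e.g. "_meta") is ignored.
RATE_SECTIONS = ("category_rates", "rubric_rates")


def find_regressions(baseline, current, delta_pp=REGRESSION_DELTA_PP):
    """List (section, name, drop_pp) for every rate that fell by more than delta_pp.

    Assumes rates are fractions in [0, 1]; a rate missing from `current`
    is treated as 0.0, which counts as a drop.
    """
    failures = []
    for section in RATE_SECTIONS:
        for name, base_rate in baseline.get(section, {}).items():
            cur_rate = current.get(section, {}).get(name, 0.0)
            drop_pp = (base_rate - cur_rate) * 100
            if drop_pp > delta_pp:
                failures.append((section, name, round(drop_pp, 1)))
    return failures


# Made-up numbers for illustration only.
baseline = {
    "_meta": {"purpose": "locked baseline"},  # ignored by the comparison
    "category_rates": {"extraction_lab": 1.0, "evidence": 0.9},
    "rubric_rates": {"citation_validity_meds": 1.0},
}
current = {
    "category_rates": {"extraction_lab": 0.8, "evidence": 0.88},
    "rubric_rates": {"citation_validity_meds": 0.7},
}

failures = find_regressions(baseline, current)
for section, name, drop in failures:
    print(f"REGRESSION {section}/{name}: -{drop}pp")
```

In a CI step, a non-empty result would translate into a non-zero exit code. Because the comparison walks only the two rate sections, adding a `_meta` key to the baseline leaves the outcome unchanged, which matches the "gate logic unchanged" note above.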