Skip to content

fix(checkup): bounded PR-head freshness lease for pdd checkup --pr (#1116)#1134

Merged
gltanaka merged 10 commits into
mainfrom
fix/issue-1116-pr-head-rerun
May 22, 2026
Merged

fix(checkup): bounded PR-head freshness lease for pdd checkup --pr (#1116)#1134
gltanaka merged 10 commits into
mainfrom
fix/issue-1116-pr-head-rerun

Conversation

@Serhan-Asad
Copy link
Copy Markdown
Collaborator

Summary

Closes #1116. pdd checkup --pr now auto-reruns up to twice when the remote PR head advances mid-checkup, instead of failing closed and forcing operator manual rerun.

  • What changed: rename-and-wrap pattern around run_agentic_checkup_orchestrator adds three freshness checkpoints (A after Step 7 gate, B on rebase-related push failures, C after no-change push outcomes). Each checkpoint refetches PR metadata and restarts the inner orchestrator from scratch when the head moved, capped at MAX_PR_HEAD_REFRESHES = 2. The restart counter persists in a sidecar file outside workflow state so a Ctrl-C + manual resume cannot bypass the bound.
  • What did NOT change: _commit_and_push_if_changed return shape (still (bool, str)), pdd/checkup_review_loop.py, pdd/agentic_e2e_fix_orchestrator.py (_run_final_checkup_on_pr stays fail-closed as the outer safety net), README, CHANGELOG. No force-push behavior added.
  • --no-fix mode stays fail-closed via the existing _finalize SHA-mismatch downgrade; it does NOT consume restart budget.

Why this design (vs PR #1121)

The pre-existing PR #1121 wraps the entire orchestrator in while True:, changes the push helper's return-tuple shape, and touches checkup_review_loop / README / CHANGELOG (2010+/1732- LOC). This PR uses a minimal rename-and-wrap (~1265 lines including 863 lines of tests):

  1. The 1168-line inner body gets a one-line def rename to _run_agentic_checkup_orchestrator_inner.
  2. A new ~130-line public wrapper holds the bounded restart loop and catches an internal _PRHeadAdvancedRestart exception.
  3. Two checkpoints inside the inner raise on stale-head detection.
  4. A third checkpoint guards the no-change-push window between Checkpoint A and the canonical report.

Codex (gpt-5.1-codex) reviewed 4 rounds and signed off (VERDICT: ship).

Commits (4)

Test plan

  • 12 new tests in TestPrHeadAdvanceAutoRerun covering all three checkpoints, budget exhaustion, sidecar persistence, sidecar resume-from-disk, --no-fix no-consume, auth-failure no-trigger, empty-SHA fail-degrade.
  • LC_ALL=C pytest tests/test_agentic_checkup_orchestrator.py tests/test_checkup_pr_mode.py -m "not integration and not e2e and not real" --timeout=60 → 180/180 passing.
  • CI greens on the full suite.
  • Manual smoke: drive a pdd checkup --pr run while an external commit lands on the PR branch mid-run; verify the rerun fires once and a clean second pass completes.

Notes

  • Supersedes PR Make pdd checkup --pr rerun when PR head advances mid-checkup #1121 (over-scoped — leave it for the owner to close or transplant any useful test coverage).
  • Prompt source pdd/prompts/agentic_checkup_orchestrator_python.prompt updated for parity so a future pdd sync cannot regenerate without the fix.
  • Sidecar path is .pdd/checkup-pr-{pr_number}/pr_head_refreshes — outside .pdd/checkup-state so clear_workflow_state never deletes the counter.

🤖 Generated with Claude Code

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Step 7/8: Verification & Final Report (Iteration 1)

Test Results After Fixes

  • Total: 460 tests (checkup-scoped) + 19 (focused new/modified)
  • Passed: 460 (+ 1 skipped) / 19
  • Failed: 0
  • Previously failing, now passing: 0 (no failures in prior steps)
  • New failures: 0

Build Status

Pass — python -m compileall pdd succeeds; package imports cleanly.

Overall Status

All clear — Steps 6.1 (artifact removal), 6.2 (artifact-hygiene regression), 6.3 (freshness-lease integration tests) and the test_agentic_crash.py chdir fix all verified. No stray new_file.txt at repo root.

Issue Alignment

issue_aligned: true — The PR's commit history (feat(checkup): bounded PR-head freshness lease, codex round-1/round-2 fixes, prompt-source sync) substantively implements the bounded PR-head freshness lease requested by #1116; the checkup-iteration changes here (artifact hygiene + cross-module integration tests pinning the 2-tuple contract, sidecar isolation from clear_workflow_state, and the exhaustion-message SHA chain) reinforce the same mechanism without altering its contract.

Issues Summary

Severity Category Module Description Fixed
low artifact repo root Stray empty new_file.txt left by an earlier checkup run Yes
low test_failure tests/test_agentic_crash.py test_file_system_changes_detected could write new_file.txt into the real repo root Yes (chdir(tmp_path) in test)
low test_failure tests No regression test asserted "no stray top-level artifacts" invariant Yes (test_checkup_artifact_hygiene.py)
low interface_mismatch tests No integration coverage of the orchestrator↔review-loop 2-tuple contract and sidecar isolation Yes (test_checkup_freshness_lease_integration.py, 4 tests)

Checkup complete.

All Issues Fixed

{
  "success": true,
  "message": "PR #1134 substantively implements the bounded PR-head freshness lease for `pdd checkup --pr` requested by #1116 (capture-then-recheck PR head SHA, bounded MAX_PR_HEAD_REFRESHES reruns with sidecar counter, rebase-conflict classification, no force-push). Step 7 verification: build passes; 460 checkup-scoped tests pass (1 skipped); 19 focused tests (4 new freshness-lease integration tests + 2 new artifact-hygiene tests + 13 existing/modified agentic_crash tests) pass. Iteration-1 fixes removed a stray `new_file.txt` artifact at repo root, fixed `test_file_system_changes_detected` to chdir into tmp_path so it can't recreate that artifact, and added regression coverage pinning both the no-stray-artifact invariant and the orchestrator↔review-loop 2-tuple contract / sidecar isolation / exhaustion-SHA-chain behavior.",
  "tech_stack": ["python", "pytest", "click", "pydantic"],
  "issue_aligned": true,
  "issues": [
    {
      "module": "repo_root",
      "file": "new_file.txt",
      "severity": "low",
      "category": "artifact",
      "description": "Stray empty new_file.txt left at repo root by an earlier checkup run on a different issue.",
      "fixed": true,
      "fix_description": "Removed the untracked empty file in Step 6.1."
    },
    {
      "module": "tests",
      "file": "tests/test_agentic_crash.py",
      "severity": "low",
      "category": "test_failure",
      "description": "test_file_system_changes_detected used Path.cwd() (real repo root) for the mocked agent side effect that writes new_file.txt, which could leave a stray artifact at the repo root if the test was run from the repo.",
      "fixed": true,
      "fix_description": "Added monkeypatch.chdir(tmp_path) so the mocked side effect writes into the test sandbox; documented the rationale referencing issue #1116."
    },
    {
      "module": "tests",
      "file": "tests/test_checkup_artifact_hygiene.py",
      "severity": "low",
      "category": "test_failure",
      "description": "No regression test pinned the 'no stray new_file.txt / empty top-level .txt artifacts at repo root' invariant.",
      "fixed": true,
      "fix_description": "Added 2 parametrized assertions guarding against stray artifacts at the repo root."
    },
    {
      "module": "tests",
      "file": "tests/test_checkup_freshness_lease_integration.py",
      "severity": "low",
      "category": "interface_mismatch",
      "description": "No cross-module integration coverage of the orchestrator↔review-loop 2-tuple contract for _commit_and_push_if_changed, sidecar refresh-counter isolation from clear_workflow_state, or the budget-exhaustion message's observed-SHA chain.",
      "fixed": true,
      "fix_description": "Added 4 integration tests covering the 2-tuple contract, helper-wrapper equivalence, sidecar isolation from workflow-state-dir clearing, and the SHA-chain content of the exhaustion error message."
    }
  ],
  "changed_files": [
    "tests/test_agentic_crash.py",
    "tests/test_checkup_artifact_hygiene.py",
    "tests/test_checkup_freshness_lease_integration.py"
  ]
}

PR Push Status

Pushed fixes to PR branch.

Serhan-Asad and others added 8 commits May 22, 2026 10:59
Issue #1116. `pdd checkup --pr` previously fail-closed when the remote
PR head advanced during a run and the local fix commit couldn't be
cleanly rebased onto the new head. The operator had to rerun manually.

Now bounded auto-rerun: when the head advances mid-checkup, discard the
stale work and restart the orchestrator against the new head, capped at
MAX_PR_HEAD_REFRESHES=2. Past that budget the wrapper returns a clear
failure listing observed SHA transitions.

Design — rename-and-wrap (minimal blast radius):

- run_agentic_checkup_orchestrator becomes a small public wrapper.
  Non-PR mode is pass-through; PR mode loops around the inner.
- The renamed _run_agentic_checkup_orchestrator_inner gets two
  checkpoints that raise _PRHeadAdvancedRestart on stale-head detection:
  A) after Step 7's gate passes (fix mode), before the push block.
  B) inside `if not push_ok:`, only when the push message indicates a
     rebase failure AND a fresh refetch confirms the head moved (auth/
     network errors never trigger a rerun).
- Counter persists in `.pdd/checkup-pr-{N}/pr_head_refreshes` so a
  Ctrl-C + manual resume can't bypass the bound. Cleared on success.
- --no-fix stays fail-closed (existing _finalize SHA-mismatch logic).
- _commit_and_push_if_changed return shape stays (bool, str).
- _run_final_checkup_on_pr in agentic_e2e_fix_orchestrator.py stays
  fail-closed as the outer safety net.

Tests: 7 new tests under TestPrHeadAdvanceAutoRerun covering both
checkpoints, budget exhaustion, --no-fix no-consume, auth-failure
no-trigger, counter clear-on-success, and counter persistence. All
175 tests in test_agentic_checkup_orchestrator + test_checkup_pr_mode
pass with no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from codex's first review of the PR-head freshness
lease implementation.

1. Checkpoint B classifier missed the case where the push helper's
   3-attempt retry loop exhausts on remote-advance: the final return
   is `Failed to push fixes to PR branch: <git stderr>`, not the
   rebase-specific strings. Extended the classifier to also match
   that generic message when `_is_remote_advanced_push_error`
   recognizes the embedded git markers.

2. Added Checkpoint C: when `_commit_and_push_if_changed` returns
   "No changes to push." / "No eligible changes to push.", neither
   the rebase path nor `_run_post_push_reverify_if_needed` runs. The
   head could have advanced between Checkpoint A and the final
   report. Checkpoint C refetches just before the report and
   restarts if the head moved.

3. Added `test_refresh_counter_resumes_from_pre_seeded_value` which
   pre-seeds the sidecar at 1, forces one restart, and asserts
   exhaustion. Locks down that the wrapper actually loads the
   persisted counter on entry (the existing persistence test only
   verified the file is written).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ssifier

Three important findings + one nit from codex's second review of the
PR-head freshness lease implementation.

1. Outer wrapper now calls clear_workflow_state(...) explicitly when
   catching _PRHeadAdvancedRestart. Previously it relied on the inner's
   identity guard to discard stale cache via SHA mismatch; a flaky
   refetch returning the old SHA could defeat the rerun. The exception
   already proves the head advanced — don't depend on a fresh fetch to
   re-prove it.

2. Checkpoint B classifier extended to recognize "Failed to reset to
   original fixer commit before rebase: ..." from
   _rebase_onto_updated_pr_head. That message is only reachable after
   a remote-advanced push, so when the fresh-refetch SHA confirms
   advance, it should follow the same bounded restart path.

3. Three new tests:
   - Generic non-fast-forward push failure triggers restart via the
     _is_remote_advanced_push_error branch (round-1 added the classifier
     branch but no test exercised it).
   - "No changes to push" head-advance before final report restarts and
     does NOT post a stale Step 7 verdict (Checkpoint C coverage).
   - Empty initial pr_head_sha fail-degrades to current behavior; no
     restart, no budget consumption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex round-3 finding. PDD treats prompts as source of truth; the
generated module pdd/agentic_checkup_orchestrator.py now contains the
pdd/prompts/agentic_checkup_orchestrator_python.prompt described the
pre-#1116 behavior. A future pdd sync would regenerate from the stale
prompt and drop the fix.

Added a sub-bullet under Requirement 7 (PR-mode) describing:
- MAX_PR_HEAD_REFRESHES bound and the three freshness checkpoints
  (A: after Step 7 gate, B: on rebase-related push failure with
  fresh-refetch confirmation, C: after no-change push outcome).
- Rename-and-wrap structure with the internal _PRHeadAdvancedRestart
  exception caught by the outer wrapper.
- Explicit clear_workflow_state on restart so a flaky refetch cannot
  defeat the rerun via the identity guard.
- Sidecar counter at .pdd/checkup-pr-{N}/pr_head_refreshes (outside
  workflow-state dir so clear_workflow_state does not nuke it).
- Budget exhaustion diagnostic listing observed SHA transitions.
- --no-fix mode does not consume restart budget; the lease is fix-mode
  only.

No production code change; tests unchanged (180/180 still passing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
F1: --no-fix --pr mode now refetches the remote PR head after Step 7's
gate passes and fails closed (unverified verdict) if the head advanced
during the run. Previously _finalize's fail-closed downgrade only runs
under --review-loop, leaving --no-fix open to publishing a stale clean
verdict.

F2: _PRHeadAdvancedRestart now carries step_comments_set; the outer
wrapper preserves it across restart iterations via _carried_step_comments
so per-step issue comments aren't re-posted after clear_workflow_state
wipes the stored field.

F3: Added _force_skip_state_load kwarg to the inner; set True by the
outer on restart iterations so a flaky GitHub state DELETE cannot reload
stale cached step outputs on the next inner call.

Tests: 3 new test scenarios in TestPrHeadAdvanceAutoRerun.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
F1a: fail-closed when freshness refetch returns empty SHA (unknown
freshness). Previously an exception-swallowed refetch silently let
the clean Step 7 verdict through; now "could not retrieve" triggers
the same downgrade path.

F1b: posted report body now includes the downgrade reason prepended
to the Step 7 output so the PR/issue thread shows WHY the verdict
was invalidated, not just the clean output that no longer applies.

Prompt: documents --no-fix fail-closed behavior (both advance and
unknown freshness), _carried_step_comments, and _force_skip_state_load
so future pdd sync regeneration preserves all three guarantees.

Tests: 2 new tests — unknown-freshness fail-closed, report body
includes "Verdict downgraded" + SHA transition text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
External review filed three gaps on PR #1134 after rebase onto current
main. All three reproduce on the rebased branch and are closed here.

Finding #1 (rebase regression): main added test_scope and
start_step_override params to _run_agentic_checkup_orchestrator_inner
and threads them from agentic_checkup.run_agentic_checkup, but the new
public wrapper run_agentic_checkup_orchestrator was added in this PR
without those kwargs. The rebase merges cleanly but the call from
agentic_checkup.py would TypeError at runtime. Fix: add test_scope and
start_step_override to the wrapper signature; forward to both inner
call sites (non-PR pass-through and PR-mode restart loop).

Finding #2 (Checkpoint A2 gap): the registry-edit and prompt-source
guard refusal paths each post the canonical Step 7 verdict via
_post_pr_mode_final_report before returning. Between Checkpoint A's
refetch (line 1953) and these refusal posts (lines 2298 / 2333) the
remote head can advance, leaving the Step 7 verdict belonging to the
OLD head. Fix: reuse the pr_metadata fetch already at the top of the
guards-and-push block (no extra GitHub I/O) and raise
_PRHeadAdvancedRestart on mismatch — symmetric with Checkpoint A.

Finding #3 (entry-SHA empty gap): the --no-fix post-Step-7 freshness
check at line 1996 was gated on a non-empty current_pr_head_sha. If
the entry _fetch_pr_metadata returned empty (transient gh hiccup) but
the post-run refetch succeeded with a non-empty SHA, the clean Step 7
verdict slipped through as "fresh" even though we had no baseline to
compare against. Fix: add a third elif branch — when current_pr_head_sha
is empty but nofix_fresh_sha is non-empty, fail-closed with a diagnostic
explaining the entry SHA was unavailable.

Tests: 2 new tests in TestPrHeadAdvanceAutoRerun —
test_pr_head_advance_at_guards_block_triggers_rerun (A2 restart with
exact metadata-call accounting) and test_no_fix_empty_entry_sha_fails_closed
(empty entry + non-empty fresh -> downgrade, no budget consumed).
Updated _run_orchestrator_capturing_context in
test_agentic_checkup_orchestrator.py to mock _fetch_pr_metadata so the
new no-fix freshness check sees stable SHAs (previously the gh-CLI
call failed silently in the sandbox, which the pre-rebase no-fix path
did not exercise).

Prompt: updated agentic_checkup_orchestrator_python.prompt with the
Checkpoint A2 bullet and the three-condition --no-fix downgrade so a
future pdd sync regenerates with the fix in place.

LC_ALL=C pytest tests/test_agentic_checkup_orchestrator.py tests/test_checkup_pr_mode.py -m "not integration and not e2e and not real" --timeout=60 -> 204/204 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad force-pushed the fix/issue-1116-pr-head-rerun branch from 31b041c to ecdfde0 Compare May 22, 2026 18:13
Serhan-Asad and others added 2 commits May 22, 2026 11:36
External review filed three follow-up gaps on PR #1134 after round-4:

Finding 1 (successful push paths still have stale-report race): the
round-1 Checkpoint C only fired on "No changes to push." / "No eligible
changes to push.", leaving normal push and post-rebase reverify-passed
paths open. If the remote head advances after our push but before
_post_pr_mode_final_report, the canonical Step 7 verdict belongs to an
older head than what the PR thread now shows.

Fix: broaden to Checkpoint D — fire on every successful push outcome
(failure paths are already covered by Checkpoint B). Baseline is
_git_rev_parse_head(worktree_path) (the local pushed SHA) instead of
current_pr_head_sha — after a real push the remote head moves BECAUSE
of us, so comparing against entry SHA would false-positive every push.
Local HEAD equals our pushed commit on normal/rebase paths and equals
entry SHA on the no-change path, so Checkpoint D subsumes the round-1
Checkpoint C.

Finding 2 (fix-mode disables freshness lease when entry SHA empty):
every fix-mode checkpoint (A, A2, B, D) gates on a non-empty
current_pr_head_sha. If the entry _fetch_pr_metadata returned empty
(transient gh hiccup) the entire lease silently disabled, opening the
same stale-verdict hole the lease was built to close.

Fix: fail-closed at fix-mode PR-mode entry when current_pr_head_sha is
empty. Symmetric with the round-3 --no-fix post-Step-7 fail-closed
downgrade — the user retries once gh recovers.

Finding 3 (README missing user-visible rerun behavior): the new
auto-rerun affects latency/cost/recovery; operators need to recognize
the budget-exhaustion failure surface and know where the sidecar lives.

Fix: README section under PR Mode covering auto-rerun cap=2, sidecar
path (.pdd/checkup-pr-{N}/pr_head_refreshes), budget-exhaustion
diagnostic format, fix-mode entry-SHA fail-closed, and --no-fix
fail-closed semantics.

Tests: 1 new test (test_pr_head_advance_after_successful_real_push_
triggers_rerun) covering Checkpoint D on a real push followed by
third-party advance; existing test_empty_entry_pr_head_sha_does_not_
consume_refresh_budget renamed and rewritten as
test_empty_entry_pr_head_sha_fix_mode_fails_closed for the new
fail-closed semantics. Updated metadata_mock.call_count and
_git_rev_parse_head side_effects in three existing tests because
Checkpoint D adds one metadata fetch per successful push iteration and
expects local HEAD to equal the pushed SHA. Parametrized
test_resume_discards_cache_when_either_side_sha_is_empty switched to
no_fix=True so the new fix-mode entry-SHA gate does not mask the
state-identity guard behavior under test; added a fetch mock to
test_step_8_skipped_in_pr_mode and test_pr_mode_commits_and_pushes_
generated_fixes.

Prompt source: agentic_checkup_orchestrator_python.prompt updated with
Checkpoint D documentation and the fix-mode entry-SHA fail-closed
contract so a future pdd sync regenerates with both guarantees.

LC_ALL=C pytest tests/test_agentic_checkup_orchestrator.py tests/test_checkup_pr_mode.py -m "not integration and not e2e and not real" --timeout=60 -> 205/205 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud-test on the round-5 commit (9dc17b2) passed 77/77 chunks with
no failures. Captured durations refresh the chunk-balancing input so
future runs distribute load against the current PR-mode test surface
(the round-3/4/5 commits added ~200+ new test cases in
test_checkup_pr_mode.py + test_agentic_checkup_orchestrator.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gltanaka gltanaka merged commit 20f7025 into main May 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make pdd checkup --pr rerun when PR head advances mid-checkup

3 participants