fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125)#1130
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Addressed the review in dfac9a2. Summary: Finding 1 — Finding 2 — Same change makes ready/blocked mutually exclusive in the cycle-peer-external-failure case. With Finding 3 — PDD source-of-truth updated:
Also tightened the Two new regression tests:
All 14 cycle-scheduling + 201 runner/sync_order tests pass; the one remaining |
|
Pushed c762da5 addressing review findings 2 and 3: Finding 2 (architecture.json churn) — reverted the previous edit (which re-serialized the file via Finding 3 (missing module_targets) — added Finding 1 (auto-heal failing) — investigated. The error |
Cloud-test verification on
|
Independent verification — ship-readyVerified PR head Evidence
Extra blind-spot tests beyond the diff (all pass)
Orchestration end-to-end (Phase E)Drove A subprocess Flip-check (sanity)Temporarily neutralized the SCC init in Auto-heal CI failure — infrastructure, not this PRDiscriminating evidence: PR #1133 fails the same Reproducer scriptsRetained in
|
Note on cycle-member pick orderWorth flagging since it came up in review discussion: the runner picks the first cycle member in Does the choice matter?
So picking the failing one early wastes a healthy peer's chance. But you can't predict which will fail. Could it be smarter?
Bottom line: for the motivating case (both cycle members healthy, re-syncing after a prompt change), order doesn't matter. For failures, worst case wastes one module-sync per re-run; user re-runs, fingerprints preserve partial progress, done. This PR fixes "silent deadlock → works." Failure-recovery optimization is a different scope. The pick not being smart is fine; the pick being deterministic is what matters — reproducible bugs, predictable CI. One tiny follow-up worth doing: add a regression test that |
) AsyncSyncRunner._get_ready_modules required every dep in the target set to be success before a module could run. With a legitimate dependency cycle (e.g. agentic_checkup_orchestrator <-> checkup_review_loop using Python late imports), no cycle member ever satisfies that gate, so the runner exited with the misleading "Succeeded: []. Skipped (not run): [...]" message without dispatching any work. Compute strongly connected components of the basename-induced dep graph and apply soft-edge semantics inside each cyclic SCC: - Inside a cyclic SCC (size > 1, or a 1-node SCC with a self-loop), intra-SCC edges no longer block readiness; only a failed peer does. - At most one member of a cyclic SCC runs at a time (serialize cycle execution to avoid races on shared include content). - Cross-SCC consumers wait for the whole upstream SCC, not just the named dep, so a downstream module never starts before all cycle peers finish, and a failed peer cannot silently advance a consumer. _get_blocked_modules now walks the SCC condensation (a DAG), so failed status propagates through cycles correctly. The per-module DFS could previously cache an incorrect False on cycle re-entry, leaving cycle peers and external consumers wrongly unblocked. Adds compute_sccs (iterative Tarjan with deterministic ordering) to sync_order. The DurableSyncRunner subclass inherits the fix transparently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review findings on top of the initial SCC scheduling commit:
1. `_get_ready_modules` now treats the SCC's effective external deps as
the union of every member's cross-SCC deps. With graph `a->b, b->[a,x],
x->[]`, a cycle member `a` previously became ready while its peer's
external dep `x` was still pending (and could even schedule when `x`
had already failed). Intra-SCC edges remain soft.
2. The whole-SCC dep walk also fixes the ready/blocked inconsistency:
when `x` is failed, `_get_ready_modules` no longer returns the cycle
members while `_get_blocked_modules` flags them.
3. Update the PDD source-of-truth so a future regeneration preserves
the fix:
- `prompts/sync_order_python.prompt`: document the new
`compute_sccs(graph) -> List[List[str]]` requirement.
- `prompts/agentic_sync_runner_python.prompt`: depend on
`sync_order_python.prompt`; rewrite the dependency-scheduling
requirement to describe SCC-aware soft intra-SCC edges, whole
upstream SCC gating, serialized cycle execution, and condensation-
based blocked classification.
- `architecture.json`: add `compute_sccs` to sync_order's interface
and `sync_order_python.prompt` to agentic_sync_runner's deps.
4. Tighten the `compute_sccs` ordering docstring — Tarjan emits
deps-first under the project's `A depends on B` edge convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the previous architecture.json edit (which re-serialized the whole file and escaped UTF-8 em-dashes to \u sequences while losing the trailing newline), then re-apply only the three intended fields via surgical edits: - Add `compute_sccs` to `sync_order_python.prompt` interface. - Add `sync_order_python.prompt` to `agentic_sync_runner_python.prompt` dependencies. - Add `module_targets: Optional[Dict[str, str]] = None` to the `AsyncSyncRunner` signature (it was already in the prompt and the code, but the architecture interface metadata was stale). Also extend the AsyncSyncRunner sideEffects entry to mention the new SCC scheduling so the metadata reflects the runner's actual behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c762da5 to
f7b9368
Compare
Summary
agentic_checkup_orchestrator <-> checkup_review_loop) no longer deadlockpdd syncwith the misleadingSucceeded: []. Skipped (not run): [...]message._get_blocked_modulesover the SCC condensation (a DAG) so failures propagate through cycles correctly and the DFS can never cache an incorrectFalseon cycle re-entry.Closes
Closes #1125
What changed
pdd/sync_order.py: newcompute_sccs()(iterative Tarjan's), deterministic ordering.pdd/agentic_sync_runner.py:AsyncSyncRunner.__init__builds_scc_of+_cyclic_sccsover the basename-induced subgraph;_get_ready_modulesapplies soft intra-SCC edges, serializes SCC execution, and waits for the whole upstream SCC;_get_blocked_moduleswalks the condensation.DurableSyncRunner(subclass) inherits the fix without changes — verified by smoke test.Test plan
(True, "All 3 modules synced successfully", 0.0)(was(False, "Succeeded: []. Skipped (not run): [...]", 0.0)).TestCycleSchedulingcovers: 2-cycle ready ordering, running peer serializes, success unblocks peer, failed peer blocks peer, 3-module cycle with external dependent, acyclic regression, self-loop, self-loop + external dep, external consumer waits for whole SCC, external consumer blocked when cycle peer fails, run() succeeds, run() with one failure surfacesSkipped (blocked)(neverSkipped (not run)).test_compute_sccs_deterministic_regardless_of_dep_order.TestGetReadyModulesandTestGetBlockedModulestests pass unchanged.test_agentic_sync_runner.pyortest_sync_order.py(one pre-existingtest_extract_module_known_languages_comprehensivelisp test remains, unrelated to this PR).