fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125) by Serhan-Asad · Pull Request #1130 · promptdriven/pdd

Serhan-Asad · 2026-05-21T21:48:56Z

Summary

Detect strongly connected components in the sync target graph and treat intra-SCC dep edges as soft so legitimate dependency cycles (e.g. agentic_checkup_orchestrator <-> checkup_review_loop) no longer deadlock pdd sync with the misleading Succeeded: []. Skipped (not run): [...] message.
Schedule one cycle member at a time (serialize within an SCC); cross-SCC consumers wait for the entire upstream SCC, so a downstream module never starts before all cycle peers finish.
Rewrite _get_blocked_modules over the SCC condensation (a DAG) so failures propagate through cycles correctly and the DFS can never cache an incorrect False on cycle re-entry.

Closes

Closes #1125

What changed

pdd/sync_order.py: new compute_sccs() (iterative Tarjan's), deterministic ordering.
pdd/agentic_sync_runner.py: AsyncSyncRunner.__init__ builds _scc_of + _cyclic_sccs over the basename-induced subgraph; _get_ready_modules applies soft intra-SCC edges, serializes SCC execution, and waits for the whole upstream SCC; _get_blocked_modules walks the condensation.
DurableSyncRunner (subclass) inherits the fix without changes — verified by smoke test.

Test plan

Reproduce the exact graph from bug: pdd sync dependency cycles leave all modules pending #1125 — runner now returns (True, "All 3 modules synced successfully", 0.0) (was (False, "Succeeded: []. Skipped (not run): [...]", 0.0)).
New TestCycleScheduling covers: 2-cycle ready ordering, running peer serializes, success unblocks peer, failed peer blocks peer, 3-module cycle with external dependent, acyclic regression, self-loop, self-loop + external dep, external consumer waits for whole SCC, external consumer blocked when cycle peer fails, run() succeeds, run() with one failure surfaces Skipped (blocked) (never Skipped (not run)).
New test_compute_sccs_deterministic_regardless_of_dep_order.
All existing TestGetReadyModules and TestGetBlockedModules tests pass unchanged.
Full repo test suite re-run: no new failures in test_agentic_sync_runner.py or test_sync_order.py (one pre-existing test_extract_module_known_languages_comprehensive lisp test remains, unrelated to this PR).

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Serhan-Asad · 2026-05-21T22:15:45Z

Addressed the review in dfac9a2. Summary:

Finding 1 — _get_ready_modules now walks the union of every cycle peer's cross-SCC deps. With a→b, b→[a,x], x→[], the cycle now correctly waits for x to succeed before either cycle member starts. Intra-SCC edges remain soft.

Finding 2 — Same change makes ready/blocked mutually exclusive in the cycle-peer-external-failure case. With x failed, _get_ready_modules() == [] (was ['a']) and _get_blocked_modules() == ['a', 'b'].

Finding 3 — PDD source-of-truth updated:

prompts/sync_order_python.prompt documents the new compute_sccs requirement.
prompts/agentic_sync_runner_python.prompt depends on sync_order_python.prompt and rewrites the dependency-scheduling requirement to describe SCC-aware semantics (soft intra-SCC, whole-upstream-SCC gating, serialized cycle execution, condensation-based blocked classification).
architecture.json adds compute_sccs to sync_order's interface and sync_order_python.prompt to agentic_sync_runner's deps.

Also tightened the compute_sccs ordering docstring per a codex NIT — Tarjan emits deps first under the project's A depends on B edge convention.

Two new regression tests:

test_cycle_member_waits_for_peers_external_dep (Finding 1)
test_cycle_member_blocked_by_peers_external_failed_dep (Finding 2)

All 14 cycle-scheduling + 201 runner/sync_order tests pass; the one remaining test_sync_order.py failure (test_extract_module_known_languages_comprehensive, lisp) is unrelated and pre-existing on main.

Serhan-Asad · 2026-05-21T22:35:18Z

Pushed c762da5 addressing review findings 2 and 3:

Finding 2 (architecture.json churn) — reverted the previous edit (which re-serialized the file via json.dump and escaped UTF-8 em-dashes to — while losing the trailing newline). Re-applied only the intended changes via surgical Edits. Final diff vs main is now 26 lines, all in the two targeted entries; UTF-8 chars and trailing newline preserved.

Finding 3 (missing module_targets) — added module_targets: Optional[Dict[str, str]] = None to the AsyncSyncRunner signature in architecture.json. The parameter was already in the prompt header and the code; only the architecture interface metadata was stale. I also extended the sideEffects entry to mention the new SCC scheduling so the metadata reflects the runner's actual behaviour.

Finding 1 (auto-heal failing) — investigated. The error prompt_path prompts/generated_python.prompt set but missing on disk post-update comes from pdd/ci_drift_heal.py:1271 and is triggered inside Cloud Build (the GHA log only surfaces the wait-for-Cloud-Build failure). I cannot find any reference to prompts/generated_python.prompt in the repo at this PR head — no such file exists, no architecture entry, no prompt include. The same heal and auto-heal checks are currently red on PRs #1121, #1126, and #1131 (#1131 is still in-progress as of this comment), all on unrelated changes. That makes auto-heal look like a pre-existing CI infrastructure issue, not something this PR introduced. Happy to dig further into the Cloud Build logs if someone with GCP access can share them, but I don't think there's anything to fix on this branch.

Serhan-Asad · 2026-05-21T23:47:56Z

Cloud-test verification on `c762da538` (PR head)

Ran make cloud-test against the latest PR commit from a clean worktree.

Result: 77 passed, 0 failed, 0 errors (of 77 tasks)

Job: pdd-test-run-20260521-161855-c762da538 + ...-std
Commit: c762da538
77 passed, 0 failed, 0 errors (of 77 tasks)

Breakdown

31 pytest chunks
22 regression cases
15 sync_regression cases
8 cloud_regression cases
1 vitest frontend suite
1 STANDARD-job sync_regression case (immune to SPOT preemption)

Notes

First run had 1 error on chunk_16: task killed before completion (likely SPOT preemption) — infrastructure preemption, not a test failure. Re-running make cloud-test-quick on the same commit was clean (Cloud Build image rebuild was skipped since the image-deps hash didn't change). The job IDs are linked above for traceability.
The pre-existing local test_extract_module_known_languages_comprehensive failure (lisp language detection in tests/test_sync_order.py) also passes in Cloud Batch — that one is environment-specific to my local checkout.

What this validates beyond the SCC scheduler

compute_sccs + the new SCC scheduling paths exercised via every sync_regression case (15 cases) and the full agentic_sync_runner test surface inside the pytest chunks.
Architecture metadata updates (sync_order interface + agentic_sync_runner deps + module_targets) survive the full pytest + regression suite.
Prompt-side updates (sync_order_python.prompt new compute_sccs requirement + agentic_sync_runner_python.prompt SCC-aware requirement 4) don't break any prompt-consuming tests.

Serhan-Asad · 2026-05-22T00:00:12Z

Independent verification — ship-ready

Verified PR head c762da538 end-to-end in an isolated worktree (/private/tmp/pdd-1125-work, fresh Python 3.13 venv). Zero regressions introduced by this PR.

Evidence

Check	Result
Bug repro on main	Reproduced exactly — `(False, "Succeeded: []. Skipped (not run): [...]", 0.0)`
Fix verify on PR head	`(True, "All 3 modules synced successfully", 0.0)`
Full filtered suite at PR head	8809 passed, 59 failed (44 unique paths), 1554s
Same 44 failing paths re-run on main	44 failed — same count → zero regressions
Targeted regression (10 test files importing changed modules)	1207 passed, 1 skip, 1 fail (pre-existing lisp test, also fails on main)
Lint `pylint -E -F pdd/sync_order.py pdd/agentic_sync_runner.py`	10.00/10
DurableSyncRunner inheritance	Safe — overrides `__init__`/`run()` but both call `super()`; doesn't override `_get_ready_modules`/`_get_blocked_modules`

Extra blind-spot tests beyond the diff (all pass)

1000-node cycle through compute_sccs — iterative Tarjan handles it even under sys.setrecursionlimit(200). Runner constructs and serializes correctly.
Two disjoint cycles in one target set — each SCC serializes independently, downstream waits for the right SCC.
Cycle + parallel acyclic chain with a failed cycle member — independent acyclic chain still completes (partial-progress invariant preserved). Message correctly reports Skipped (blocked): not Skipped (not run):.
DurableSyncRunner constructed with a cycle — _scc_of / _cyclic_sccs inherited correctly, scheduling uses SCC logic.

Orchestration end-to-end (Phase E)

Drove run_global_sync() in-process with real AsyncSyncRunner + cyclic dep graph, mocking only _sync_one_module (avoids LLM cost). Exercises the full path: architecture parse → analyze → build_dep_graph → AsyncSyncRunner construction → runner.run(). 13/13 checks pass on both the success path and the partial-progress-on-failure path.

A subprocess pdd sync invocation wouldn't add value here — the click layer between argv and run_global_sync doesn't touch dep graphs or scheduling, and a real subprocess would either need LLM calls or the same _sync_one_module patch (which can't cross a subprocess boundary).

Flip-check (sanity)

Temporarily neutralized the SCC init in AsyncSyncRunner.__init__ while keeping the rest of the fix in place. The repro deadlocks again. Confirms the SCC init is load-bearing — unit tests aren't passing for unrelated reasons.

Auto-heal CI failure — infrastructure, not this PR

Discriminating evidence: PR #1133 fails the same heal/auto-heal Cloud Build step without touching architecture.json or any .prompt file. If the failure were caused by this PR's architecture/prompt edits, #1133 couldn't fail the same way. Flag to infra; should not block merge.

Reproducer scripts

Retained in /private/tmp/:

pdd-1125-verify-repro.py — issue-1125 graph repro
pdd-1125-verify-blindspots.py — algorithm blind-spot tests
pdd-1125-verify-e2e.py — orchestration end-to-end

Serhan-Asad · 2026-05-22T00:19:39Z

Note on cycle-member pick order

Worth flagging since it came up in review discussion: the runner picks the first cycle member in basenames order. This is deterministic but arbitrary — within a strongly-connected component, no module is semantically "before" another, but the same input always picks the same module first.

Does the choice matter?

Success path: no. All orderings produce the same final state.
Failure path: slightly. Example with 3-cycle [A, B, C] where B is buggy:
- Pick A first → A succeeds → try B → B fails → C blocked. Got 1 done.
- Pick B first → B fails → A, C both blocked. Got 0 done.

So picking the failing one early wastes a healthy peer's chance. But you can't predict which will fail.

Could it be smarter?

Parallel cycle execution — abandoned to avoid races on shared state.
Treat the cycle as one mega-module, sync together — most principled fix, but a large engine rewrite.
Heuristics (smallest first, least-recently-modified first) — marginal, fragile.

Bottom line: for the motivating case (both cycle members healthy, re-syncing after a prompt change), order doesn't matter. For failures, worst case wastes one module-sync per re-run; user re-runs, fingerprints preserve partial progress, done.

This PR fixes "silent deadlock → works." Failure-recovery optimization is a different scope. The pick not being smart is fine; the pick being deterministic is what matters — reproducible bugs, predictable CI.

One tiny follow-up worth doing: add a regression test that basenames order within a cycle is deterministic from _dependency_ordered_modules (not subject to dict-iteration order on dep_graph). The existing test_compute_sccs_deterministic_regardless_of_dep_order covers the SCC computation; this would cover the orchestration step that feeds it.

) AsyncSyncRunner._get_ready_modules required every dep in the target set to be success before a module could run. With a legitimate dependency cycle (e.g. agentic_checkup_orchestrator <-> checkup_review_loop using Python late imports), no cycle member ever satisfies that gate, so the runner exited with the misleading "Succeeded: []. Skipped (not run): [...]" message without dispatching any work. Compute strongly connected components of the basename-induced dep graph and apply soft-edge semantics inside each cyclic SCC: - Inside a cyclic SCC (size > 1, or a 1-node SCC with a self-loop), intra-SCC edges no longer block readiness; only a failed peer does. - At most one member of a cyclic SCC runs at a time (serialize cycle execution to avoid races on shared include content). - Cross-SCC consumers wait for the whole upstream SCC, not just the named dep, so a downstream module never starts before all cycle peers finish, and a failed peer cannot silently advance a consumer. _get_blocked_modules now walks the SCC condensation (a DAG), so failed status propagates through cycles correctly. The per-module DFS could previously cache an incorrect False on cycle re-entry, leaving cycle peers and external consumers wrongly unblocked. Adds compute_sccs (iterative Tarjan with deterministic ordering) to sync_order. The DurableSyncRunner subclass inherits the fix transparently. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address PR review findings on top of the initial SCC scheduling commit: 1. `_get_ready_modules` now treats the SCC's effective external deps as the union of every member's cross-SCC deps. With graph `a->b, b->[a,x], x->[]`, a cycle member `a` previously became ready while its peer's external dep `x` was still pending (and could even schedule when `x` had already failed). Intra-SCC edges remain soft. 2. The whole-SCC dep walk also fixes the ready/blocked inconsistency: when `x` is failed, `_get_ready_modules` no longer returns the cycle members while `_get_blocked_modules` flags them. 3. Update the PDD source-of-truth so a future regeneration preserves the fix: - `prompts/sync_order_python.prompt`: document the new `compute_sccs(graph) -> List[List[str]]` requirement. - `prompts/agentic_sync_runner_python.prompt`: depend on `sync_order_python.prompt`; rewrite the dependency-scheduling requirement to describe SCC-aware soft intra-SCC edges, whole upstream SCC gating, serialized cycle execution, and condensation- based blocked classification. - `architecture.json`: add `compute_sccs` to sync_order's interface and `sync_order_python.prompt` to agentic_sync_runner's deps. 4. Tighten the `compute_sccs` ordering docstring — Tarjan emits deps-first under the project's `A depends on B` edge convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Revert the previous architecture.json edit (which re-serialized the whole file and escaped UTF-8 em-dashes to \u sequences while losing the trailing newline), then re-apply only the three intended fields via surgical edits: - Add `compute_sccs` to `sync_order_python.prompt` interface. - Add `sync_order_python.prompt` to `agentic_sync_runner_python.prompt` dependencies. - Add `module_targets: Optional[Dict[str, str]] = None` to the `AsyncSyncRunner` signature (it was already in the prompt and the code, but the architecture interface metadata was stale). Also extend the AsyncSyncRunner sideEffects entry to mention the new SCC scheduling so the metadata reflects the runner's actual behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

Serhan-Asad and others added 3 commits May 21, 2026 21:43

Serhan-Asad force-pushed the fix/issue-1125 branch from c762da5 to f7b9368 Compare May 22, 2026 04:44

Merge branch 'main' into fix/issue-1125

871a709

gltanaka merged commit 6df9c2d into main May 22, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125)#1130

fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125)#1130
gltanaka merged 4 commits into
mainfrom
fix/issue-1125

Serhan-Asad commented May 21, 2026

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

Serhan-Asad commented May 22, 2026

Uh oh!

Serhan-Asad commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Serhan-Asad commented May 21, 2026

Summary

Closes

What changed

Test plan

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

Serhan-Asad commented May 21, 2026

Uh oh!

Serhan-Asad commented May 21, 2026

Cloud-test verification on c762da538 (PR head)

Breakdown

Notes

What this validates beyond the SCC scheduler

Uh oh!

Serhan-Asad commented May 22, 2026

Independent verification — ship-ready

Evidence

Extra blind-spot tests beyond the diff (all pass)

Orchestration end-to-end (Phase E)

Flip-check (sanity)

Auto-heal CI failure — infrastructure, not this PR

Reproducer scripts

Uh oh!

Serhan-Asad commented May 22, 2026

Note on cycle-member pick order

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Cloud-test verification on `c762da538` (PR head)