Skip to content

fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125)#1130

Merged
gltanaka merged 4 commits into
mainfrom
fix/issue-1125
May 22, 2026
Merged

fix(sync): schedule cyclic dependency SCCs instead of deadlocking (#1125)#1130
gltanaka merged 4 commits into
mainfrom
fix/issue-1125

Conversation

@Serhan-Asad
Copy link
Copy Markdown
Collaborator

Summary

  • Detect strongly connected components in the sync target graph and treat intra-SCC dep edges as soft so legitimate dependency cycles (e.g. agentic_checkup_orchestrator <-> checkup_review_loop) no longer deadlock pdd sync with the misleading Succeeded: []. Skipped (not run): [...] message.
  • Schedule one cycle member at a time (serialize within an SCC); cross-SCC consumers wait for the entire upstream SCC, so a downstream module never starts before all cycle peers finish.
  • Rewrite _get_blocked_modules over the SCC condensation (a DAG) so failures propagate through cycles correctly and the DFS can never cache an incorrect False on cycle re-entry.

Closes

Closes #1125

What changed

  • pdd/sync_order.py: new compute_sccs() (iterative Tarjan's), deterministic ordering.
  • pdd/agentic_sync_runner.py: AsyncSyncRunner.__init__ builds _scc_of + _cyclic_sccs over the basename-induced subgraph; _get_ready_modules applies soft intra-SCC edges, serializes SCC execution, and waits for the whole upstream SCC; _get_blocked_modules walks the condensation.
  • DurableSyncRunner (subclass) inherits the fix without changes — verified by smoke test.

Test plan

  • Reproduce the exact graph from bug: pdd sync dependency cycles leave all modules pending #1125 — runner now returns (True, "All 3 modules synced successfully", 0.0) (was (False, "Succeeded: []. Skipped (not run): [...]", 0.0)).
  • New TestCycleScheduling covers: 2-cycle ready ordering, running peer serializes, success unblocks peer, failed peer blocks peer, 3-module cycle with external dependent, acyclic regression, self-loop, self-loop + external dep, external consumer waits for whole SCC, external consumer blocked when cycle peer fails, run() succeeds, run() with one failure surfaces Skipped (blocked) (never Skipped (not run)).
  • New test_compute_sccs_deterministic_regardless_of_dep_order.
  • All existing TestGetReadyModules and TestGetBlockedModules tests pass unchanged.
  • Full repo test suite re-run: no new failures in test_agentic_sync_runner.py or test_sync_order.py (one pre-existing test_extract_module_known_languages_comprehensive lisp test remains, unrelated to this PR).

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Addressed the review in dfac9a2. Summary:

Finding 1_get_ready_modules now walks the union of every cycle peer's cross-SCC deps. With a→b, b→[a,x], x→[], the cycle now correctly waits for x to succeed before either cycle member starts. Intra-SCC edges remain soft.

Finding 2 — Same change makes ready/blocked mutually exclusive in the cycle-peer-external-failure case. With x failed, _get_ready_modules() == [] (was ['a']) and _get_blocked_modules() == ['a', 'b'].

Finding 3 — PDD source-of-truth updated:

  • prompts/sync_order_python.prompt documents the new compute_sccs requirement.
  • prompts/agentic_sync_runner_python.prompt depends on sync_order_python.prompt and rewrites the dependency-scheduling requirement to describe SCC-aware semantics (soft intra-SCC, whole-upstream-SCC gating, serialized cycle execution, condensation-based blocked classification).
  • architecture.json adds compute_sccs to sync_order's interface and sync_order_python.prompt to agentic_sync_runner's deps.

Also tightened the compute_sccs ordering docstring per a codex NIT — Tarjan emits deps first under the project's A depends on B edge convention.

Two new regression tests:

  • test_cycle_member_waits_for_peers_external_dep (Finding 1)
  • test_cycle_member_blocked_by_peers_external_failed_dep (Finding 2)

All 14 cycle-scheduling + 201 runner/sync_order tests pass; the one remaining test_sync_order.py failure (test_extract_module_known_languages_comprehensive, lisp) is unrelated and pre-existing on main.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Pushed c762da5 addressing review findings 2 and 3:

Finding 2 (architecture.json churn) — reverted the previous edit (which re-serialized the file via json.dump and escaped UTF-8 em-dashes to while losing the trailing newline). Re-applied only the intended changes via surgical Edits. Final diff vs main is now 26 lines, all in the two targeted entries; UTF-8 chars and trailing newline preserved.

Finding 3 (missing module_targets) — added module_targets: Optional[Dict[str, str]] = None to the AsyncSyncRunner signature in architecture.json. The parameter was already in the prompt header and the code; only the architecture interface metadata was stale. I also extended the sideEffects entry to mention the new SCC scheduling so the metadata reflects the runner's actual behaviour.

Finding 1 (auto-heal failing) — investigated. The error prompt_path prompts/generated_python.prompt set but missing on disk post-update comes from pdd/ci_drift_heal.py:1271 and is triggered inside Cloud Build (the GHA log only surfaces the wait-for-Cloud-Build failure). I cannot find any reference to prompts/generated_python.prompt in the repo at this PR head — no such file exists, no architecture entry, no prompt include. The same heal and auto-heal checks are currently red on PRs #1121, #1126, and #1131 (#1131 is still in-progress as of this comment), all on unrelated changes. That makes auto-heal look like a pre-existing CI infrastructure issue, not something this PR introduced. Happy to dig further into the Cloud Build logs if someone with GCP access can share them, but I don't think there's anything to fix on this branch.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Cloud-test verification on c762da538 (PR head)

Ran make cloud-test against the latest PR commit from a clean worktree.

Result: 77 passed, 0 failed, 0 errors (of 77 tasks)

Job: pdd-test-run-20260521-161855-c762da538 + ...-std
Commit: c762da538
77 passed, 0 failed, 0 errors (of 77 tasks)

Breakdown

  • 31 pytest chunks
  • 22 regression cases
  • 15 sync_regression cases
  • 8 cloud_regression cases
  • 1 vitest frontend suite
  • 1 STANDARD-job sync_regression case (immune to SPOT preemption)

Notes

  • First run had 1 error on chunk_16: task killed before completion (likely SPOT preemption) — infrastructure preemption, not a test failure. Re-running make cloud-test-quick on the same commit was clean (Cloud Build image rebuild was skipped since the image-deps hash didn't change). The job IDs are linked above for traceability.
  • The pre-existing local test_extract_module_known_languages_comprehensive failure (lisp language detection in tests/test_sync_order.py) also passes in Cloud Batch — that one is environment-specific to my local checkout.

What this validates beyond the SCC scheduler

  • compute_sccs + the new SCC scheduling paths exercised via every sync_regression case (15 cases) and the full agentic_sync_runner test surface inside the pytest chunks.
  • Architecture metadata updates (sync_order interface + agentic_sync_runner deps + module_targets) survive the full pytest + regression suite.
  • Prompt-side updates (sync_order_python.prompt new compute_sccs requirement + agentic_sync_runner_python.prompt SCC-aware requirement 4) don't break any prompt-consuming tests.

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Independent verification — ship-ready

Verified PR head c762da538 end-to-end in an isolated worktree (/private/tmp/pdd-1125-work, fresh Python 3.13 venv). Zero regressions introduced by this PR.

Evidence

Check Result
Bug repro on main Reproduced exactly — (False, "Succeeded: []. Skipped (not run): [...]", 0.0)
Fix verify on PR head (True, "All 3 modules synced successfully", 0.0)
Full filtered suite at PR head 8809 passed, 59 failed (44 unique paths), 1554s
Same 44 failing paths re-run on main 44 failed — same count → zero regressions
Targeted regression (10 test files importing changed modules) 1207 passed, 1 skip, 1 fail (pre-existing lisp test, also fails on main)
Lint pylint -E -F pdd/sync_order.py pdd/agentic_sync_runner.py 10.00/10
DurableSyncRunner inheritance Safe — overrides __init__/run() but both call super(); doesn't override _get_ready_modules/_get_blocked_modules

Extra blind-spot tests beyond the diff (all pass)

  1. 1000-node cycle through compute_sccs — iterative Tarjan handles it even under sys.setrecursionlimit(200). Runner constructs and serializes correctly.
  2. Two disjoint cycles in one target set — each SCC serializes independently, downstream waits for the right SCC.
  3. Cycle + parallel acyclic chain with a failed cycle member — independent acyclic chain still completes (partial-progress invariant preserved). Message correctly reports Skipped (blocked): not Skipped (not run):.
  4. DurableSyncRunner constructed with a cycle_scc_of / _cyclic_sccs inherited correctly, scheduling uses SCC logic.

Orchestration end-to-end (Phase E)

Drove run_global_sync() in-process with real AsyncSyncRunner + cyclic dep graph, mocking only _sync_one_module (avoids LLM cost). Exercises the full path: architecture parse → analyze → build_dep_graph → AsyncSyncRunner construction → runner.run(). 13/13 checks pass on both the success path and the partial-progress-on-failure path.

A subprocess pdd sync invocation wouldn't add value here — the click layer between argv and run_global_sync doesn't touch dep graphs or scheduling, and a real subprocess would either need LLM calls or the same _sync_one_module patch (which can't cross a subprocess boundary).

Flip-check (sanity)

Temporarily neutralized the SCC init in AsyncSyncRunner.__init__ while keeping the rest of the fix in place. The repro deadlocks again. Confirms the SCC init is load-bearing — unit tests aren't passing for unrelated reasons.

Auto-heal CI failure — infrastructure, not this PR

Discriminating evidence: PR #1133 fails the same heal/auto-heal Cloud Build step without touching architecture.json or any .prompt file. If the failure were caused by this PR's architecture/prompt edits, #1133 couldn't fail the same way. Flag to infra; should not block merge.

Reproducer scripts

Retained in /private/tmp/:

  • pdd-1125-verify-repro.py — issue-1125 graph repro
  • pdd-1125-verify-blindspots.py — algorithm blind-spot tests
  • pdd-1125-verify-e2e.py — orchestration end-to-end

@Serhan-Asad
Copy link
Copy Markdown
Collaborator Author

Note on cycle-member pick order

Worth flagging since it came up in review discussion: the runner picks the first cycle member in basenames order. This is deterministic but arbitrary — within a strongly-connected component, no module is semantically "before" another, but the same input always picks the same module first.

Does the choice matter?

  • Success path: no. All orderings produce the same final state.
  • Failure path: slightly. Example with 3-cycle [A, B, C] where B is buggy:
    • Pick A first → A succeeds → try B → B fails → C blocked. Got 1 done.
    • Pick B first → B fails → A, C both blocked. Got 0 done.

So picking the failing one early wastes a healthy peer's chance. But you can't predict which will fail.

Could it be smarter?

  1. Parallel cycle execution — abandoned to avoid races on shared state.
  2. Treat the cycle as one mega-module, sync together — most principled fix, but a large engine rewrite.
  3. Heuristics (smallest first, least-recently-modified first) — marginal, fragile.

Bottom line: for the motivating case (both cycle members healthy, re-syncing after a prompt change), order doesn't matter. For failures, worst case wastes one module-sync per re-run; user re-runs, fingerprints preserve partial progress, done.

This PR fixes "silent deadlock → works." Failure-recovery optimization is a different scope. The pick not being smart is fine; the pick being deterministic is what matters — reproducible bugs, predictable CI.

One tiny follow-up worth doing: add a regression test that basenames order within a cycle is deterministic from _dependency_ordered_modules (not subject to dict-iteration order on dep_graph). The existing test_compute_sccs_deterministic_regardless_of_dep_order covers the SCC computation; this would cover the orchestration step that feeds it.

Serhan-Asad and others added 3 commits May 21, 2026 21:43
)

AsyncSyncRunner._get_ready_modules required every dep in the target set to
be success before a module could run. With a legitimate dependency cycle
(e.g. agentic_checkup_orchestrator <-> checkup_review_loop using Python
late imports), no cycle member ever satisfies that gate, so the runner
exited with the misleading "Succeeded: []. Skipped (not run): [...]"
message without dispatching any work.

Compute strongly connected components of the basename-induced dep graph
and apply soft-edge semantics inside each cyclic SCC:

- Inside a cyclic SCC (size > 1, or a 1-node SCC with a self-loop),
  intra-SCC edges no longer block readiness; only a failed peer does.
- At most one member of a cyclic SCC runs at a time (serialize cycle
  execution to avoid races on shared include content).
- Cross-SCC consumers wait for the whole upstream SCC, not just the
  named dep, so a downstream module never starts before all cycle
  peers finish, and a failed peer cannot silently advance a consumer.

_get_blocked_modules now walks the SCC condensation (a DAG), so failed
status propagates through cycles correctly. The per-module DFS could
previously cache an incorrect False on cycle re-entry, leaving cycle
peers and external consumers wrongly unblocked.

Adds compute_sccs (iterative Tarjan with deterministic ordering) to
sync_order. The DurableSyncRunner subclass inherits the fix transparently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review findings on top of the initial SCC scheduling commit:

1. `_get_ready_modules` now treats the SCC's effective external deps as
   the union of every member's cross-SCC deps. With graph `a->b, b->[a,x],
   x->[]`, a cycle member `a` previously became ready while its peer's
   external dep `x` was still pending (and could even schedule when `x`
   had already failed). Intra-SCC edges remain soft.

2. The whole-SCC dep walk also fixes the ready/blocked inconsistency:
   when `x` is failed, `_get_ready_modules` no longer returns the cycle
   members while `_get_blocked_modules` flags them.

3. Update the PDD source-of-truth so a future regeneration preserves
   the fix:
   - `prompts/sync_order_python.prompt`: document the new
     `compute_sccs(graph) -> List[List[str]]` requirement.
   - `prompts/agentic_sync_runner_python.prompt`: depend on
     `sync_order_python.prompt`; rewrite the dependency-scheduling
     requirement to describe SCC-aware soft intra-SCC edges, whole
     upstream SCC gating, serialized cycle execution, and condensation-
     based blocked classification.
   - `architecture.json`: add `compute_sccs` to sync_order's interface
     and `sync_order_python.prompt` to agentic_sync_runner's deps.
4. Tighten the `compute_sccs` ordering docstring — Tarjan emits
   deps-first under the project's `A depends on B` edge convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the previous architecture.json edit (which re-serialized the
whole file and escaped UTF-8 em-dashes to \u sequences while losing the
trailing newline), then re-apply only the three intended fields via
surgical edits:

- Add `compute_sccs` to `sync_order_python.prompt` interface.
- Add `sync_order_python.prompt` to `agentic_sync_runner_python.prompt`
  dependencies.
- Add `module_targets: Optional[Dict[str, str]] = None` to the
  `AsyncSyncRunner` signature (it was already in the prompt and the code,
  but the architecture interface metadata was stale).

Also extend the AsyncSyncRunner sideEffects entry to mention the new
SCC scheduling so the metadata reflects the runner's actual behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gltanaka gltanaka merged commit 6df9c2d into main May 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: pdd sync dependency cycles leave all modules pending

2 participants