Skip to content

feat: add public surface and test churn gates to pdd sync (#1012)#1015

Merged
gltanaka merged 22 commits into
mainfrom
change/issue-1012
May 23, 2026
Merged

feat: add public surface and test churn gates to pdd sync (#1012)#1015
gltanaka merged 22 commits into
mainfrom
change/issue-1012

Conversation

@prompt-driven-github
Copy link
Copy Markdown
Contributor

@prompt-driven-github prompt-driven-github Bot commented May 14, 2026

Summary

Extends the existing ArchitectureConformanceError repair-loop scaffolding with parallel PublicSurfaceRegressionError and TestChurnError gates so pdd sync fails fast with an actionable diagnostic instead of silently rewriting mature modules and discarding broad test coverage.

Closes #1012

Changes Made

Prompts Modified

  • pdd/prompts/code_generator_main_python.prompt — Adds snapshot/diff requirements (steps 5b, 5c), two new exception classes (PublicSurfaceRegressionError, TestChurnError), three stdlib-only helpers, the BREAKING-CHANGE: opt-out marker, and the PDD_TEST_CHURN_THRESHOLD env var (default 40%). First-time generation is exempt.
  • pdd/prompts/sync_orchestration_python.prompt — Extends the generate-op retry loop to catch the new errors, pins .pddrc/strength across retries, and pre-snapshots the existing module before regeneration.
  • pdd/prompts/agentic_sync_runner_python.prompt — Adds _parse_public_surface_failure and _parse_test_churn_failure plus two hard-failure formatters; generalises the stuck-symbol short-circuit.
  • pdd/prompts/sync_main_python.prompt — Extends the one-session Phase 1 retry loop and widens the two import-select lists to include the new error types.

Documentation Updated

  • README.md — Brief mention of the new compatibility gate and the BREAKING-CHANGE: opt-out.
  • architecture.json — Records the 9 new public symbols and 1 new env var (all additive; no removed symbols).

Strategy

Reuses the proven ArchitectureConformanceError plumbing verbatim: PDD_REPAIR_DIRECTIVE, MAX_CONFORMANCE_ATTEMPTS, the env-var protocol, and the hard-failure formatter pattern. Conservative defaults (40% churn threshold), explicit opt-out via BREAKING-CHANGE: in the prompt body, and stdlib-only helpers keep the risk profile low.

Acceptance Criteria Mapping

  • ✅ Compatibility/conformance gate comparing generated public symbols/signatures vs. pre-sync module — PublicSurfaceRegressionError.
  • ✅ Externally imported helpers remain present unless explicitly removed — snapshot/diff in step 5b catches removals; BREAKING-CHANGE: is the opt-out.
  • ✅ Generated test rewrites do not discard broad coverage — TestChurnError with PDD_TEST_CHURN_THRESHOLD.
  • ✅ Broad unrelated churn fails with actionable diagnostic — hard-failure formatters in agentic_sync_runner_python.prompt.

Review Checklist

  • Prompt syntax is valid
  • PDD conventions followed (reuses existing repair-loop plumbing)
  • architecture.json interface entries match the prompts
  • README mentions the new gate and opt-out

Next Steps After Merge

  1. Regenerate code from the modified prompts in dependency order:
    ./.pdd/sync_order_change.sh
    Or manually:
    pdd sync code_generator_main
    pdd sync agentic_sync_runner
    pdd sync one_session_sync
    pdd sync durable_sync_runner
    pdd sync prompts
    pdd sync sync_main
    pdd sync agentic_sync
    pdd sync checkup
    pdd sync ci_drift_heal
    pdd sync generate
    pdd sync maintenance
    pdd sync agentic_checkup
    pdd sync pin_example_hack
    pdd sync sync_orchestration
    
  2. Run pytest -vv tests/ to verify functionality.
  3. Verify the new gates trip on a deliberate regression in a sandbox module.

Created by pdd change workflow

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Serhan-Asad Serhan-Asad added pdd-issue PDD: autonomous issue solving pdd-sync PDD: sync prompts with code labels May 15, 2026
Serhan-Asad added a commit that referenced this pull request May 15, 2026
When a prior test-churn attempt failed, accumulate its cost (and model
fallback) and fold them into the successful retry's TestResult so the
caller charges every attempt against the sync budget. Mirrors the
generate-op and one-session-sync accumulation patterns (Codex PR #1015
high finding).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 15, 2026
…ollow-ups)

Three independent gate-correctness fixes called out by Codex iter-3 medium
review (#1015):

- `BREAKING-CHANGE:` opt-out parser now accepts identifier wrappers of
  matching back-tick, single-quote, OR double-quote (`"old_helper"`,
  `'Class.method'`). Mismatched quotes are still rejected so a typo can
  never silently whitelist a removal.
- `_snapshot_public_surface` and `_snapshot_public_signatures` recurse
  into public nested classes at all depths so a removed/signature-changed
  `Outer.Inner.method` is caught (previously only direct class methods
  participated, letting nested-class drift escape both diffs).
- `_get_test_churn_threshold` accepts `"40%"`/`"100%"` percent strings
  (strips the suffix, divides by 100, then clamps). Unparseable values
  still default to 0.40 but now log a warning so the silent default is
  visible to operators.

Prompt at `pdd/prompts/code_generator_main_python.prompt` updated to
match the new contract; new test coverage in
`tests/test_code_generator_main.py` exercises each path including
mismatched-wrapper rejection and triple-nested classes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad force-pushed the change/issue-1012 branch 3 times, most recently from 69a18d7 to 741a66c Compare May 16, 2026 16:55
Serhan-Asad added a commit that referenced this pull request May 16, 2026
…n gates

Addresses the 2 Blocker + 2 Important findings from codex review iter-1 of
PR #1015. All findings have a focused regression test added.

- Blocker 1 (binding kind): _snapshot_public_signatures now prefixes each
  class-body method signature with [instance]/[staticmethod]/[classmethod]/
  [property] so a method-binding flip (def f(self, v) -> @staticmethod def
  f(v)) is detected as a signature change even when the receiver-stripped
  params match. Previously Class.f(1) callers could break silently.

- Blocker 2 (YAML front matter): defensive hardening that strips a leading
  YAML --- ... --- front matter block before BREAKING-CHANGE: parsing, so
  an indented opt-out buried in metadata cannot whitelist surface removals
  or test-churn. Verified no shipped prompt currently uses front matter
  (pdd/prompts/ scan), so this guards against a future convention rather
  than fixing a current exploit.

- Important 1 (contract docs): tightens the docstring on
  _verify_code_surface_after_write and the matching README sentence to
  make explicit that crash/fix/verify surface failures are hard failures
  (no PDD_REPAIR_DIRECTIVE retry); the repair loop applies to generate
  and one-session paths only. Backed by a regression test that asserts
  the gate runs exactly once and returns the typed error without retry.

- Important 2 (retry budget): splits the one-session retry accounting so
  public-surface repair uses MAX_CONFORMANCE_ATTEMPTS (parity with the
  generate path) while test-churn keeps its tighter _MAX_TEST_CHURN_ATTEMPTS
  cap. Previously a single min() shared the smaller cap across both gates,
  cutting surface repair off at 2 attempts. Backed by two tests: one
  asserts surface attempts use the full budget, the other pins the churn
  cap at 2.

Test coverage:
- 10 new tests in tests/test_code_generator_main.py covering binding-kind
  detection (instance/staticmethod/classmethod/property flips) and YAML
  front-matter stripping (with unterminated-fence and empty-prompt edge
  cases).
- 2 new tests in tests/test_sync_orchestration.py pinning the hard-failure
  contract of _verify_code_surface_after_write.
- 2 new tests in tests/test_one_session_sync.py exercising the full
  MAX_CONFORMANCE_ATTEMPTS budget for surface repair and the
  _MAX_TEST_CHURN_ATTEMPTS cap for churn repair.

Local pytest run: 570 passed, 0 failed across
test_code_generator_main.py, test_sync_orchestration.py,
test_one_session_sync.py, test_cmd_test_main.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
…n gates

Address three new blockers introduced by the iter-1 follow-up commit on
PR #1015:

* `_strip_yaml_front_matter` now tolerates a leading UTF-8 BOM, CRLF /
  mixed line endings, trailing whitespace on the fence line, and a
  closing fence at EOF (no trailing newline). The previous regex only
  matched `---\n...---\n`, so a Windows-saved prompt could leave
  `BREAKING-CHANGE:` metadata visible to the directive parser and
  silently opt out of the gates.

* `_snapshot_public_signatures` now accumulates property accessor
  roles per class member before recording the final entry. Previously
  `@x.setter def x(self, value)` overwrote the `@property` getter in
  the signatures dict and was misclassified as `[instance] (value)`,
  so a rewrite that replaced the whole descriptor with a plain
  `def x(self, value)` produced an identical snapshot and bypassed
  the gate. The merged entry is now `[property:<sorted-roles>]
  <getter-signature>`, distinct from any plain-method snapshot.

* Top-level public symbols now carry a kind prefix (`[function]`,
  `[async_function]`, `[class]`) so swapping a public class with a
  same-named function (or a sync function with `async def`) is
  detected even when the receiver-stripped parameter list happens to
  match. The class self-entry inside `_walk_class` mirrors this with
  `[class]`, applying transitively to nested classes.

Adds 17 tests across `TestPublicSurfaceBindingKind` and
`TestFrontMatterStripping` covering each new edge case (property /
property+setter / setter-loss flips, class/function/async flips, CRLF /
BOM / EOF-terminated / trailing-whitespace fences, and end-to-end
verification that buried directives no longer opt out of the removal
gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
Address four findings from the external reviewer:

1. Blocker (pdd/one_session_sync.py): `except TestChurnError` now
   restores the pre-session code file in addition to the test file,
   mirroring the `PublicSurfaceRegressionError` handler's pattern. A
   failed one-session attempt previously left mutated code on disk
   even when the test rewrite was rejected, so subsequent `pdd sync`
   runs treated the mutated code as the baseline.

2. High (pdd/code_generator_main.py): `_format_python_signature` now
   emits a literal `/` token between positional-only and regular args
   so `def f(x, /, y)` and `def f(x, y)` produce DIFFERENT snapshots.
   Previously both forms snapshotted as `(x, y)` and the public-surface
   gate missed a real ABI break (kwarg callers succeed against the
   second form, fail against the first). `skip_first` strips from the
   posonly group when present, otherwise from `args.args`, so the
   marker insertion index stays correct.

3. Medium (pdd/prompts/sync_orchestration_python.prompt): the
   non-generate code-write surface gate bullet now says `break`-ing
   the operation loop (matching the actual code at
   sync_orchestration.py:~2825) and mirrors the inline rationale
   that surface regression is a contract violation, not a transient
   failure — `continue` would let `sync_determine_operation` re-select
   the same operation and spin indefinitely on deterministic bad
   output.

4. High (architecture.json): refreshed signature strings for
   `update_file_pair` (added `strength`/`temperature` params) and
   `commit_and_push` (now takes `Sequence[DriftInfo]` with defaults
   and the new `finalized_modules` parameter) to match the live
   function signatures.

Tests:
- TestOneSessionRollback::test_test_churn_restores_code_and_test
  pins the blocker fix: mock rewrites both code and test, asserts
  both files restored after TestChurnError propagates.
- TestPublicSurfaceBindingKind: four new tests for the posonly fix
  (added/removed/unchanged + a direct snapshot-string assertion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
Three edge cases reviewer probes found after the iter-3 fix landed:

* `_snapshot_public_signatures` now records top-level imports as
  `[import]` / `[import:<module>]` / `[import:from <module>:<source>]`
  entries (parity with the surface helper). Without this entry, a
  silent re-export break like `from pathlib import Path` -> `def
  Path(): pass` produced no snapshot diff because the surface set
  unchanged while the signatures dict treated the new function as a
  brand-new symbol.
* `_walk_class` now synthesises a constructor signature from class-
  level `AnnAssign` annotations when an `@dataclass` / `@dataclasses.
  dataclass` decorated class has no explicit `__init__`. Adding a
  required field (`name: str` -> `name: str; age: int`) now flips the
  snapshot. `ClassVar` / `InitVar` annotations are excluded; explicit
  `__init__` still takes precedence. Third-party `attrs` / `pydantic`
  decorators are out of scope (documented).
* `build_public_surface_hard_failure_from_error` and
  `build_test_churn_hard_failure_from_error` now include the
  `BREAKING-CHANGE:` opt-in instruction line required by item 9d of
  the runner prompt.

Tests cover the four import-flip variants, six dataclass scenarios,
and two hard-fail string format checks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
Addresses 4 review findings on commit 6523cfa:

- High: dataclass init synthesizer now honours @DataClass(kw_only=True),
  the in-body `_: KW_ONLY` sentinel, and InitVar[...] fields. The
  sentinel check runs before the underscore-prefix skip so the canonical
  marker is recognised; decorator wins over sentinel (single `*`); a
  trailing `*` without kw-only fields is stripped.
- Medium: fields declared with `field(init=False, ...)` (bare or
  `dataclasses.field` attribute form) are excluded from the synth,
  matching the stdlib runtime. `init=True`, missing `init` kwarg, and
  non-`field` calls still include the field.
- Medium: AsyncSyncRunner's internal `_build_public_surface_hard_failure`
  and `_build_test_churn_hard_failure` blocks now carry the
  BREAKING-CHANGE: opt-out instruction so end users see the directive
  on the subprocess hard-failure path (not just the importable
  builders).
- Medium: prompt and architecture.json describe the new dataclass synth
  rules (kw_only / KW_ONLY / InitVar included; ClassVar + init=False
  excluded; explicit __init__ wins) plus the kind-prefix invariant and
  import-binding formats so prompt-driven regeneration preserves them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
Two more dataclass edge cases probed by the external reviewer on
``93d7fe69d``:

1. ``@dataclass(init=False)`` was ignored. The synth unconditionally
   built a constructor signature from class-body ``AnnAssign`` nodes
   when a dataclass decorator was present, but ``init=False`` keeps
   the class's natural (zero-arg) ``__init__``. Now detect the kwarg
   on Call decorators (``@dataclass(init=False)`` and
   ``@dataclasses.dataclass(init=False)``) and emit ``()`` instead of
   walking the fields. Flipping ``init=False`` -> default trips the
   gate; adding fields under ``init=False`` does not.

2. Inherited dataclass fields were not merged into the synthesised
   constructor signature. Now walk same-module ``Name`` bases left-
   to-right depth-first with a cycle guard, prepending each
   ``@dataclass`` base's fields (and ITS bases' fields, recursively)
   in front of the derived class's own fields. Override semantics:
   when both base and derived declare the same field name, the
   derived's annotation/default replaces the base's, preserving the
   base's slot via insertion-order dict semantics — matching what
   ``@dataclass`` actually synthesises.

   Cross-module / attribute-form bases (``class User(pkg.Base)``)
   that we cannot resolve from this module's AST get an
   ``[inherited_unresolved]`` marker prepended to the synth signature
   so local field changes still diff while invisible base changes
   correctly slip through.

V1 limitations (documented in the helper docstring):

- A base decorated with ``@dataclass(init=False)`` still contributes
  declared fields to a derived ``@dataclass``'s synth at runtime, but
  we treat such bases as non-dataclass for the purposes of inherited-
  field merging.
- The ``KW_ONLY`` sentinel does not propagate across the inheritance
  boundary — each class re-emits its own kw-only context.

Tests in ``TestPublicSurfaceBindingKind``:

- ``test_dataclass_init_false_decorator_flip_to_default_is_detected``
- ``test_dataclass_init_false_decorator_keeps_zero_arg_signature``
- ``test_dataclasses_dataclass_init_false_form_works``
- ``test_dataclass_inherited_field_added_to_private_base_is_detected``
- ``test_dataclass_unchanged_inheritance_is_allowed``
- ``test_dataclass_unresolved_base_is_marked_uncertain``
- ``test_dataclass_multiple_inheritance_left_to_right``
- ``test_dataclass_explicit_init_still_takes_precedence_with_inheritance``

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 16, 2026
- Walk dataclass bases in reverse-MRO order so the synthesised
  ``__init__`` matches Python's ``@dataclass`` semantics
  (``__dataclass_fields__`` is populated by reversed iteration so
  later bases override earlier ones in dict insertion order). For
  ``class C(A, B)`` the synth becomes ``(b, a, c)``.
- Stop skipping ``@dataclass(init=False)`` bases: their fields STILL
  live in ``__dataclass_fields__`` and the derived class merges them.
  ``init=False`` only suppresses the base's OWN init synthesis.
- Update the existing multiple-inheritance test (renamed to
  ``test_dataclass_multiple_inheritance_mro_order``) and add diamond,
  single-inheritance regression, and init=False-base inheritance
  tests.
- Document reverse-MRO ordering and init=False inheritance in the
  prompt and architecture.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
- Mirror prompt deps into architecture.json for update_main (was
  2/9) and ci_drift_heal (was 7/9) so global sync can topologically
  order stale dependents correctly.
- Surface repair_directive (#1012 F-H) on the canonical signatures
  for cmd_test_main and run_agentic_test_generate in
  architecture.json, plus the cmd_test_main_python.prompt Inputs
  section and the agentic_test_generate_python.prompt Entry Point.
  Also fix run_agentic_test_generate's pre-existing stale 4-arg
  Tuple[bool, str, str] signature in architecture.json.
- Honor the anchored BREAKING-CHANGE prompt opt-out (parsed by
  _prompt_allows_test_churn) in the deletion-as-churn branches at
  cmd_test_main.py:230 and one_session_sync.py:609, closing the
  asymmetry where the prompt opt-out worked for wholesale rewrites
  but silently failed when the agent emptied/deleted the test
  file. Import the helper into both files and add regression tests
  in test_cmd_test_main.py and test_one_session_sync.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Address three findings from the external review of the public-surface
and test-churn gates:

1. P1 — `_is_test_output_path` now also matches `<name>_test.go` /
   `<name>_test.rb` / `<name>_spec.rb` so Go and Ruby sibling test
   files no longer bypass the churn gate when they live next to
   production code instead of under `tests/`.

2. P1 — `_snapshot_public_surface` skips `from __future__ import ...`
   directives. Removing the directive after a Python-version bump is
   harmless cleanup, but the gate previously flagged `annotations`
   (and any other future name) as a removed public symbol.

3. P2 — `_verify_public_surface_regression` expands BREAKING-CHANGE
   class-removal directives to cover every captured descendant. With
   `BREAKING-CHANGE: remove Service` the caller no longer has to
   enumerate `Service.run`, `Service.Inner.go`, etc. Signature-change
   directives stay strict — auto-expansion there would silently mask
   real regressions.

Additionally:
- sync_main retry message: report "Public surface regression" /
  "Test churn threshold exceeded" instead of mislabeling them as
  "Architecture conformance failed".
- README: document the full public-surface coverage (constants and
  re-exported imports) and the new class-removal descendant rule.
- CHANGELOG: restore the v0.0.238 Chore/Test subsections the original
  feat commit accidentally dropped.

Regression coverage added in TestSyncCompatibilityGates: Go/Ruby
sibling-file churn, `__future__` import ignored by the gate, and
`BREAKING-CHANGE: remove Class` covering nested descendants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Address three follow-up findings from the next review pass:

1. P1 — `_snapshot_public_signatures` also skips `from __future__
   import …`. Without this the surface helper would return early but
   the signatures dict still records `[import:from __future__]`, so a
   regenerated module that drops the directive and binds a real
   `annotations = …` value would falsely diff as a
   `[import:from __future__]` → `[assignment]` signature change.

2. P1 — `_is_test_output_path` now also matches PascalCase JVM-style
   suffixes: `Test.java`, `Tests.java`, `Test.kt`, `Tests.kt`,
   `Spec.kt`. The agentic test prompt already names `Test.java` as a
   recognised convention; without this match `src/test/java/FooTest.java`
   bypassed the churn gate. Case-sensitive on purpose so `latest.kt`,
   `manifest.java`, `request.scala` do not false-positive.

3. P1 — Prompt drift: `pdd/prompts/code_generator_main_python.prompt`
   now documents the `__future__` skip in BOTH `_snapshot_public_surface`
   and `_snapshot_public_signatures`, the full PascalCase / sibling-
   file test-suffix list in the test-churn gate description, and the
   `BREAKING-CHANGE: remove <Class>` descendant-expansion rule. A
   future `pdd sync code_generator_main` can now regenerate without
   reverting iter-9 + iter-10.

Nit:
- `run_agentic_test_generate` annotated return type was a stale
  `tuple[str, float, str, bool]` (and the docstring claimed a 4-tuple)
  while the actual return is `TestResult` (5 fields). Updated both.

Regression coverage added in TestSyncCompatibilityGates:
- `from __future__ import annotations` rebinding to a real
  `annotations = {...}` value (the exact reviewer-named scenario).
- `FooTest.java` and `WidgetSpec.kt` churn detection.
- `latest.kt` NOT classified as a test (PascalCase false-positive
  regression guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Three follow-up findings from the next review pass:

1. P1 — Expand `_is_test_output_path` to cover the remaining common
   test-file conventions the gate was skipping: Maven failsafe
   integration tests (`FooIT.java`), older JUnit/TestNG TestCase
   convention (`FooTestCase.java`), ScalaTest (`Test.scala`,
   `Tests.scala`, `Spec.scala`), Swift (`Tests.swift`), .NET xUnit/
   NUnit (`Test.cs`, `Tests.cs`), and Rust sibling tests
   (`_test.rs`). All PascalCase JVM/.NET/Swift entries stay
   case-sensitive so `latest.kt`/`manifest.java`/`request.scala`
   continue NOT to false-positive.

2. P1 — Close the alt-path test churn loophole: when
   `run_agentic_test_generate` falls back to reading an alternate
   path (e.g. `__tests__/foo.test.ts`) because the canonical
   `output_test_file` is absent, recover the alt path's pre-existing
   content via `git show HEAD:<rel>` and re-run the test-churn check
   against THAT baseline (not the empty canonical). On
   `TestChurnError`: attach `total_cost`/`model_name` and restore the
   pre-existing content to disk so the repair loop sees the canonical
   pre-sync state. Untracked / never-committed alt files harmlessly
   return empty and fall through to first-time-generation behavior.
   Limitation: when the agent writes to BOTH the canonical output
   AND an alt path, only the canonical churn is gated; the alt-path
   delta in that dual-write case is still unguarded. The reviewer's
   named scenario (canonical absent, alt rewritten) is fully covered.

3. P1 — `_collect_bound_module_names` also skips `from __future__
   import …` so `__all__ = ["annotations"]` no longer promotes the
   future binding into the public surface via the `__all__`-
   authoritative branch. Without this skip a clean-up that drops
   `from __future__ import annotations` AND the matching `__all__`
   entry would falsely diff as `removed: annotations`.

Prompt sync:
- `code_generator_main_python.prompt`: full extended test-suffix list
  in the test-churn gate description, and the parallel `__future__`
  skip in `_collect_bound_module_names` documented alongside the
  earlier skip notes for `_snapshot_public_surface` and
  `_snapshot_public_signatures`.
- `agentic_test_generate_python.prompt`: documents the alt-path
  git-recovery churn check, the `_read_prior_content_from_git`
  helper contract, and the restore-on-failure requirement.

Regression coverage:
- `FooIT.java`, `FooTestCase.java`, `FooSpec.scala`, and
  `widget_test.rs` churn detection.
- `__all__ = ["annotations"]` + `from __future__ import annotations`
  removal MUST NOT trip the gate.
- `run_agentic_test_generate` alt-path rewrite with a tracked prior
  (mocked `_read_prior_content_from_git` returning 20 tests) raises
  `TestChurnError` AND restores the prior content; untracked prior
  (mocked empty) falls through with no raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
…n gates

Addresses the 2 Blocker + 2 Important findings from codex review iter-1 of
PR #1015. All findings have a focused regression test added.

- Blocker 1 (binding kind): _snapshot_public_signatures now prefixes each
  class-body method signature with [instance]/[staticmethod]/[classmethod]/
  [property] so a method-binding flip (def f(self, v) -> @staticmethod def
  f(v)) is detected as a signature change even when the receiver-stripped
  params match. Previously Class.f(1) callers could break silently.

- Blocker 2 (YAML front matter): defensive hardening that strips a leading
  YAML --- ... --- front matter block before BREAKING-CHANGE: parsing, so
  an indented opt-out buried in metadata cannot whitelist surface removals
  or test-churn. Verified no shipped prompt currently uses front matter
  (pdd/prompts/ scan), so this guards against a future convention rather
  than fixing a current exploit.

- Important 1 (contract docs): tightens the docstring on
  _verify_code_surface_after_write and the matching README sentence to
  make explicit that crash/fix/verify surface failures are hard failures
  (no PDD_REPAIR_DIRECTIVE retry); the repair loop applies to generate
  and one-session paths only. Backed by a regression test that asserts
  the gate runs exactly once and returns the typed error without retry.

- Important 2 (retry budget): splits the one-session retry accounting so
  public-surface repair uses MAX_CONFORMANCE_ATTEMPTS (parity with the
  generate path) while test-churn keeps its tighter _MAX_TEST_CHURN_ATTEMPTS
  cap. Previously a single min() shared the smaller cap across both gates,
  cutting surface repair off at 2 attempts. Backed by two tests: one
  asserts surface attempts use the full budget, the other pins the churn
  cap at 2.

Test coverage:
- 10 new tests in tests/test_code_generator_main.py covering binding-kind
  detection (instance/staticmethod/classmethod/property flips) and YAML
  front-matter stripping (with unterminated-fence and empty-prompt edge
  cases).
- 2 new tests in tests/test_sync_orchestration.py pinning the hard-failure
  contract of _verify_code_surface_after_write.
- 2 new tests in tests/test_one_session_sync.py exercising the full
  MAX_CONFORMANCE_ATTEMPTS budget for surface repair and the
  _MAX_TEST_CHURN_ATTEMPTS cap for churn repair.

Local pytest run: 570 passed, 0 failed across
test_code_generator_main.py, test_sync_orchestration.py,
test_one_session_sync.py, test_cmd_test_main.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
…n gates

Address three new blockers introduced by the iter-1 follow-up commit on
PR #1015:

* `_strip_yaml_front_matter` now tolerates a leading UTF-8 BOM, CRLF /
  mixed line endings, trailing whitespace on the fence line, and a
  closing fence at EOF (no trailing newline). The previous regex only
  matched `---\n...---\n`, so a Windows-saved prompt could leave
  `BREAKING-CHANGE:` metadata visible to the directive parser and
  silently opt out of the gates.

* `_snapshot_public_signatures` now accumulates property accessor
  roles per class member before recording the final entry. Previously
  `@x.setter def x(self, value)` overwrote the `@property` getter in
  the signatures dict and was misclassified as `[instance] (value)`,
  so a rewrite that replaced the whole descriptor with a plain
  `def x(self, value)` produced an identical snapshot and bypassed
  the gate. The merged entry is now `[property:<sorted-roles>]
  <getter-signature>`, distinct from any plain-method snapshot.

* Top-level public symbols now carry a kind prefix (`[function]`,
  `[async_function]`, `[class]`) so swapping a public class with a
  same-named function (or a sync function with `async def`) is
  detected even when the receiver-stripped parameter list happens to
  match. The class self-entry inside `_walk_class` mirrors this with
  `[class]`, applying transitively to nested classes.

Adds 17 tests across `TestPublicSurfaceBindingKind` and
`TestFrontMatterStripping` covering each new edge case (property /
property+setter / setter-loss flips, class/function/async flips, CRLF /
BOM / EOF-terminated / trailing-whitespace fences, and end-to-end
verification that buried directives no longer opt out of the removal
gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Address four findings from the external reviewer:

1. Blocker (pdd/one_session_sync.py): `except TestChurnError` now
   restores the pre-session code file in addition to the test file,
   mirroring the `PublicSurfaceRegressionError` handler's pattern. A
   failed one-session attempt previously left mutated code on disk
   even when the test rewrite was rejected, so subsequent `pdd sync`
   runs treated the mutated code as the baseline.

2. High (pdd/code_generator_main.py): `_format_python_signature` now
   emits a literal `/` token between positional-only and regular args
   so `def f(x, /, y)` and `def f(x, y)` produce DIFFERENT snapshots.
   Previously both forms snapshotted as `(x, y)` and the public-surface
   gate missed a real ABI break (kwarg callers succeed against the
   second form, fail against the first). `skip_first` strips from the
   posonly group when present, otherwise from `args.args`, so the
   marker insertion index stays correct.

3. Medium (pdd/prompts/sync_orchestration_python.prompt): the
   non-generate code-write surface gate bullet now says `break`-ing
   the operation loop (matching the actual code at
   sync_orchestration.py:~2825) and mirrors the inline rationale
   that surface regression is a contract violation, not a transient
   failure — `continue` would let `sync_determine_operation` re-select
   the same operation and spin indefinitely on deterministic bad
   output.

4. High (architecture.json): refreshed signature strings for
   `update_file_pair` (added `strength`/`temperature` params) and
   `commit_and_push` (now takes `Sequence[DriftInfo]` with defaults
   and the new `finalized_modules` parameter) to match the live
   function signatures.

Tests:
- TestOneSessionRollback::test_test_churn_restores_code_and_test
  pins the blocker fix: mock rewrites both code and test, asserts
  both files restored after TestChurnError propagates.
- TestPublicSurfaceBindingKind: four new tests for the posonly fix
  (added/removed/unchanged + a direct snapshot-string assertion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Three edge cases reviewer probes found after the iter-3 fix landed:

* `_snapshot_public_signatures` now records top-level imports as
  `[import]` / `[import:<module>]` / `[import:from <module>:<source>]`
  entries (parity with the surface helper). Without this entry, a
  silent re-export break like `from pathlib import Path` -> `def
  Path(): pass` produced no snapshot diff because the surface set
  unchanged while the signatures dict treated the new function as a
  brand-new symbol.
* `_walk_class` now synthesises a constructor signature from class-
  level `AnnAssign` annotations when an `@dataclass` / `@dataclasses.
  dataclass` decorated class has no explicit `__init__`. Adding a
  required field (`name: str` -> `name: str; age: int`) now flips the
  snapshot. `ClassVar` / `InitVar` annotations are excluded; explicit
  `__init__` still takes precedence. Third-party `attrs` / `pydantic`
  decorators are out of scope (documented).
* `build_public_surface_hard_failure_from_error` and
  `build_test_churn_hard_failure_from_error` now include the
  `BREAKING-CHANGE:` opt-in instruction line required by item 9d of
  the runner prompt.

Tests cover the four import-flip variants, six dataclass scenarios,
and two hard-fail string format checks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Addresses 4 review findings on commit 6523cfa:

- High: dataclass init synthesizer now honours @DataClass(kw_only=True),
  the in-body `_: KW_ONLY` sentinel, and InitVar[...] fields. The
  sentinel check runs before the underscore-prefix skip so the canonical
  marker is recognised; decorator wins over sentinel (single `*`); a
  trailing `*` without kw-only fields is stripped.
- Medium: fields declared with `field(init=False, ...)` (bare or
  `dataclasses.field` attribute form) are excluded from the synth,
  matching the stdlib runtime. `init=True`, missing `init` kwarg, and
  non-`field` calls still include the field.
- Medium: AsyncSyncRunner's internal `_build_public_surface_hard_failure`
  and `_build_test_churn_hard_failure` blocks now carry the
  BREAKING-CHANGE: opt-out instruction so end users see the directive
  on the subprocess hard-failure path (not just the importable
  builders).
- Medium: prompt and architecture.json describe the new dataclass synth
  rules (kw_only / KW_ONLY / InitVar included; ClassVar + init=False
  excluded; explicit __init__ wins) plus the kind-prefix invariant and
  import-binding formats so prompt-driven regeneration preserves them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
Two more dataclass edge cases probed by the external reviewer on
``93d7fe69d``:

1. ``@dataclass(init=False)`` was ignored. The synth unconditionally
   built a constructor signature from class-body ``AnnAssign`` nodes
   when a dataclass decorator was present, but ``init=False`` keeps
   the class's natural (zero-arg) ``__init__``. Now detect the kwarg
   on Call decorators (``@dataclass(init=False)`` and
   ``@dataclasses.dataclass(init=False)``) and emit ``()`` instead of
   walking the fields. Flipping ``init=False`` -> default trips the
   gate; adding fields under ``init=False`` does not.

2. Inherited dataclass fields were not merged into the synthesised
   constructor signature. Now walk same-module ``Name`` bases left-
   to-right depth-first with a cycle guard, prepending each
   ``@dataclass`` base's fields (and ITS bases' fields, recursively)
   in front of the derived class's own fields. Override semantics:
   when both base and derived declare the same field name, the
   derived's annotation/default replaces the base's, preserving the
   base's slot via insertion-order dict semantics — matching what
   ``@dataclass`` actually synthesises.

   Cross-module / attribute-form bases (``class User(pkg.Base)``)
   that we cannot resolve from this module's AST get an
   ``[inherited_unresolved]`` marker prepended to the synth signature
   so local field changes still diff while invisible base changes
   correctly slip through.

V1 limitations (documented in the helper docstring):

- A base decorated with ``@dataclass(init=False)`` still contributes
  declared fields to a derived ``@dataclass``'s synth at runtime, but
  we treat such bases as non-dataclass for the purposes of inherited-
  field merging.
- The ``KW_ONLY`` sentinel does not propagate across the inheritance
  boundary — each class re-emits its own kw-only context.

Tests in ``TestPublicSurfaceBindingKind``:

- ``test_dataclass_init_false_decorator_flip_to_default_is_detected``
- ``test_dataclass_init_false_decorator_keeps_zero_arg_signature``
- ``test_dataclasses_dataclass_init_false_form_works``
- ``test_dataclass_inherited_field_added_to_private_base_is_detected``
- ``test_dataclass_unchanged_inheritance_is_allowed``
- ``test_dataclass_unresolved_base_is_marked_uncertain``
- ``test_dataclass_multiple_inheritance_left_to_right``
- ``test_dataclass_explicit_init_still_takes_precedence_with_inheritance``

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Serhan-Asad added a commit that referenced this pull request May 18, 2026
- Walk dataclass bases in reverse-MRO order so the synthesised
  ``__init__`` matches Python's ``@dataclass`` semantics
  (``__dataclass_fields__`` is populated by reversed iteration so
  later bases override earlier ones in dict insertion order). For
  ``class C(A, B)`` the synth becomes ``(b, a, c)``.
- Stop skipping ``@dataclass(init=False)`` bases: their fields STILL
  live in ``__dataclass_fields__`` and the derived class merges them.
  ``init=False`` only suppresses the base's OWN init synthesis.
- Update the existing multiple-inheritance test (renamed to
  ``test_dataclass_multiple_inheritance_mro_order``) and add diamond,
  single-inheritance regression, and init=False-base inheritance
  tests.
- Document reverse-MRO ordering and init=False inheritance in the
  prompt and architecture.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@gltanaka
Copy link
Copy Markdown
Contributor

Review verdict: needed, but not merge-ready yet.

The underlying issue in #1012 is real: the existing architecture conformance gate only checks declared architecture/interface symbols and does not protect incumbent public helpers or broad existing test coverage. A compatibility gate in this area is warranted.

Blocking required change: the agentic test-churn gate still misses existing alternate-path test rewrites whenever the canonical output file already exists. In pdd/agentic_test_generate.py, generated_content is read from output_test_file first, and the alternate-path scan only runs under if not generated_content and changed_files (around lines 336-391). So if tests/widget.test.ts already exists and the agent rewrites a pre-existing __tests__/widget.test.ts from 20 tests down to 1 while leaving the canonical file intact, run_agentic_test_generate returns success and no TestChurnError is raised. cmd_test_main then only verifies/writes the canonical path (around lines 171-212), leaving the high-churn alt-path rewrite on disk.

I reproduced this locally against the PR head with:

  • pre-existing canonical tests/widget.test.ts unchanged
  • pre-existing alt-path __tests__/widget.test.ts with 20 tests
  • mocked agentic run rewriting only the alt-path file to 1 test and returning success

Observed result: run_agentic_test_generate returned agentic_success=True, and the alt-path file remained reduced to 1 line. This is a false negative for the PR's stated test-churn guarantee.

Required changes before merge:

  1. Snapshot all existing test-like files before agentic runs and, after the run, compare every changed pre-existing test-like file against its snapshot, regardless of whether the canonical output file exists or has content.
  2. Restore every rejected changed test file to its pre-run content before raising TestChurnError.
  3. Add regression coverage for the canonical-present + alt-path-existing case, e.g. canonical test file unchanged while __tests__/widget.test.ts shrinks from many tests to one. Please cover the analogous one-session path too if that agent can touch multiple test files.

The focused PR tests I ran are green (tests/test_agentic_test_generate.py, tests/test_cmd_test_main.py -k "churn or agentic", and tests/test_one_session_sync.py -k "churn or surface"), but they do not cover this false-negative scenario.

Serhan-Asad pushed a commit that referenced this pull request May 22, 2026
…e (Greg review #1015)

Greg's review of PR #1015 surfaced a false negative in the
agentic test-churn gate: when the canonical output file exists with
content AND the agent rewrites a separate pre-existing test-like file
(e.g. shrinks `__tests__/widget.test.ts` from 20 tests to 1) while
leaving the canonical untouched, no `TestChurnError` was raised.

Root cause — `pdd/agentic_test_generate.py`:
The alt-path branch (`if not generated_content and changed_files`)
only ran a churn check on alt-path files when the canonical was
EMPTY. With canonical present and unchanged, the alt-path scan was
skipped entirely; `cmd_test_main`'s outer churn check then compared
the (unchanged) canonical against itself and trivially passed.

Root cause — `pdd/one_session_sync.py`:
The churn gate compared only `test_path` (canonical). The one-session
agent runs under a mega-prompt with no scope constraint and can touch
any test-like file in the project, so the same false negative applied.

Fix:

1. `pdd/agentic_test_generate.py`: replace the alt-path-only churn
   check with a broad sweep over every changed test-like file that
   existed pre-run (snapshot already taken via
   `_snapshot_pre_test_contents`). Restore EVERY pre-run snapshot
   before raising so the retry loop's next attempt re-snapshots from
   the true pre-run baseline — restoring only the violating file
   would let attempt-N rewrites become attempt-N+1's new baseline and
   permanently weaken the gate.

2. `pdd/one_session_sync.py`: add a pre-session snapshot of every
   test-like file in the project (lives outside the retry loop, single
   source of truth across attempts). Inside the existing churn gate
   try-block, sweep changed test-like files against the snapshot
   before running the canonical-only check. Skip the canonical path
   in the sweep so the canonical's deletion-as-churn shortcut and
   repair directive remain intact. The opt-out env flags and prompt-
   side `BREAKING-CHANGE: rewrite tests` were moved above the
   canonical-empty early-return so they apply to the alt-path sweep
   too. The except handler now restores every snapshot in addition to
   the canonical + code paths.

Test coverage:
- `tests/test_agentic_test_generate.py`:
  - `test_canonical_present_alt_path_rewrite_raises_churn_error`
    (Greg's repro: canonical unchanged, alt-path 20→1)
  - `test_canonical_present_multiple_alt_paths_all_restored_on_failure`
    (restore-all invariant)
- `tests/test_one_session_sync.py::TestOneSessionRollback`:
  - `test_alt_path_rewrite_with_canonical_unchanged_raises_churn_error`
    (analogous one-session repro)

Both new test bodies were confirmed to FAIL against the pre-fix
code (via `git stash` of the patched module) and PASS against the
patched code. Existing 12-scenario gate matrix remains 12/12 PASS.

Closes Greg-review blocking concern on PR #1015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Serhan-Asad
Copy link
Copy Markdown
Collaborator

Addresses @gltanaka review — iter-15 (commit 7c8b3bb22)

Pushed 7c8b3bb22 on change/issue-1012. Closes the multi-file churn false negative you described. Summary of what changed and how it was verified.

Root cause — confirmed

In pdd/agentic_test_generate.py, the alt-path churn block at lines 365-421 ran only under if not generated_content and changed_files: — i.e., the canonical-empty branch. When the canonical output file existed with content AND the agent rewrote a separate pre-existing test-like file (your __tests__/widget.test.ts 20→1 repro), alt_path_used stayed None, the alt-path churn re-check was skipped, and cmd_test_main's outer churn check then compared the (unchanged) canonical against itself and trivially passed. False negative.

pdd/one_session_sync.py had the analogous gap: the gate compared only test_path (canonical), but the one-session agent runs under a mega-prompt with no scope constraint and can touch any test-like file. Same false negative when the agent rewrites an alt-path while leaving canonical alone.

Fix — both files

pdd/agentic_test_generate.py — replace the alt-path-only check with a broad sweep over EVERY changed test-like file that existed pre-run (the working-tree snapshot was already being taken via _snapshot_pre_test_contents; we just weren't using it broadly enough). On the first churn violation, restore every pre-run snapshot (not just the violating file) before raising — so the repair loop's next attempt re-snapshots from the true pre-run baseline. Restoring only the violator would let attempt-N rewrites become attempt-N+1's new baseline and permanently weaken the gate.

pdd/one_session_sync.py — add a pre-session _snapshot_pre_test_contents call outside the retry loop (single source of truth across attempts). Inside the existing churn-gate try block, sweep changed test-like files against the snapshot before running the canonical-only check. Skip the canonical path in the sweep so its deletion-as-churn shortcut + repair directive remain canonical-specific. The opt-out env flags + prompt-side BREAKING-CHANGE: rewrite tests parser were moved above the canonical-empty early-return so they apply to the alt-path sweep too. The except TestChurnError handler now restores every snapshotted file in addition to the canonical + code paths.

Test coverage — all three of your asks

1. Canonical-present + alt-path-existing on the agentic path:

  • tests/test_agentic_test_generate.py::test_canonical_present_alt_path_rewrite_raises_churn_error — exact repro of your scenario (canonical unchanged, alt-path 20→1). Asserts TestChurnError raised, alt-path restored to pre-run snapshot, canonical untouched, cost/model attached.

2. Multi-violator restore-all invariant:

  • tests/test_agentic_test_generate.py::test_canonical_present_multiple_alt_paths_all_restored_on_failure — agent rewrites two alt-path files in one run. Asserts BOTH are restored (not just the one whose violation raised), so the next attempt's snapshot reflects the true pre-run state.

3. Analogous one-session path:

Confirmed test-fails-without-fix

For each new regression test, I verified the test FAILS against the pre-fix code by:

  1. git stash push pdd/<module>.py (revert just the fixed module)
  2. Run the new test → DID NOT RAISE <class 'TestChurnError'>
  3. git stash pop → re-apply fix
  4. Run the new test → PASS

So the new tests genuinely close your reported false negative; they aren't testing what they assert on a no-op patch.

Wider verification

Layer Pre-fix Post-fix
8 PR-touched test files (focused gate suites) 983 pass, 1 skip 986 pass, 1 skip (+3 new regression tests)
12-scenario gate matrix (scenario_matrix.py) 12/12 12/12
All existing test_one_session_sync.py cases 81 pass 81 pass (no behavior drift on canonical-only paths)

CI + cloud-batch

GitHub CI re-triggered automatically on the push. Cloud Batch re-submitted (run-20260522-152907-7c8b3bb22, currently uploading source). I'll post the cloud-batch result once it lands; given iter-13 + iter-14 already passed 77/77 and this commit is additive (no existing test deletions, no behavior changes on the canonical-only path), I'd expect the same outcome — but verifying.

Open items intentionally not in this commit

  • Deletion-as-churn for alt-paths. Your repro is a rewrite; this commit doesn't extend deletion-as-churn to alt-paths (the canonical deletion case is still handled). If you'd like that covered, easiest follow-up is to add a sibling sweep over pre_test_contents keys that no longer exist post-session. Happy to add it on request — kept scope tight per your "the canonical-present + alt-path-existing case" framing.
  • PDD_TEST_CHURN_THRESHOLD=0.40 default. Unchanged this commit; carry-over flag from prior verification rounds. Independent of the false-negative fix.

Serhan-Asad and others added 20 commits May 22, 2026 15:45
Adds two repair-loop gates to prevent pdd sync from silently removing
mature module APIs/tests when regenerating code.

**PublicSurfaceRegressionError** — for mature Python modules (non-empty
pre-generation code file), snapshots the public surface via stdlib `ast`
before writing the new generation and raises when symbols are removed
or signatures change. Captured surface includes:
- Top-level functions and classes (recursive nested classes/methods at
  every depth — `Outer.Inner.method`).
- Module-level public Assign / AnnAssign-with-value / Import /
  ImportFrom targets (so `import git` removal raises).
- `__all__` is honored as authoritative when declared (last-assignment
  wins; computed forms fall back to the underscore-prefix heuristic;
  classes listed in `__all__` have their members recursively walked).
- Implicit `__init__` is tracked so adding a required-arg constructor
  on a previously zero-arg class raises.

**TestChurnError** — when overwriting an existing non-empty test file
through code_generator_main, cmd_test_main, or one_session_sync, computes
the stdlib `difflib.unified_diff` churn ratio between pre and post test
bodies and raises when the ratio exceeds `PDD_TEST_CHURN_THRESHOLD`
(default 0.40; trailing `%` accepted; invalid values log a warning and
default). Deletion of a pre-existing test file is treated as maximal
churn (ratio=1.0).

Both gates accept anchored `BREAKING-CHANGE:` directives in the prompt
body. Lines must start at column 0 (after optional whitespace), with a
verb (`remove`/`rename`/`change signature`/`rewrite tests`/...) and a
comma-separated symbol list. The verb's object must adjoin the relevant
target (e.g. `rewriting tests` opts out churn; `rewrite docs and update
tests` does not). Bare markers, prose-only mentions, and mid-line
references do not opt out.

Both errors carry a `repair_directive` and route through the existing
`PDD_REPAIR_DIRECTIVE` plumbing. All four retry helpers
(`code_generator_main`, `cmd_test_main`, `_run_test_op_with_churn_retry`,
`run_one_session_sync`) follow the same contract:
- Pop `PDD_REPAIR_DIRECTIVE` before attempt 1 so stale outer env doesn't
  contaminate the first try.
- Loop-local directive variable holds the in-loop state; only set after
  catching a typed repair exception raised by THIS loop.
- `cmd_test_main` accepts an explicit `repair_directive` keyword
  parameter so retry helpers pass it directly rather than via env.
- The env-var write is preserved for cross-process inheritance into
  re-entrant PDD CLI subprocesses spawned by agentic generation.
- `finally` block restores the prior env value (pop if originally
  unset).

The gate fires on every code-write path in pdd sync:
- In-process generate (`code_generator_main`).
- Per-op orchestration (`crash_main`, `fix_verification_main`,
  `fix_main`) via `_verify_code_surface_after_write` — on regression,
  cost is charged and the operation loop breaks (no infinite loop on
  repeatable verify regressions).
- One-session agentic path (`run_one_session_sync`) — surface check runs
  before churn check on each attempt; both files (code and test) are
  restored on exhaustion.

826 tests pass across the changed module set
(`test_code_generator_main`, `test_cmd_test_main`,
`test_sync_orchestration`, `test_sync_main`, `test_one_session_sync`,
`test_agentic_sync_runner`, `commands/test_maintenance`).

Closes #1012
Closes #1030

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n gates

Addresses the 2 Blocker + 2 Important findings from codex review iter-1 of
PR #1015. All findings have a focused regression test added.

- Blocker 1 (binding kind): _snapshot_public_signatures now prefixes each
  class-body method signature with [instance]/[staticmethod]/[classmethod]/
  [property] so a method-binding flip (def f(self, v) -> @staticmethod def
  f(v)) is detected as a signature change even when the receiver-stripped
  params match. Previously Class.f(1) callers could break silently.

- Blocker 2 (YAML front matter): defensive hardening that strips a leading
  YAML --- ... --- front matter block before BREAKING-CHANGE: parsing, so
  an indented opt-out buried in metadata cannot whitelist surface removals
  or test-churn. Verified no shipped prompt currently uses front matter
  (pdd/prompts/ scan), so this guards against a future convention rather
  than fixing a current exploit.

- Important 1 (contract docs): tightens the docstring on
  _verify_code_surface_after_write and the matching README sentence to
  make explicit that crash/fix/verify surface failures are hard failures
  (no PDD_REPAIR_DIRECTIVE retry); the repair loop applies to generate
  and one-session paths only. Backed by a regression test that asserts
  the gate runs exactly once and returns the typed error without retry.

- Important 2 (retry budget): splits the one-session retry accounting so
  public-surface repair uses MAX_CONFORMANCE_ATTEMPTS (parity with the
  generate path) while test-churn keeps its tighter _MAX_TEST_CHURN_ATTEMPTS
  cap. Previously a single min() shared the smaller cap across both gates,
  cutting surface repair off at 2 attempts. Backed by two tests: one
  asserts surface attempts use the full budget, the other pins the churn
  cap at 2.

Test coverage:
- 10 new tests in tests/test_code_generator_main.py covering binding-kind
  detection (instance/staticmethod/classmethod/property flips) and YAML
  front-matter stripping (with unterminated-fence and empty-prompt edge
  cases).
- 2 new tests in tests/test_sync_orchestration.py pinning the hard-failure
  contract of _verify_code_surface_after_write.
- 2 new tests in tests/test_one_session_sync.py exercising the full
  MAX_CONFORMANCE_ATTEMPTS budget for surface repair and the
  _MAX_TEST_CHURN_ATTEMPTS cap for churn repair.

Local pytest run: 570 passed, 0 failed across
test_code_generator_main.py, test_sync_orchestration.py,
test_one_session_sync.py, test_cmd_test_main.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n gates

Address three new blockers introduced by the iter-1 follow-up commit on
PR #1015:

* `_strip_yaml_front_matter` now tolerates a leading UTF-8 BOM, CRLF /
  mixed line endings, trailing whitespace on the fence line, and a
  closing fence at EOF (no trailing newline). The previous regex only
  matched `---\n...---\n`, so a Windows-saved prompt could leave
  `BREAKING-CHANGE:` metadata visible to the directive parser and
  silently opt out of the gates.

* `_snapshot_public_signatures` now accumulates property accessor
  roles per class member before recording the final entry. Previously
  `@x.setter def x(self, value)` overwrote the `@property` getter in
  the signatures dict and was misclassified as `[instance] (value)`,
  so a rewrite that replaced the whole descriptor with a plain
  `def x(self, value)` produced an identical snapshot and bypassed
  the gate. The merged entry is now `[property:<sorted-roles>]
  <getter-signature>`, distinct from any plain-method snapshot.

* Top-level public symbols now carry a kind prefix (`[function]`,
  `[async_function]`, `[class]`) so swapping a public class with a
  same-named function (or a sync function with `async def`) is
  detected even when the receiver-stripped parameter list happens to
  match. The class self-entry inside `_walk_class` mirrors this with
  `[class]`, applying transitively to nested classes.

Adds 17 tests across `TestPublicSurfaceBindingKind` and
`TestFrontMatterStripping` covering each new edge case (property /
property+setter / setter-loss flips, class/function/async flips, CRLF /
BOM / EOF-terminated / trailing-whitespace fences, and end-to-end
verification that buried directives no longer opt out of the removal
gate).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Top-level `assignment ↔ def/class` kind flips were asymmetric: the
iter-2 fallback caught `def/class → assignment` (signature dropped,
surface kept), but `assignment → def/class` left the before-side
without a signature entry — the regenerated `def Foo()` looked like a
new symbol, not a kind flip. `_snapshot_public_signatures` now also
records public top-level `Assign` / `AnnAssign-with-value` targets as
`[assignment]`, honoring `__all__` via the existing
`_is_public_top_level` predicate. Both flip directions hit the primary
`changed_set` diff path.

Adds 4 regression tests in `TestPublicSurfaceBindingKind` covering
both flip directions and the unchanged-assignment sanity check, plus
an inline note in `_walk_class` acknowledging the known
class-body-redefinition limitation flagged by codex as a debatable nit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address four findings from the external reviewer:

1. Blocker (pdd/one_session_sync.py): `except TestChurnError` now
   restores the pre-session code file in addition to the test file,
   mirroring the `PublicSurfaceRegressionError` handler's pattern. A
   failed one-session attempt previously left mutated code on disk
   even when the test rewrite was rejected, so subsequent `pdd sync`
   runs treated the mutated code as the baseline.

2. High (pdd/code_generator_main.py): `_format_python_signature` now
   emits a literal `/` token between positional-only and regular args
   so `def f(x, /, y)` and `def f(x, y)` produce DIFFERENT snapshots.
   Previously both forms snapshotted as `(x, y)` and the public-surface
   gate missed a real ABI break (kwarg callers succeed against the
   second form, fail against the first). `skip_first` strips from the
   posonly group when present, otherwise from `args.args`, so the
   marker insertion index stays correct.

3. Medium (pdd/prompts/sync_orchestration_python.prompt): the
   non-generate code-write surface gate bullet now says `break`-ing
   the operation loop (matching the actual code at
   sync_orchestration.py:~2825) and mirrors the inline rationale
   that surface regression is a contract violation, not a transient
   failure — `continue` would let `sync_determine_operation` re-select
   the same operation and spin indefinitely on deterministic bad
   output.

4. High (architecture.json): refreshed signature strings for
   `update_file_pair` (added `strength`/`temperature` params) and
   `commit_and_push` (now takes `Sequence[DriftInfo]` with defaults
   and the new `finalized_modules` parameter) to match the live
   function signatures.

Tests:
- TestOneSessionRollback::test_test_churn_restores_code_and_test
  pins the blocker fix: mock rewrites both code and test, asserts
  both files restored after TestChurnError propagates.
- TestPublicSurfaceBindingKind: four new tests for the posonly fix
  (added/removed/unchanged + a direct snapshot-string assertion).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three edge cases reviewer probes found after the iter-3 fix landed:

* `_snapshot_public_signatures` now records top-level imports as
  `[import]` / `[import:<module>]` / `[import:from <module>:<source>]`
  entries (parity with the surface helper). Without this entry, a
  silent re-export break like `from pathlib import Path` -> `def
  Path(): pass` produced no snapshot diff because the surface set
  unchanged while the signatures dict treated the new function as a
  brand-new symbol.
* `_walk_class` now synthesises a constructor signature from class-
  level `AnnAssign` annotations when an `@dataclass` / `@dataclasses.
  dataclass` decorated class has no explicit `__init__`. Adding a
  required field (`name: str` -> `name: str; age: int`) now flips the
  snapshot. `ClassVar` / `InitVar` annotations are excluded; explicit
  `__init__` still takes precedence. Third-party `attrs` / `pydantic`
  decorators are out of scope (documented).
* `build_public_surface_hard_failure_from_error` and
  `build_test_churn_hard_failure_from_error` now include the
  `BREAKING-CHANGE:` opt-in instruction line required by item 9d of
  the runner prompt.

Tests cover the four import-flip variants, six dataclass scenarios,
and two hard-fail string format checks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses 4 review findings on commit 6523cfa:

- High: dataclass init synthesizer now honours @DataClass(kw_only=True),
  the in-body `_: KW_ONLY` sentinel, and InitVar[...] fields. The
  sentinel check runs before the underscore-prefix skip so the canonical
  marker is recognised; decorator wins over sentinel (single `*`); a
  trailing `*` without kw-only fields is stripped.
- Medium: fields declared with `field(init=False, ...)` (bare or
  `dataclasses.field` attribute form) are excluded from the synth,
  matching the stdlib runtime. `init=True`, missing `init` kwarg, and
  non-`field` calls still include the field.
- Medium: AsyncSyncRunner's internal `_build_public_surface_hard_failure`
  and `_build_test_churn_hard_failure` blocks now carry the
  BREAKING-CHANGE: opt-out instruction so end users see the directive
  on the subprocess hard-failure path (not just the importable
  builders).
- Medium: prompt and architecture.json describe the new dataclass synth
  rules (kw_only / KW_ONLY / InitVar included; ClassVar + init=False
  excluded; explicit __init__ wins) plus the kind-prefix invariant and
  import-binding formats so prompt-driven regeneration preserves them.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two more dataclass edge cases probed by the external reviewer on
``93d7fe69d``:

1. ``@dataclass(init=False)`` was ignored. The synth unconditionally
   built a constructor signature from class-body ``AnnAssign`` nodes
   when a dataclass decorator was present, but ``init=False`` keeps
   the class's natural (zero-arg) ``__init__``. Now detect the kwarg
   on Call decorators (``@dataclass(init=False)`` and
   ``@dataclasses.dataclass(init=False)``) and emit ``()`` instead of
   walking the fields. Flipping ``init=False`` -> default trips the
   gate; adding fields under ``init=False`` does not.

2. Inherited dataclass fields were not merged into the synthesised
   constructor signature. Now walk same-module ``Name`` bases left-
   to-right depth-first with a cycle guard, prepending each
   ``@dataclass`` base's fields (and ITS bases' fields, recursively)
   in front of the derived class's own fields. Override semantics:
   when both base and derived declare the same field name, the
   derived's annotation/default replaces the base's, preserving the
   base's slot via insertion-order dict semantics — matching what
   ``@dataclass`` actually synthesises.

   Cross-module / attribute-form bases (``class User(pkg.Base)``)
   that we cannot resolve from this module's AST get an
   ``[inherited_unresolved]`` marker prepended to the synth signature
   so local field changes still diff while invisible base changes
   correctly slip through.

V1 limitations (documented in the helper docstring):

- A base decorated with ``@dataclass(init=False)`` still contributes
  declared fields to a derived ``@dataclass``'s synth at runtime, but
  we treat such bases as non-dataclass for the purposes of inherited-
  field merging.
- The ``KW_ONLY`` sentinel does not propagate across the inheritance
  boundary — each class re-emits its own kw-only context.

Tests in ``TestPublicSurfaceBindingKind``:

- ``test_dataclass_init_false_decorator_flip_to_default_is_detected``
- ``test_dataclass_init_false_decorator_keeps_zero_arg_signature``
- ``test_dataclasses_dataclass_init_false_form_works``
- ``test_dataclass_inherited_field_added_to_private_base_is_detected``
- ``test_dataclass_unchanged_inheritance_is_allowed``
- ``test_dataclass_unresolved_base_is_marked_uncertain``
- ``test_dataclass_multiple_inheritance_left_to_right``
- ``test_dataclass_explicit_init_still_takes_precedence_with_inheritance``

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Walk dataclass bases in reverse-MRO order so the synthesised
  ``__init__`` matches Python's ``@dataclass`` semantics
  (``__dataclass_fields__`` is populated by reversed iteration so
  later bases override earlier ones in dict insertion order). For
  ``class C(A, B)`` the synth becomes ``(b, a, c)``.
- Stop skipping ``@dataclass(init=False)`` bases: their fields STILL
  live in ``__dataclass_fields__`` and the derived class merges them.
  ``init=False`` only suppresses the base's OWN init synthesis.
- Update the existing multiple-inheritance test (renamed to
  ``test_dataclass_multiple_inheritance_mro_order``) and add diamond,
  single-inheritance regression, and init=False-base inheritance
  tests.
- Document reverse-MRO ordering and init=False inheritance in the
  prompt and architecture.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Mirror prompt deps into architecture.json for update_main (was
  2/9) and ci_drift_heal (was 7/9) so global sync can topologically
  order stale dependents correctly.
- Surface repair_directive (#1012 F-H) on the canonical signatures
  for cmd_test_main and run_agentic_test_generate in
  architecture.json, plus the cmd_test_main_python.prompt Inputs
  section and the agentic_test_generate_python.prompt Entry Point.
  Also fix run_agentic_test_generate's pre-existing stale 4-arg
  Tuple[bool, str, str] signature in architecture.json.
- Honor the anchored BREAKING-CHANGE prompt opt-out (parsed by
  _prompt_allows_test_churn) in the deletion-as-churn branches at
  cmd_test_main.py:230 and one_session_sync.py:609, closing the
  asymmetry where the prompt opt-out worked for wholesale rewrites
  but silently failed when the agent emptied/deleted the test
  file. Import the helper into both files and add regression tests
  in test_cmd_test_main.py and test_one_session_sync.py.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address three findings from the external review of the public-surface
and test-churn gates:

1. P1 — `_is_test_output_path` now also matches `<name>_test.go` /
   `<name>_test.rb` / `<name>_spec.rb` so Go and Ruby sibling test
   files no longer bypass the churn gate when they live next to
   production code instead of under `tests/`.

2. P1 — `_snapshot_public_surface` skips `from __future__ import ...`
   directives. Removing the directive after a Python-version bump is
   harmless cleanup, but the gate previously flagged `annotations`
   (and any other future name) as a removed public symbol.

3. P2 — `_verify_public_surface_regression` expands BREAKING-CHANGE
   class-removal directives to cover every captured descendant. With
   `BREAKING-CHANGE: remove Service` the caller no longer has to
   enumerate `Service.run`, `Service.Inner.go`, etc. Signature-change
   directives stay strict — auto-expansion there would silently mask
   real regressions.

Additionally:
- sync_main retry message: report "Public surface regression" /
  "Test churn threshold exceeded" instead of mislabeling them as
  "Architecture conformance failed".
- README: document the full public-surface coverage (constants and
  re-exported imports) and the new class-removal descendant rule.
- CHANGELOG: restore the v0.0.238 Chore/Test subsections the original
  feat commit accidentally dropped.

Regression coverage added in TestSyncCompatibilityGates: Go/Ruby
sibling-file churn, `__future__` import ignored by the gate, and
`BREAKING-CHANGE: remove Class` covering nested descendants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address three follow-up findings from the next review pass:

1. P1 — `_snapshot_public_signatures` also skips `from __future__
   import …`. Without this the surface helper would return early but
   the signatures dict still records `[import:from __future__]`, so a
   regenerated module that drops the directive and binds a real
   `annotations = …` value would falsely diff as a
   `[import:from __future__]` → `[assignment]` signature change.

2. P1 — `_is_test_output_path` now also matches PascalCase JVM-style
   suffixes: `Test.java`, `Tests.java`, `Test.kt`, `Tests.kt`,
   `Spec.kt`. The agentic test prompt already names `Test.java` as a
   recognised convention; without this match `src/test/java/FooTest.java`
   bypassed the churn gate. Case-sensitive on purpose so `latest.kt`,
   `manifest.java`, `request.scala` do not false-positive.

3. P1 — Prompt drift: `pdd/prompts/code_generator_main_python.prompt`
   now documents the `__future__` skip in BOTH `_snapshot_public_surface`
   and `_snapshot_public_signatures`, the full PascalCase / sibling-
   file test-suffix list in the test-churn gate description, and the
   `BREAKING-CHANGE: remove <Class>` descendant-expansion rule. A
   future `pdd sync code_generator_main` can now regenerate without
   reverting iter-9 + iter-10.

Nit:
- `run_agentic_test_generate` annotated return type was a stale
  `tuple[str, float, str, bool]` (and the docstring claimed a 4-tuple)
  while the actual return is `TestResult` (5 fields). Updated both.

Regression coverage added in TestSyncCompatibilityGates:
- `from __future__ import annotations` rebinding to a real
  `annotations = {...}` value (the exact reviewer-named scenario).
- `FooTest.java` and `WidgetSpec.kt` churn detection.
- `latest.kt` NOT classified as a test (PascalCase false-positive
  regression guard).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-up findings from the next review pass:

1. P1 — Expand `_is_test_output_path` to cover the remaining common
   test-file conventions the gate was skipping: Maven failsafe
   integration tests (`FooIT.java`), older JUnit/TestNG TestCase
   convention (`FooTestCase.java`), ScalaTest (`Test.scala`,
   `Tests.scala`, `Spec.scala`), Swift (`Tests.swift`), .NET xUnit/
   NUnit (`Test.cs`, `Tests.cs`), and Rust sibling tests
   (`_test.rs`). All PascalCase JVM/.NET/Swift entries stay
   case-sensitive so `latest.kt`/`manifest.java`/`request.scala`
   continue NOT to false-positive.

2. P1 — Close the alt-path test churn loophole: when
   `run_agentic_test_generate` falls back to reading an alternate
   path (e.g. `__tests__/foo.test.ts`) because the canonical
   `output_test_file` is absent, recover the alt path's pre-existing
   content via `git show HEAD:<rel>` and re-run the test-churn check
   against THAT baseline (not the empty canonical). On
   `TestChurnError`: attach `total_cost`/`model_name` and restore the
   pre-existing content to disk so the repair loop sees the canonical
   pre-sync state. Untracked / never-committed alt files harmlessly
   return empty and fall through to first-time-generation behavior.
   Limitation: when the agent writes to BOTH the canonical output
   AND an alt path, only the canonical churn is gated; the alt-path
   delta in that dual-write case is still unguarded. The reviewer's
   named scenario (canonical absent, alt rewritten) is fully covered.

3. P1 — `_collect_bound_module_names` also skips `from __future__
   import …` so `__all__ = ["annotations"]` no longer promotes the
   future binding into the public surface via the `__all__`-
   authoritative branch. Without this skip a clean-up that drops
   `from __future__ import annotations` AND the matching `__all__`
   entry would falsely diff as `removed: annotations`.

Prompt sync:
- `code_generator_main_python.prompt`: full extended test-suffix list
  in the test-churn gate description, and the parallel `__future__`
  skip in `_collect_bound_module_names` documented alongside the
  earlier skip notes for `_snapshot_public_surface` and
  `_snapshot_public_signatures`.
- `agentic_test_generate_python.prompt`: documents the alt-path
  git-recovery churn check, the `_read_prior_content_from_git`
  helper contract, and the restore-on-failure requirement.

Regression coverage:
- `FooIT.java`, `FooTestCase.java`, `FooSpec.scala`, and
  `widget_test.rs` churn detection.
- `__all__ = ["annotations"]` + `from __future__ import annotations`
  removal MUST NOT trip the gate.
- `run_agentic_test_generate` alt-path rewrite with a tracked prior
  (mocked `_read_prior_content_from_git` returning 20 tests) raises
  `TestChurnError` AND restores the prior content; untracked prior
  (mocked empty) falls through with no raise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…shot; broaden test-file extension coverage

Structural pivots in response to the next external review pass —
the iterative "add named language" / "fall back to HEAD" pattern
was diverging, so both fixes change the underlying approach:

F1+F2 — Replace `git show HEAD` recovery with a working-tree
content snapshot. `run_agentic_test_generate` now snapshots every
existing file under `project_root` matching `_is_test_output_path`
BEFORE the agent runs (bounded by per-file size of ~1 MiB) and uses
that snapshot as the alt-path churn baseline. This covers all three
prior-state shapes uniformly:
  - tracked-clean (matched HEAD's content anyway)
  - tracked-dirty (HEAD-based restore would have silently clobbered
    local edits; snapshot preserves the working-tree content)
  - untracked (HEAD-based recovery returned empty, treating
    pre-existing coverage as first-time generation; snapshot picks
    it up correctly)
The `_read_prior_content_from_git` helper and the `subprocess`
import are removed — replaced by `_snapshot_pre_test_contents`,
which lazy-imports `_is_test_output_path` to avoid the circular
dependency.

F3 — Replace the per-language `sibling_test_suffixes` tuple with a
derived check against `_LANGUAGE_TEST_FILE_EXTS` (module-level
tuple seeded with `.py`, `.go`, `.rb`, `.rs`, `.exs`, `.ex`,
`.dart`, `.clj`, `.cljc`, `.lua`, `.php`). `_is_test_output_path`
now matches `_test.<ext>` / `_spec.<ext>` for every extension in
that set, so Elixir, Dart, Clojure, Lua, and PHP are covered in
one go — and a future language addition to `language_format.csv`
only needs one tuple-append, not another `_is_test_output_path`
fix round. PascalCase JVM/.NET/Swift suffixes stay enumerated
(case-sensitive on `Test`/`Tests`/`Spec`/`IT`/`TestCase` to dodge
`latest.kt`/`manifest.java`/`request.scala`/`latest.groovy` false
positives) and pick up Groovy's `Spec.groovy`/`Test.groovy`/
`Tests.groovy` (Spock framework).

Prompt + architecture sync:
- `code_generator_main_python.prompt`: documents the two-family
  test-suffix shape (data-driven `_LANGUAGE_TEST_FILE_EXTS` vs
  enumerated PascalCase) with an instruction to append the tuple
  when a new language is added rather than patching the gate.
- `agentic_test_generate_python.prompt`: documents
  `_snapshot_pre_test_contents` and the restore-on-failure
  contract; the stale `_read_prior_content_from_git` /
  `git show HEAD` notes are gone.
- `architecture.json`: `run_agentic_test_generate` sideEffects
  describe the snapshot contract; `_snapshot_pre_test_contents`
  listed as a module helper.

Regression coverage:
- `FooSpec.groovy` (Spock) and `Greatest.groovy` false-positive
  guard in `TestSyncCompatibilityGates`.
- `_test.exs` / `_test.dart` / `_test.clj` / `_spec.lua` churn
  detection (Elixir/Dart/Clojure/Lua arriving via the derived
  extension set, not extra code paths).
- `run_agentic_test_generate` alt-path tests rewritten to exercise
  the real snapshot code path (no git mocking): untracked prior
  raises; tracked-dirty prior raises and restore preserves dirty
  edits; brand-new alt path falls through to first-time-generation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sting files

BLOCKING: the conformance and public-surface / test-churn gate-call
sites in `code_generator_main` previously gated on `and
generated_code_content` (a truthy check), which evaluated False for
empty string. When the LLM provider returned "", every gate was
skipped and the writer further down then ran
`p_output.write_text(final_content)` — silently truncating an
existing public module / test file to 0 bytes with NO
`PublicSurfaceRegressionError` / `TestChurnError` raised. Reviewer
reproduced this with mocked local generation: existing public module
and `tests/test_existing.py` both went to 0 bytes.

Fix in three parts:

1. Gate condition — change `and generated_code_content` to
   `and generated_code_content is not None` at the architecture-
   conformance gate-call (line 4022) and at the public-surface /
   test-churn gate-call (line 4043). With `is not None` an empty
   generation still runs the gates: `_snapshot_public_surface("")`
   returns `set()` so every prior symbol surfaces as removed
   (PublicSurfaceRegressionError); `_compute_test_churn_ratio(...)`
   returns 1.0 against an empty rewrite so any non-empty test file
   trips TestChurnError; `_verify_architecture_conformance` finds
   no symbols so every declared arch entry surfaces as missing.
   The existing compat-gate restore branch then re-writes
   `existing_code_content` to disk so the repair loop sees the
   pre-sync baseline.

2. Safety guard — for file types neither gate can inspect (non-
   Python source, JSON, YAML, prompt files, etc.) the writer now
   refuses to overwrite non-empty existing content with empty or
   whitespace-only generated content, raising a
   `click.UsageError("Refusing to overwrite ...")`. Empty output is
   almost always an upstream provider/parse failure, and silent
   truncation loses real work. `PDD_ALLOW_EMPTY_GENERATION=1` is
   the explicit escape hatch for the rare intentional case.

3. Documentation — prompt + README sync the empty-generation gating
   contract and the new env flag so a future `pdd sync
   code_generator_main` does not regenerate away the fixes and
   users hitting the safety guard discover the bypass flag.

Regression coverage (three named end-user surfaces):
- empty generation over existing Python public module →
  `PublicSurfaceRegressionError`, file restored to pre-sync content.
- empty generation over existing `test_existing.py` →
  `TestChurnError`, file restored to pre-sync content.
- empty generation over existing `config.yaml` → safety-guard
  `click.UsageError("Refusing to overwrite ...")`, file untouched.

The branch is rebased onto current `origin/main` (one
auto-generated test-durations conflict resolved by taking main's
version), so the PR now presents a clean linear history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecture.json

External review iter-13 docs nit: the empty-generation safety guard
added in 13cb692 and the new `PDD_ALLOW_EMPTY_GENERATION` escape-
hatch env flag are documented in README.md and the
`code_generator_main_python.prompt` body, but architecture.json's
`code_generator_main` description and CHANGELOG.md still did not
mention them.

- architecture.json: extend the existing `code_generator_main`
  description (which already names `PDD_TEST_CHURN_THRESHOLD` /
  `PDD_REPAIR_DIRECTIVE`) with the `is not None` gating contract,
  the non-Python safety guard, and the `PDD_ALLOW_EMPTY_GENERATION`
  bypass — keeping the env-flag inventory complete for downstream
  consumers that read architecture.json as the API/behavior catalog.
- CHANGELOG.md: add an "Unreleased" section summarising the PR
  #1015 / #1012 gate feature with the full env-flag list so the
  next release-notes bump can promote it verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cloud-test run on 792e7d7 surfaced two failures (75 passed, 2
failed of 77 tasks); both addressed here so the next run goes
clean:

1. Stub mismatch — `tests/test_e2e_issue_342_syspath_isolation.py`
   monkeypatches `pdd.agentic_test_generate.run_agentic_test_generate`
   via a `fake_run_agentic_test_generate` stub. The real signature
   gained a `repair_directive: str | None = None` kwarg in iter-9
   (#1012, F-A / F-H), but the stub still had the pre-iter-9
   signature, so any code path that forwarded the new kwarg failed
   with `TypeError: ... got an unexpected keyword argument
   'repair_directive'`. Stub now accepts the kwarg (default `None`).

2. Test-churn gate false-positive in `tests/regression.sh` and
   `tests/cloud_regression.sh` — both scripts pre-existed PR #1015
   and seed small fixture files (~24-line `test_simple_math.py`)
   as prerequisites. When the gate-bearing `pdd test` then
   regenerates against that small fixture, the resulting test
   suite naturally exceeds the 40% churn threshold and raises
   `TestChurnError` (exit code 2). These regression scripts
   validate baseline CLI behavior — that `pdd
   generate/test/example/fix/...` succeed and produce output —
   NOT the new gates' behavior (the gates have their own targeted
   pytest suites). Exporting `PDD_SKIP_TEST_CHURN_GATE=1` and
   `PDD_SKIP_PUBLIC_SURFACE_GATE=1` at the top of each script
   opts the run out of the new gates without touching the
   `PDD_SKIP_CONFORMANCE` umbrella (so architecture conformance
   still helps if it ever fires through this surface). A
   multi-line comment in each script explains why so a future
   regression-script editor doesn't unwind the bypass.

3. Refresh `ci/cloud-batch/test-durations.json` from this run's
   junitxml (348 files / 349 total). Updated automatically by
   `submit.sh` post-job; included so the next cloud-test schedule
   benefits from the new measurements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-updated by ci/cloud-batch/submit.sh after the post-iter-14
cloud-test run (77/77 passed, 0 failed). Including so the next
cloud-test schedule chunks tests with the latest measurements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e (Greg review #1015)

Greg's review of PR #1015 surfaced a false negative in the
agentic test-churn gate: when the canonical output file exists with
content AND the agent rewrites a separate pre-existing test-like file
(e.g. shrinks `__tests__/widget.test.ts` from 20 tests to 1) while
leaving the canonical untouched, no `TestChurnError` was raised.

Root cause — `pdd/agentic_test_generate.py`:
The alt-path branch (`if not generated_content and changed_files`)
only ran a churn check on alt-path files when the canonical was
EMPTY. With canonical present and unchanged, the alt-path scan was
skipped entirely; `cmd_test_main`'s outer churn check then compared
the (unchanged) canonical against itself and trivially passed.

Root cause — `pdd/one_session_sync.py`:
The churn gate compared only `test_path` (canonical). The one-session
agent runs under a mega-prompt with no scope constraint and can touch
any test-like file in the project, so the same false negative applied.

Fix:

1. `pdd/agentic_test_generate.py`: replace the alt-path-only churn
   check with a broad sweep over every changed test-like file that
   existed pre-run (snapshot already taken via
   `_snapshot_pre_test_contents`). Restore EVERY pre-run snapshot
   before raising so the retry loop's next attempt re-snapshots from
   the true pre-run baseline — restoring only the violating file
   would let attempt-N rewrites become attempt-N+1's new baseline and
   permanently weaken the gate.

2. `pdd/one_session_sync.py`: add a pre-session snapshot of every
   test-like file in the project (lives outside the retry loop, single
   source of truth across attempts). Inside the existing churn gate
   try-block, sweep changed test-like files against the snapshot
   before running the canonical-only check. Skip the canonical path
   in the sweep so the canonical's deletion-as-churn shortcut and
   repair directive remain intact. The opt-out env flags and prompt-
   side `BREAKING-CHANGE: rewrite tests` were moved above the
   canonical-empty early-return so they apply to the alt-path sweep
   too. The except handler now restores every snapshot in addition to
   the canonical + code paths.

Test coverage:
- `tests/test_agentic_test_generate.py`:
  - `test_canonical_present_alt_path_rewrite_raises_churn_error`
    (Greg's repro: canonical unchanged, alt-path 20→1)
  - `test_canonical_present_multiple_alt_paths_all_restored_on_failure`
    (restore-all invariant)
- `tests/test_one_session_sync.py::TestOneSessionRollback`:
  - `test_alt_path_rewrite_with_canonical_unchanged_raises_churn_error`
    (analogous one-session repro)

Both new test bodies were confirmed to FAIL against the pre-fix
code (via `git stash` of the patched module) and PASS against the
patched code. Existing 12-scenario gate matrix remains 12/12 PASS.

Closes Greg-review blocking concern on PR #1015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gltanaka
Copy link
Copy Markdown
Contributor

Follow-up review on updated head 20eec1b: still not merge-ready.

The iter-15 update fixes the previous rewrite false negative, but the same multi-file gate still misses deletion of pre-existing alternate test files.

Blocking scenario:

  • canonical test file exists and is unchanged, e.g. tests/widget.test.ts
  • pre-existing alternate test file exists with broad coverage, e.g. __tests__/widget.test.ts
  • agent deletes the alternate test file and reports success

Observed locally against 20eec1b:

  • run_agentic_test_generate(...) returned agentic_success=True
  • the alternate test file remained deleted
  • no TestChurnError was raised

The one-session path has the same issue: run_one_session_sync(...) returned success (PDD_PHASE: synced) while a pre-existing src/widget_test.py alternate test file was deleted.

Root cause:

  • pdd/agentic_test_generate.py skips changed files that no longer exist (if not changed_path.exists() or not changed_path.is_file(): continue) before checking the pre-run snapshot.
  • pdd/one_session_sync.py explicitly lets deleted alt-path files fall through in the snapshot sweep; only canonical deletion is treated as churn.

Required changes before merge:

  1. Treat deletion of any pre-existing snapshotted test-like file as maximal churn (post_text="", ratio 1.0) unless the test-churn gate is skipped or the prompt has the explicit BREAKING-CHANGE: rewrite tests opt-out.
  2. Restore deleted alternate test files from the pre-run snapshot before raising, in both agentic_test_generate and one_session_sync.
  3. Add regression tests for canonical-present + alternate-path-deleted in both paths.

The current focused tests pass (tests/test_agentic_test_generate.py -k "alt_path or canonical_present", tests/test_one_session_sync.py -k "alt_path_rewrite_with_canonical or test_churn_restores_code_and_test") and compileall is clean, but they do not cover this remaining false negative.

…(Greg follow-up review #1015)

Greg's iter-15 follow-up review surfaced a residual false negative
in the multi-file churn sweep: the iter-15 sweep filtered out paths
that no longer existed (`if not snap_path.exists(): continue`), so an
agent could DELETE a pre-existing alternate test file while leaving
the canonical untouched and the gate would not fire. Symmetric to
the canonical-deletion shortcut already in place; the alt-path
deletion case was simply missing.

Root cause — `pdd/agentic_test_generate.py`:
The iter-15 sweep iterated `changed_files` and `continue`d when the
changed path no longer existed. `pre_test_contents` keys for files
the agent had deleted were never visited.

Root cause — `pdd/one_session_sync.py`:
The iter-15 sweep explicitly skipped non-existent snapshot keys
(`if not snap_path.exists(): continue`) with a comment saying the
canonical-deletion branch covered the canonical case — but that
branch only fires for the canonical, not alt-paths.

Fix:

1. Both sweeps now iterate `pre_test_contents.items()` directly so
   deletions are visited uniformly with rewrites. When the
   snapshotted file no longer exists post-agent, the sweep raises
   `TestChurnError(churn_ratio=1.0, post_line_count=0, ...)` with a
   deletion-specific repair directive mirroring the canonical
   shortcut. Honors the same opt-outs (`PDD_SKIP_TEST_CHURN_GATE`,
   `PDD_SKIP_CONFORMANCE`, prompt-side `BREAKING-CHANGE: rewrite
   tests`) — inline because the deletion branch raises directly
   without going through `_verify_test_churn`.

2. Restore loops now `mkdir(parents=True, exist_ok=True)` before
   `write_text`, so files (and parent directories) the agent deleted
   are recreated when the gate fires. Without this, restoration of a
   deleted file in a deleted directory would raise OSError and be
   swallowed, leaving the file lost.

3. `agentic_test_generate.py` also factors restoration into a
   `_restore_snapshots()` helper used by both the rewrite and
   deletion branches.

Test coverage:
- `tests/test_agentic_test_generate.py::test_canonical_present_alt_path_deletion_raises_churn_error`
  — Greg's repro: canonical unchanged, alt-path deleted by mocked
  agent. Asserts `TestChurnError(churn_ratio=1.0)` raised, alt-path
  restored from snapshot, canonical untouched.
- `tests/test_agentic_test_generate.py::test_alt_path_deletion_honors_skip_test_churn_gate_env`
  — `PDD_SKIP_TEST_CHURN_GATE=1` bypasses the deletion check (since
  it raises directly without `_verify_test_churn`).
- `tests/test_one_session_sync.py::TestOneSessionRollback::test_alt_path_deletion_with_canonical_unchanged_raises_churn_error`
  — Analogous one-session repro.

Each new test was confirmed to FAIL against the iter-15 code (via
git stash) and PASS against iter-16. Existing 989 focused tests +
12/12 scenario matrix continue to pass.

Closes the iter-15 deletion follow-up flagged on PR #1015.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gltanaka
Copy link
Copy Markdown
Contributor

Follow-up review on updated head f1355ba: deletion handling is fixed, but there is still one cleanup blocker in the one-session path.

Good news: I reran the prior deletion repros and both now behave correctly:

  • run_agentic_test_generate(...) raises TestChurnError with ratio=1.0 and restores the deleted alternate test file.
  • run_one_session_sync(...) raises TestChurnError with ratio=1.0 and restores the deleted alternate test file when churn is the first failing gate.

Remaining blocker: if a one-session attempt causes a public-surface regression AND deletes/rewrites an alternate test file in the same attempt, the public-surface gate fires first and the churn gate is skipped. The surface handler restores only the code file and canonical test file, so the alternate test deletion remains on disk after the failed sync.

I reproduced locally against f1355ba with:

  • pre-existing code containing public keep_me()
  • pre-existing canonical test unchanged
  • pre-existing alternate src/widget_test.py with 20 tests
  • mocked one-session agent removes keep_me() and deletes src/widget_test.py, then returns success

Observed result:

  • PublicSurfaceRegressionError is raised as expected
  • code file is restored
  • canonical test is restored/unchanged
  • alternate src/widget_test.py remains deleted

This still violates the PR's safety goal: a failed sync can discard broad existing test coverage as a side effect of a higher-priority surface failure.

Required change before merge:

  1. In the PublicSurfaceRegressionError handler inside run_one_session_sync, restore every snapshotted test-like file from pre_test_contents, just like the TestChurnError handler now does. This should recreate deleted alt-path tests and parent directories.
  2. Add a regression test where the same one-session attempt both regresses public surface and deletes or rewrites an alternate test file; assert the final raised error is PublicSurfaceRegressionError and all snapshotted test files are restored.

Focused checks I ran after this review:

  • pytest -q tests/test_agentic_test_generate.py -k "alt_path or canonical_present" -> 7 passed
  • pytest -q tests/test_one_session_sync.py -k "alt_path or test_churn_restores_code_and_test or surface" -> 13 passed
  • python -m compileall -q pdd/agentic_test_generate.py pdd/one_session_sync.py pdd/code_generator_main.py -> clean

… (Greg iter-16 follow-up)

Greg's iter-16 follow-up review surfaced a residual issue in
`run_one_session_sync`: when the same attempt regresses public
surface AND rewrites/deletes an alt-path test file, the public-
surface gate fires first (higher priority) and `continue`s before
the test-churn gate runs. The surface handler restored only the
code file + canonical test file, leaving the alt-path damage on
disk — silently discarding broad existing coverage as a side
effect of a higher-priority failure.

Root cause:
The `except PublicSurfaceRegressionError` handler in
`run_one_session_sync` restored `code_path` from
`existing_code_content` and `test_path` from `existing_test_content`
but had no equivalent of the iter-15/16 churn-handler loop that
walks `pre_test_contents.items()` and writes every snapshot back.

Fix:
Mirror the churn-handler restore loop in the surface handler.
Iterate `pre_test_contents.items()`, skip the canonical (already
restored above), and `mkdir(parents=True, exist_ok=True)` +
`write_text` every snapshotted alt-path test file — so both
rewrites AND deletions performed by the failed attempt are rolled
back regardless of which gate ultimately raised. Best-effort:
`OSError` swallowed per the existing pattern.

Test coverage:
- `tests/test_one_session_sync.py::TestOneSessionRollback::test_surface_failure_with_alt_path_deletion_restores_all`
  — Greg's repro: pre-existing public `keep_me()`, canonical test,
  alt-path `src/widget_test.py` with 20 tests. Mocked agent removes
  `keep_me()` AND deletes alt-path in the same attempt. Asserts
  `PublicSurfaceRegressionError` raised, code + canonical + alt-path
  ALL restored to pre-session bytes.

Confirmed the test fails against iter-16 (alt-path stayed deleted)
and passes against iter-17. Existing 990 focused tests + 12/12
scenario matrix continue to pass.

Closes Greg's iter-16 surface-handler restore-coverage follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gltanaka gltanaka merged commit 6e55191 into main May 23, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pdd-issue PDD: autonomous issue solving pdd-sync PDD: sync prompts with code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pdd sync over-regenerates mature modules and breaks public compatibility

3 participants