Skip to content

feat: add /pdd budget control comments for GitHub App runs (#1128)#1131

Open
prompt-driven-github[bot] wants to merge 26 commits into
mainfrom
change/issue-1128
Open

feat: add /pdd budget control comments for GitHub App runs (#1128)#1131
prompt-driven-github[bot] wants to merge 26 commits into
mainfrom
change/issue-1128

Conversation

@prompt-driven-github
Copy link
Copy Markdown
Contributor

Summary

Adds the full GitHub App control-comment surface for budget management: startup settings comments, /pdd budget, /pdd budget node, /pdd budget max, /pdd settings, /pdd stop, with reactive active-run updates and budget enforcement for both normal commands and pdd-issue defaults. Per @Serhan-Asad's clarification, this is intentionally broader than a generic cost watcher — it covers the public prompts/specs/server helpers/interfaces in this repo that drive App/executor behavior.

Closes #1128

Changes Made

Prompts Modified

  • pdd/prompts/server/jobs_python.prompt — thread budget_cap / node_budget / max_total_cap / node_count through Job and JobManager.submit; add update_budget; integrate cost_budget_watcher; introduce BUDGET_EXCEEDED status; trigger existing cancel path on cap hit.
  • pdd/prompts/server/models_python.prompt — add BudgetSettings, BudgetUpdateRequest, BudgetExceededMessage, SlashCommandResult models and a new job-status enum value.
  • pdd/prompts/server/routes/commands_python.prompt — new GET / POST /commands/jobs/{job_id}/budget endpoints; defaults wiring for pdd-issue ($80 per node, $400 max).
  • pdd/prompts/track_cost_python.prompt — formalize the cost-CSV read contract so external watchers can rely on the schema.

Prompts Created

  • pdd/prompts/cost_budget_watcher_python.prompt — daemon-thread CSV poller with update_cap, idempotent .stop(), no-op when cap is None.
  • pdd/prompts/server/budget_settings_python.prompt — per-job settings store, effective-cap formula (min(node_budget * node_count, max_total_cap)), pdd-issue defaults, validators.
  • pdd/prompts/server/slash_command_parser_python.prompt — pure parser for /pdd ... commands with fenced-block / bot-comment / dedupe handling and authorization helpers.
  • pdd/prompts/server/budget_comments_python.prompt — pure Markdown renderers matching the issue's literal startup/settings/stop strings.

Documentation Updated

Review Checklist

Next Steps After Merge

  1. Regenerate code from modified prompts in dependency order:
    ./sync_order.sh
    Or manually:
    pdd sync cost_budget_watcher
    pdd sync track_cost
    pdd sync budget_comments
    pdd sync budget_settings
    pdd sync slash_command_parser
    pdd sync cli
    pdd sync executor
    pdd sync auth
    pdd sync commands
    pdd sync files
    pdd sync fix
    pdd sync jobs
    pdd sync utility
    pdd sync app
    pdd sync __init__
    pdd sync checkup
    pdd sync connect
    pdd sync generate
    pdd sync maintenance
    pdd sync modify
    pdd sync analysis
    pdd sync models
    pdd sync websocket
    
  2. Run pytest -vv tests/ to verify functionality.
  3. Wire the new prompts/models/endpoints into the private GitHub App webhook handler (out-of-repo) so the public surface is exercised end-to-end.

Created by pdd change workflow

Adds the full GitHub App control-comment surface for budget management:
startup settings comments, /pdd budget, /pdd budget node, /pdd budget max,
/pdd settings, /pdd stop, with reactive active-run updates and budget
enforcement for both normal commands and pdd-issue defaults.

Prompt changes (MODIFY):
- prompts/server/jobs_python.prompt: thread budget_cap/node_budget/
  max_total_cap/node_count, add update_budget, integrate watcher, add
  BUDGET_EXCEEDED status.
- prompts/server/models_python.prompt: BudgetSettings,
  BudgetUpdateRequest, BudgetExceededMessage, SlashCommandResult, new
  enum value.
- prompts/server/routes/commands_python.prompt: GET/POST
  /commands/jobs/{job_id}/budget; defaults wiring for pdd-issue.
- prompts/track_cost_python.prompt: formalize read contract for watchers.

Prompt changes (CREATE):
- prompts/cost_budget_watcher_python.prompt: daemon-thread CSV poller
  with update_cap and idempotent stop.
- prompts/server/budget_settings_python.prompt: per-job store +
  effective-cap formula + pdd-issue defaults + validators.
- prompts/server/slash_command_parser_python.prompt: pure parser with
  fenced-block / bot / dedupe handling + authorization helpers.
- prompts/server/budget_comments_python.prompt: pure Markdown renderers
  matching the issue's literal strings.

Docs:
- README.md: new "GitHub App control comments" subsection.
- CHANGELOG.md: Unreleased entry referencing #1128.

Closes #1128

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Resolves architecture.json conflict by keeping main's updated description for
agentic_checkup_orchestrator while preserving this PR's added module entries
for cost_budget_watcher, budget_settings, slash_command_parser, and
budget_comments.
Closes the four spec-level bugs flagged on PR #1131:

Finding 2 (enforcement boundary): cost_budget_watcher and the README
claimed enforcement happens "between LLM/tool calls", but track_cost
only appends a row when a PDD subprocess exits. Documents enforcement
as a SUBPROCESS boundary, not a mid-call boundary; explains the
pdd-issue case (works well: many nested subprocesses) and the single-
command case (cap can be overshot by one subprocess's spend) so readers
do not over-claim.

Finding 3 (pdd-issue cost attribution): the watcher used to take a
single command=str filter. pdd-issue itself never writes a row with
command="issue"; it spawns nested PDD subprocesses (change, sync, bug,
fix, generate, test, ...) and each writes its own row. Filtering on
"issue" summed to $0 and the cap would never fire. Changes the
watcher to accept commands: Optional[Iterable[str]] (a set of accepted
command names) and documents in jobs_python.prompt which set to pass
for pdd-issue vs single-command jobs, with an explicit warning not to
reintroduce the {"issue"} bug.

Finding 4 (SlashCommandResult.metadata): the parser stored the validated
amount in a metadata dict, but the model did not define a metadata
field. Pydantic would either reject the extra field or drop the amount.
Adds metadata: Dict[str, Any] (default_factory=dict) to SlashCommandResult
in models_python.prompt and updates the parser doc to make the contract
explicit (always a concrete dict; {"amount": <float>} for budget-
mutating kinds, empty dict otherwise).

Finding 5 (auth gate inconsistency): parser R5 used to gate ALL /pdd
comments by is_authorized before parsing, which contradicted both the
README ("other commenters can use /pdd settings") and render_unauthorized
("refer the user to /pdd settings"). Rewrites R5 to parse first, then
gate by verb: budget-mutating verbs (budget_*, stop) require
is_authorized; the read-only /pdd settings verb is open to anyone whose
comment is parsed (subject to bot/fenced/dedupe filters). Updates the
README and budget_comments R7 to match.

Also reflects the watcher's command -> commands rename and the model's
new metadata field in architecture.json so the next pdd sync does not
flag drift.

Includes a clean merge of upstream/main; resolves the architecture.json
description conflict in agentic_checkup_orchestrator.

Finding 1 (no generated runtime code) is intentionally NOT addressed in
this commit: this PR was scoped as prompt-only with code generation in
the original "Next Steps After Merge" plan. See the PR reply for the
follow-up plan.
…pted-commands set

The reader-contract narrative said "Watchers filter the CSV by command"
(singular). The watcher's parameter is now commands (a set of accepted
names), so future regenerations of track_cost could bake the singular
assumption back in and silently reintroduce the pdd-issue zero-spend
bug. Documents the set form, and includes an explicit anti-pattern note
for the {"issue"} case.
@Serhan-Asad
Copy link
Copy Markdown
Collaborator

Thanks — all five findings hold up to primary-source verification. Findings 2–5 are now addressed in two commits on this branch (the merge of main, and the prompt-edit commit). Finding 1 is acknowledged below.

Finding 1 — prompt/docs only, no generated .py

Valid as scoped (intentional). The PR was scoped as a prompt/spec-level change, with code generation listed under "Next Steps After Merge". The current branch still ships only the prompt/doc/architecture.json diff; no pdd.cost_budget_watcher / pdd.server.budget_settings / slash_command_parser / budget_comments modules exist yet. Run-time behavior depends on (a) pdd sync regenerating the four new modules and updating jobs.py / models.py / commands.py, and (b) the private GitHub App webhook handler wiring the slash-command path to GitHub. (b) cannot land from this repo, so a follow-up PR scoped to (a) is the right unit to close the issue's runtime acceptance criteria; the prompts in this PR are the contract that generated code must satisfy.

The code-generation follow-up MUST also ship a real-LLM test file (per the project's prompt-only-PR convention) that asserts the generated parser preserves the Finding 4 (metadata.amount) and Finding 5 (verb-scoped auth) contracts, so a future pdd sync cannot silently regress them via LLM drift.

Finding 2 — CSV write timing vs "between LLM/tool calls"

Valid (architectural). track_cost writes the row from its finally block after the wrapped command returns (pdd/track_cost.py:60–131); a CSV-tail watcher cannot see spend inside a still-running subprocess. Fixed by being honest about the boundary: cost_budget_watcher_python.prompt, jobs_python.prompt, track_cost_python.prompt, and the README now describe enforcement as a subprocess boundary (works well for pdd-issue, whose executor spawns many nested PDD subprocesses; single long commands can overshoot the cap by exactly one subprocess's spend). The "between LLM/tool calls" claim is gone from README. Mid-call enforcement would require a separate live-cost signal (per-LLM-call flush in track_cost, or an in-process subscriber) — explicitly out of scope for this PR.

Finding 3 — pdd-issue cost attribution

Valid. pdd-issue never writes a row with command="issue"; nested subprocesses write rows with command in {change, sync, bug, fix, generate, test, ...} (pdd/track_cost.py:82). A command="issue" filter sums to $0. Fixed by changing the watcher signature from command: Optional[str] to commands: Optional[Iterable[str]] (a frozen set of accepted commands) and documenting in jobs_python.prompt exactly which set to pass for pdd-issue vs single-command jobs — with an explicit warning not to reintroduce the {"issue"} bug. track_cost_python.prompt's reader contract section and architecture.json updated to match.

Finding 4 — SlashCommandResult.metadata inconsistency

Valid. models_python.prompt did not declare a metadata field, but the parser stored the validated amount on metadata. Pydantic would have either rejected the extra field or dropped it. Fixed by adding metadata: Dict[str, Any] (with default_factory=dict) to the SlashCommandResult interface block AND the model documentation, and tightening the parser doc so it states explicitly that metadata is always a concrete dict ({"amount": <float>} for budget-mutating kinds, empty for the rest). architecture.json updated to match.

Finding 5 — auth gate vs /pdd settings

Valid. R5 used to gate ALL /pdd comments via is_authorized(...) before parsing, contradicting both the README ("other commenters can use /pdd settings for a read-only view") and render_unauthorized (which refers users to /pdd settings). Fixed by rewriting R5 to parse first, then gate by verb: budget-mutating verbs (budget_*, stop) require authorization; the read-only /pdd settings verb is open to anyone whose comment is parsed (subject to the existing bot / fenced-block / dedupe filters). README and render_unauthorized's R7 updated to match, and the auth gate is mentioned by name in the model doc so the design intent is visible in three places.

Process

  • Mergeable: CONFLICTING was caused by the divergence with main (description text for agentic_checkup_orchestrator in architecture.json). Resolved by merging main and keeping its updated description while preserving this PR's new module entries. Now MERGEABLE.
  • The previous auto-heal failure ran against the conflicting tree; CI is re-running against the new head.
  • Tests: this PR is still prompt-only by design. The full test cycle (pdd sync for each new module + the real-LLM contract test noted under Finding 1 + pytest -vv tests/) is the follow-up PR's responsibility.

Serhan-Asad and others added 19 commits May 22, 2026 11:36
… budget surface

Implements the four new public modules whose contracts the earlier prompt
edits defined, plus the matching updates to existing server modules and a
focused test suite. Brings the PR from spec-only to a runnable surface that
the private GitHub App webhook handler can wire to GitHub.

New modules
  - pdd/cost_budget_watcher.py
      Daemon-thread CSV poller. Accepts a set-based commands filter (not a
      single name) so pdd-issue runs sum spend across nested subprocess
      rows (change/sync/bug/fix/generate/test/...). R1-R7 honored:
      fire-once, cap-update fires only on next crossing, no-cap is a
      no-op, missing CSV is $0, .stop() is idempotent, never kills the
      subprocess itself, tolerant of partial CSV rows.
  - pdd/server/budget_settings.py
      effective_cap() formula (issue: min(node_budget * max(node_count or
      1, 1), max_total_cap); else: budget_cap), pdd_issue_defaults() ->
      (80.0, 400.0), validate_amount() with the $10000 hard ceiling, and
      a thread-safe BudgetStore.
  - pdd/server/budget_comments.py
      Pure Markdown renderers matching the issue's acceptance criteria
      verbatim (R10 money formatting; $<int> for integers, $<value:.2f>
      for fractions, always two decimals for 'Spent so far'). The
      render_unauthorized reply explicitly redirects rejected commenters
      to /pdd settings so the read-only verb's openness is visible.
  - pdd/server/slash_command_parser.py
      Pure parser. parse_comment never raises; classifies into
      budget_set / budget_node_set / budget_max_set / settings / stop /
      invalid / ignored. Sets metadata['amount'] on budget-mutating
      kinds; bare /pdd budget N on a pdd-issue job aliases to
      budget_max_set per R6. is_authorized is a stateless helper —
      authorisation is the webhook handler's job, scoped to mutating
      verbs only (R5).

Modified modules
  - pdd/server/models.py
      Adds JobStatus.BUDGET_EXCEEDED, BudgetSettings, BudgetUpdateRequest
      (with field validators that coerce '$30'/'30.00' to float and
      enforce > 0 and <= 10000), BudgetExceededMessage (WSMessage variant
      emitted exactly once on cap crossing), and SlashCommandResult with
      a concrete metadata: Dict[str, Any] field (default_factory=dict) so
      Pydantic never drops the parser's amount.
  - pdd/server/jobs.py
      Job dataclass gets optional budget fields. JobManager learns the
      submit budget_cap/node_budget/max_total_cap kwargs, update_budget
      (with KeyError/RuntimeError/ValueError exception types as the
      public contract), get_budget (snapshot for /pdd settings), and
      watcher wiring keyed on resolved cost-CSV path (job.options
      output_cost, falling back to PDD_OUTPUT_COST_PATH). On cap
      crossing, _handle_budget_exceeded reuses the cancel path that
      /pdd stop drives, marks status BUDGET_EXCEEDED, and emits the new
      on_budget_exceeded callback. Watcher is stopped in finally so it
      never outlives the job.
  - pdd/server/routes/commands.py
      Threads budget_cap/node_budget/max_total_cap from CommandRequest to
      JobManager.submit; applies pdd_issue_defaults() ($80/$400) when
      no explicit fields are present on a pdd-issue submission. Adds
      GET /commands/jobs/{job_id}/budget (KeyError -> 404) and
      POST /commands/jobs/{job_id}/budget (KeyError -> 404,
      ValueError -> 400, RuntimeError 'job not active: ...' -> 409),
      discriminating on exception type only — never on message text.

Tests
  - tests/test_budget_control.py: 73 unit tests covering effective_cap,
    validate_amount, BudgetStore, the parser (including Finding 3/4/5
    regression guards: filtering on {'issue'} stays $0; metadata.amount
    is set on budget kinds; /pdd settings returns kind='settings' with
    empty metadata and is NOT gated by auth at parse time), all comment
    renderers, and watcher behaviors (fire-once, cap update, no-cap
    no-op, started_at filter, missing CSV, idempotent stop).
  - tests/test_budget_control_real.py: @pytest.mark.real test that
    regenerates the parser via pdd generate and re-asserts the Finding
    4 (metadata.amount) and Finding 5 (read-only settings, R6 issue
    alias) contracts. Guards against LLM drift on future syncs.
  - tests/server/routes/test_commands.py: updated submit assertion to
    include the new budget kwargs (test was previously asserting the
    pre-budget call signature).

All 73 new tests + 499 existing server tests pass locally; full server
suite is clean. Watcher enforcement is documented as a subprocess
boundary (not 'between LLM/tool calls'), matching the README and
cost_budget_watcher / track_cost prompt updates already on this branch.
…, perf

Addresses follow-up review on the runtime budget surface; each fix ships
with a regression test that asserts the broken case stays broken before
the fix and passes after.

Finding 1 — naive vs aware timestamp comparison breaks enforcement
  track_cost writes timestamps via datetime.now().strftime('%Y-%m-%dT...'),
  which is NAIVE local time (track_cost.py wraps Click commands; the
  reader-contract docstring says UTC but the actual writer is naive). The
  watcher's job.started_at is timezone-aware UTC. Comparing the two
  raises TypeError, which the watcher caught and swallowed at the poll
  level — so every row was skipped, spend stayed at $0, and the cap
  never fired in real production CSVs.

  Fix: _parse_timestamp coerces naive CSV cells to UTC before comparison
  (matching what the reader contract documented all along); _normalize_-
  started_at does the same to the caller's started_at so naive callers
  also work. Regression test seeds a naive-format row that asserts
  spent() > 0.

Finding 2 — BUDGET_EXCEEDED status lost to CANCELLED race
  _handle_budget_exceeded used to call cancel() FIRST and then assign
  BUDGET_EXCEEDED. cancel() injects asyncio.CancelledError into the
  running task; that lands in _execute_job's except handler which set
  status=CANCELLED unconditionally. Depending on scheduling, the
  CancelledError handler ran AFTER our BUDGET_EXCEEDED assignment and
  silently demoted the status. /pdd settings and the final
  budget-exceeded comment then lost the actual reason.

  Fix: set status=BUDGET_EXCEEDED (and budget store snapshot) BEFORE
  calling cancel(). Add a module-level _TERMINAL_STATUSES set and gate
  every status assignment in _execute_job, _execute_wrapper, cancel(),
  and _on_task_done so they never overwrite a terminal status — so even
  if scheduling changes in the future, BUDGET_EXCEEDED is permanent
  once set. Async regression test submits a job with a sleeping fake
  executor, manually trips _handle_budget_exceeded, waits 500ms for
  the cancel race to play out, and asserts status remains
  BUDGET_EXCEEDED.

Finding 3 — POST /commands/execute rejects command='issue'
  ALLOWED_COMMANDS did not include 'issue', so the validation gate at
  the top of execute_command raised HTTP 400 before the
  pdd_issue_defaults branch could run. The GitHub App's documented
  call path (command='issue' with empty budget fields → defaults to
  $80/$400) was unreachable.

  Fix: add 'issue' to ALLOWED_COMMANDS with a description that names
  the defaults branch. Regression tests assert both (a) command='issue'
  with no budget fields applies the defaults and (b) command='issue'
  with an explicit node_budget skips them.

Finding 4 — full CSV reread every poll is O(rows × polls)
  Old _read_spent walked the entire DictReader every 2s. For a long
  pdd-issue run with hundreds of nested-subprocess rows, this is
  unbounded disk I/O per watcher.

  Fix: rewrite as a true incremental tail. Track _byte_offset of the
  first unread byte and a cached _known_size; on each poll, stat the
  file, seek to _byte_offset, parse only the new bytes, advance the
  offset, and add to a running _spent total. Partial trailing rows
  (a row being flushed) are left for the next poll (rfind newline,
  leave leftover bytes). Truncation/rotation (file shrinks) resets the
  tail state and re-scans from zero so the watcher can never get
  stuck pointing past EOF. Regression test patches csv.reader to
  count invocations, appends 5 rows over 5 polls, and asserts the
  reader call count stays bounded.

Finding 5 (nit) — CHANGELOG drift on enforcement boundary
  CHANGELOG still said "enforces the active cap between LLM calls"
  even after README/cost_budget_watcher/track_cost prompts had been
  updated to say subprocess boundary. Reworded to match.

All 79 budget tests + 544 server/budget tests pass locally; no other
tests touched.
…cap callback

Three follow-up review findings on the runtime budget surface; each fix
ships with a regression test that reproduces the broken behavior on the
prior code and passes after the fix.

Finding 1 — non-UTC workers may still drop current-time rows
  track_cost wrote naive local-time timestamps via datetime.now() and
  the watcher reinterpreted naive cells as UTC outright. A naive
  '11:30 local PDT' row was treated as '11:30 UTC' — which is 7 hours
  EARLIER than the job's actual UTC start, so the row was filtered out
  by the started_at gate and spend stayed at $0.

  Fix: track_cost now writes timezone-aware ISO via
  datetime.now(timezone.utc).isoformat(timespec='milliseconds'), so
  rows carry an explicit '+00:00' offset and there is nothing to
  reinterpret. The watcher's _parse_timestamp uses .astimezone() in
  both branches: aware values are converted to UTC for a uniform
  comparison frame; legacy naive cells (from older CSVs written before
  this fix) are treated as local time and converted to UTC — never
  reinterpreted as UTC outright. Existing track_cost regression
  patterns updated to allow the new '+00:00' suffix.

Finding 2 — initial budget fields skipped validate_amount
  CommandRequest and JobManager.submit accepted any float (negatives,
  > $10000, NaN, inf) on initial submission. Only update_budget ran
  validate_amount. A submission with budget_cap=-1 stored
  effective_cap=-1.0 and behaved as if there were a cap, but the
  arithmetic was nonsense.

  Fix: add a field_validator on CommandRequest.{budget_cap, node_budget,
  max_total_cap} that applies the same > 0 and <= $10000 rule as
  BudgetUpdateRequest, and string-form coercion ('$30', '30.00').
  JobManager.submit runs validate_amount on each non-None budget kwarg
  before constructing the Job, so programmatic callers and tests are
  guarded too — not just the REST route.

Finding 3 — budget-exceeded callback reported stale cap after /pdd
  budget mid-run
  _start_watcher_for closed over the cap value computed at submit and
  passed it to _handle_budget_exceeded(job_id, spent, cap). When
  update_budget lowered the cap mid-run, the watcher fired at the new
  cap but the callback still reported the original.

  Fix: drop the closure capture. _handle_budget_exceeded now takes
  (job_id, spent) only and recomputes the current effective cap from
  the (potentially updated) job.budget fields at fire time, so the
  emitted BudgetExceededMessage carries the active cap a /pdd budget
  comment would have set. Regression test submits cap=100, lowers it
  to 5 via update_budget, trips _handle_budget_exceeded with
  spent=10, and asserts the on_budget_exceeded callback receives
  cap=5 (not 100).
… auto-wire CSV

Third review pass on the runtime budget surface. Each fix has a
regression test that reproduces the broken behaviour on the prior code.

Finding 1 — /pdd budget node|max accepted for non-issue commands
  effective_cap() only consults node_budget / max_total_cap when
  command == "issue"; for any other command they are ignored. The
  parser nevertheless returned budget_node_set / budget_max_set for any
  active_command, so a user issuing "/pdd budget node 50" during a
  pdd-bug job would see the webhook acknowledge the change while the
  cap silently stayed put.

  Fix: _parse_budget rejects the node and max verbs with kind="invalid"
  and a message that points the user at "/pdd budget N" when
  active_command != "issue". Tests cover the rejection on bug/change/
  unknown plus the still-accepted issue path. Pre-existing tests that
  exercised node/max without an active_command were updated to pass
  active_command="issue".

Finding 2 — node_count not updateable through the public API
  pdd-issue's effective cap depends on node_count
  (min(node_budget * max(node_count or 1, 1), max_total_cap)). On a
  fresh submission node_count is None so effective_cap collapses to
  node_budget ($80) instead of the intended min($80 * N, $400). With
  no API path to push the growing node count, the run would stop too
  early unless the private executor mutated job.node_count out of band.

  Fix: BudgetUpdateRequest gains node_count: Optional[int] with its
  own validator (int, >= 0, <= 10000) and a model_validator that keeps
  the existing "at least one field set" rule covering all four fields.
  JobManager.update_budget accepts node_count via the _UNSET sentinel
  pattern (None = clear, _UNSET = leave alone). The /budget POST route
  threads the field through. A synchronous update_node_count helper is
  added for the subprocess driver thread (no awaitable round-trip
  needed). Tests verify the cap grows from 80 -> 240 -> 400 as
  node_count goes None -> 3 -> 10, and that node_count alone satisfies
  the "at least one" rule (the private executor pushes it standalone).

Finding 3 — capped job with no cost CSV path silently bypasses
  enforcement
  _resolve_cost_csv_path returned None whenever options.output_cost and
  PDD_OUTPUT_COST_PATH were both unset; _start_watcher_for then early-
  returned without a watcher. The job kept advertising a cap that no
  one was enforcing.

  Fix: when an effective cap is active and no explicit CSV path is
  configured, derive project_root/.pdd/cost-<job_id>.csv, mkdir -p its
  parent (logging a warning if that fails), and inject the path into
  job.options["output_cost"] so the subprocess writes to the same file
  the watcher reads via --output-cost. Uncapped jobs continue to skip
  the derivation so we don't litter .pdd/ with empty files. Tests
  verify the capped/uncapped/explicit branches and that the parent
  directory is created.
… node_count, mkdir for explicit cost CSV

Fourth review pass; each finding paired with a regression test that
reproduces the broken behaviour on the prior code.

Finding 1 — default executor would spawn nonexistent `pdd issue`
  ALLOWED_COMMANDS rightly accepts command='issue' so the route can
  apply pdd-issue defaults and hand the job to the private GitHub
  App's custom JobManager executor. But when JobManager has no
  custom executor (i.e. the public default-subprocess path),
  _run_click_command would still try to spawn `pdd issue`, which
  exits with "No such command 'issue'" because the public Click
  CLI has no `issue` subcommand. Operators reading the failure
  reasonably conclude the public CLI is broken, when in fact the
  job was misrouted.

  Fix: in _run_click_command, detect command=='issue' BEFORE
  building the subprocess args and raise RuntimeError with a
  message that explains custom executors are required for this
  command and points at the alternatives (sync/generate/bug/...).
  Regression: a JobManager with no executor raises the new clear
  error; a JobManager with a custom executor handles issue normally
  (the public default path is never reached).

Finding 2 — node_count=3.9 silently truncated to 3
  The earlier pydantic validator called int(v) directly, which
  truncates 3.9 to 3 without complaint. Same hole in the
  programmatic JobManager.update_budget / update_node_count
  paths. effective_cap math then ran on the truncated value
  without telling the caller — the same class of silent failure
  the prior reviewer flagged on the parser's node/max verbs.

  Fix: tighten both layers to reject fractional inputs.
  BudgetUpdateRequest._coerce_node_count rejects fractional floats
  (3.9 -> error) and fractional strings ('3.9' -> error) while
  still accepting 3.0 (unambiguously integral, useful for JSON
  callers) and '5' (string-int). A module-level
  _coerce_node_count_strict helper applies the same rules in
  JobManager so programmatic callers and the subprocess driver
  thread get identical strictness.

Finding 3 — explicit options.output_cost paths lack parent dir
  _resolve_cost_csv_path only mkdir-p'd the parent for derived
  default paths (Finding 3 of the previous pass). When the client
  passed an explicit options.output_cost like 'custom/cost.csv'
  and 'custom/' did not exist, track_cost would attempt to write,
  catch the OSError in its own swallow-all-errors block, and the
  watcher would silently stay at $0.

  Fix: in the explicit-candidate branch of
  _resolve_cost_csv_path, also call path.parent.mkdir(parents=True,
  exist_ok=True). On OSError, log a clear warning and continue
  (the original behaviour) rather than refuse the path entirely
  — the caller may have written perms after this point, and we
  do not want to hard-fail a job submission over a path we
  optimistically created early.

All 110 budget tests + 565 server/track_cost tests pass locally.
… exception

Fifth review pass. Each finding paired with a regression test.

Finding 1 — late /pdd budget cannot enforce on initially-uncapped runs
  An uncapped job started without --output-cost; the subprocess
  command line was fixed at spawn time. When the user later posted
  /pdd budget 30, the watcher started but had no CSV writer, so
  spend stayed at $0 forever and the documented "add a cap by
  commenting /pdd budget 30" path was unenforceable.

  Fix: _resolve_cost_csv_path is now called at submit() time
  regardless of whether a cap is currently set. The derived per-job
  path is injected into job.options["output_cost"] BEFORE the
  subprocess starts, so the subprocess emits --output-cost and
  track_cost writes rows during the uncapped window. A late
  update_budget then starts the watcher against an already-
  populated CSV. Regression test seeds a synthetic row during the
  uncapped window, calls update_budget(budget_cap=5), and asserts
  the watcher attaches and the snapshot reflects the new cap.

Finding 2 — two same-command jobs sharing a CSV count each other's spend
  Watchers filter rows by command + timestamp only — there is no
  job_id column. Two pdd-bug jobs that landed in the same CSV
  (either via an explicit options.output_cost both clients passed
  or via a process-wide PDD_OUTPUT_COST_PATH the parent process
  set) would each see all rows and cancel early.

  Fix: per-job CSV isolation in two places.
    (a) _resolve_cost_csv_path always derives a per-job path under
        project_root/.pdd/cost-<job_id>.csv when no explicit per-job
        path is supplied on job.options, ignoring the process-wide
        PDD_OUTPUT_COST_PATH (which was being treated as a default
        that quietly forced all jobs onto one file).
    (b) _run_click_command's subprocess env now sets
        PDD_OUTPUT_COST_PATH to the per-job derived path, or
        explicitly DELETES the inherited env var when no per-job
        path was resolved. Both layers (the --output-cost arg AND
        the env-var fallback in track_cost) now route to the
        per-job file. Regression tests assert two concurrent jobs
        get distinct derived paths and that a parent-process
        PDD_OUTPUT_COST_PATH does not leak across jobs.

Finding 3 — track_cost skips the row when the wrapped command raises
  Old code only wrote the CSV row when exception_raised was None.
  A subprocess that called the LLM (spending money) and then
  raised before returning recorded nothing — so the watcher's
  running spend missed those dollars and the cap could be bypassed
  by simply crashing after the LLM call.

  Fix: drop the exception_raised gate around the row write. Always
  attempt to emit a row in finally; source cost+model from the
  return tuple when available, otherwise fall back to whatever
  partial state llm_invoke pushed to
  ctx.obj['partial_cost' | 'last_model'] before the exception
  propagated. Documents the two new ctx.obj contract keys so
  future llm_invoke / executor changes can populate them. A
  failed command with no partial data still writes a row with
  cost=0 so the watcher's command-filtered counts reflect that
  the subprocess ran. Regression test invokes a track_cost-
  wrapped click command that sets partial_cost=4.25 on ctx.obj
  and then raises; asserts the CSV contains the 4.25 row.
…_invoke partial cost

Sixth review pass. Each finding paired with a regression test that
reproduces the broken behaviour on the prior code.

Finding 1 — relative explicit options.output_cost resolved against
  server cwd (not project_root)
  Server cwd may differ from project_root (the subprocess cwd). A
  caller passing options.output_cost="custom/cost.csv" then had the
  watcher reading server-cwd/custom/cost.csv while the subprocess
  wrote project_root/custom/cost.csv. Spend stayed $0.

  Fix: _resolve_cost_csv_path now absolutizes relative explicit paths
  against project_root and writes the absolute form back to
  job.options["output_cost"] so the --output-cost arg the subprocess
  receives is the same absolute path the watcher reads.

Finding 2 — two same-command jobs sharing one CSV count each other's
  spend
  Watchers filtered rows by command + timestamp only. Two pdd-bug jobs
  that both passed the same explicit options.output_cost (or any
  shared CSV) each saw the other's rows and could cancel from foreign
  spend.

  Fix: add a per-job attribution column.
    - track_cost reads PDD_JOB_ID from env and writes it as a new
       column. Three header generations are now handled by
      the writer: legacy (no attempted_models), mid (attempted_models
      only), and new (attempted_models + job_id) — existing CSVs are
      never rewritten, only appended to in their existing layout.
    - cost_budget_watcher.watch() takes an optional job_id parameter.
      When set, the watcher counts only rows whose job_id column
      matches; legacy rows missing the column are skipped (the
      conservative choice — never count rows we cannot attribute when
      attribution is required).
    - JobManager._run_click_command sets env['PDD_JOB_ID']=job.id; the
      watcher receives job_id=job.id at start. Concurrent same-command
      jobs sharing a CSV are now per-job-isolated.

Finding 3 — track_cost's exception-path fallback was a no-op in
  production
  My prior fix made track_cost write a row even when the wrapped
  command raised, falling back to ctx.obj['partial_cost' | 'last_model']
  for cost/model. But nothing in production populated those keys —
  only  was published — so failed commands still
  recorded cost=0 and budget enforcement still missed the spend.

  Fix: add the producer side. New module-level
  _publish_call_outcome_to_ctx(cost, model) in llm_invoke; called at
  each LLM-call success return (cloud path, responses path, chat path).
  partial_cost accumulates across calls within one tracked command;
  last_model overwrites with the most recent. Best-effort no-op when
  no Click context is active. Regression test invokes the helper
  twice and asserts the accumulated + overwrite semantics.

Test patterns updated to absorb the new trailing job_id column on
new-format CSVs while staying compatible with legacy/mid CSVs that do
not carry it.
…om executors, prompts in sync

Seventh review pass. Each runtime fix paired with a regression test;
prompts updated so the next regeneration does not erase the runtime
contracts.

Finding 1 — caller passes a legacy/mid-format CSV; job_id filter
  freezes spend at zero
  My prior fix made the watcher require row.get('job_id') == self._job_id
  unconditionally. For CSVs whose header lacks the job_id column
  entirely (legacy or mid-format files the caller explicitly passed via
  options.output_cost), every row was dropped and the watcher never
  saw spend.

  Fix: gate the job_id check on the column actually being present in
  the CSV header (self._fieldnames). Legacy/mid-format CSVs fall back
  to the command + timestamp filter (the historical enforcement
  behaviour), while new-format CSVs keep strict per-job isolation.
  Concurrent same-command jobs sharing a NEW-format CSV remain
  isolated; jobs sharing a LEGACY-format CSV are not (the caller
  opted into the shared file). Regression test seeds a mid-format
  CSV with one  row and asserts watcher.spent() == 50.

Finding 2 — custom executors bypass _run_click_command, so subprocesses
  they spawn never see PDD_JOB_ID
  The public JobManager set env['PDD_JOB_ID'] inside _run_click_command
  only. The private GitHub App's pdd-issue path uses a custom executor
  (JobManager(executor=...)) and never reaches that code, so every
  nested PDD subprocess wrote a row with empty job_id and the
  watcher (now job_id-filtered for new-format CSVs) skipped them.

  Fix: in _execute_job, set os.environ['PDD_JOB_ID']=job.id around the
  custom-executor call (try/finally to restore the prior value). Any
  subprocess the custom executor spawns inherits PDD_JOB_ID via
  process env, so track_cost writes rows with the correct
  attribution and the watcher counts them. Documents the
  max_concurrent>1 caveat in the integration contract (the os.environ
  mutation is best-effort under concurrency; the executor should
  also pass PDD_JOB_ID explicitly via subprocess env). Regression
  test asserts the env var is set during the custom executor's run
  and restored after it returns.

Finding 3 — prompts diverged from runtime contracts (regeneration
  would erase the runtime fixes)
  Four prompt files updated so the next pdd sync regenerates code
  consistent with current behaviour:
    - cost_budget_watcher_python.prompt: watch() signature gains
      job_id parameter; Inputs and Outputs documents naive-timestamp
      .astimezone() interop and the header-gated job_id filter.
    - track_cost_python.prompt: CSV Reader Contract documents three
      header generations (legacy / mid / new), UTC-aware timestamp
      writer, the new job_id column (env-driven from PDD_JOB_ID), and
      the always-write-row-on-exception rule with partial_cost /
      last_model ctx.obj fallback.
    - server/jobs_python.prompt: submit resolves per-job CSV at
      submit time regardless of cap; watcher wiring passes
      job_id=Job.id; PDD_JOB_ID env propagation for both the default
      subprocess path and the custom-executor path is documented.
    - llm_invoke_python.prompt: new producer-side requirement to
      define _publish_call_outcome_to_ctx(cost, model) and call it
      at each successful LLM-call return (cloud / responses / chat),
      accumulating partial_cost and overwriting last_model so
      track_cost has real data to write on the exception path.

All 573 server + budget + track_cost tests pass locally.
…nc architecture

Eighth review pass. Each runtime fix paired with a regression test; the
spec surface (prompts + architecture.json) updated so regeneration
stays consistent.

Finding 1 — os.environ['PDD_JOB_ID'] around custom executor races
  under max_concurrent>1
  Prior code did try/finally save+restore of os.environ['PDD_JOB_ID']
  around each custom-executor coroutine. With two concurrent jobs A
  and B, A's await yielded while B overwrote PDD_JOB_ID; when A
  resumed it read B's id, and B's finally restored A's leaked value.
  Reproduced locally with the exact pattern the reviewer described.

  Fix: remove the os.environ mutation entirely. Add
  JobManager.subprocess_env(job, *, base_env=None) -> Dict[str, str]
  which returns a per-job env dict carrying PDD_JOB_ID=job.id and the
  resolved PDD_OUTPUT_COST_PATH (or explicitly removing any inherited
  value). Custom executors MUST pass this dict as env= to
  subprocess.Popen / asyncio.create_subprocess_*; the default
  subprocess path uses the helper internally so both code paths
  share one implementation. Process-global state is never mutated,
  so any number of concurrent jobs are isolated by construction.
  Regression tests:
    - Two concurrent jobs get distinct PDD_JOB_ID values from
      subprocess_env (no contamination).
    - os.environ['PDD_JOB_ID'] stays None throughout a custom
      executor run (the prior mutation is gone).

Finding 2 — shared mid-format CSVs still cannot isolate jobs
  track_cost preserved the existing header to honour 'never break
  existing files', but that meant rows written to a pre-existing
  mid-format CSV (no job_id column) never carried job_id even when
  PDD_JOB_ID was set. The watcher's header-gated job_id filter then
  fell back to command+timestamp for that file, and two same-command
  jobs sharing the file again counted each other's spend.

  Fix: when track_cost would append to a mid-format CSV AND
  PDD_JOB_ID is set, MIGRATE the file in place — rewrite header to
  add the job_id column, backfill empty job_id on each existing row,
  atomic os.replace of a temp file. The CSV reader contract permits
  this 'legacy-header migration path' explicitly. After migration,
  the watcher's strict job_id filter kicks in for that file and
  per-job isolation actually works. When PDD_JOB_ID is empty (CLI
  use outside the server), no migration runs and the legacy
  behaviour is preserved.

Finding 3 — architecture.json had stale watcher signature and
  Job metadata
  Refresh JobManager interface block:
    - watch signature gains job_id=None
    - JobManager.update_budget signature includes node_count
    - new entries: JobManager.update_node_count,
      JobManager.subprocess_env
  Re-validates as JSON. Future arch-driven conformance passes will
  now agree with the runtime.

All 575 server + budget + track_cost tests pass locally.
…cy CSV migration, watcher inode detection

Ninth review pass. Each runtime fix paired with a regression test.

Finding 1 — legacy custom executors that do not call subprocess_env
  lose job_id attribution on spawned subprocesses
  Removing the os.environ mutation entirely (8th pass) made
  concurrency safe but broke any private executor that did not yet
  know to call JobManager.subprocess_env(job). Child PDD commands
  wrote empty job_id, the watcher skipped them, and a $10 spend
  under a $5 cap completed normally.

  Fix: re-introduce os.environ['PDD_JOB_ID']=job.id as a SAFETY NET
  but only under max_concurrent==1 (sequential execution → no race).
  Save the prior value and restore in finally so the env never leaks
  past this job's execution. Under max_concurrent>1 the manager
  still does not touch os.environ; the executor MUST call
  subprocess_env() and pass the result as env= to each spawned
  subprocess. Regression tests cover both paths: the safety net
  is present under max_concurrent=1 and absent under
  max_concurrent>1.

Finding 2 — legacy-format CSV (no attempted_models, no job_id) is
  not migrated; jobs sharing it still cross-count
  The 8th-pass mid → new migration handled the mid-format header
  case but left the oldest layout untouched. Two same-command jobs
  explicitly sharing a legacy CSV produced rows without job_id, the
  watcher fell back to command+timestamp, and one counted the
  other's spend ($30 visible to job-a from $10 + $20).

  Fix: add _migrate_legacy_to_new_header that rewrites the file in
  place adding BOTH attempted_models and job_id columns (existing
  rows get empty values for both). Triggered when track_cost would
  append to a legacy CSV AND PDD_JOB_ID is set. After migration the
  watcher's strict job_id filter activates and shared legacy files
  gain real per-job isolation. When PDD_JOB_ID is empty (CLI use
  outside the server), no migration runs and the legacy warn-and-
  append path is preserved.

Finding 3 — watcher does not detect track_cost's in-place migration
  (new inode at same path)
  track_cost's migration uses os.replace, which yields a new inode
  at the same path. The watcher's cached _byte_offset and
  _fieldnames remained based on the pre-migration file, so it never
  re-parsed the new header and the job_id filter never activated
  after migration — job-a still saw job-b's spend on the same CSV.

  Fix: track _known_inode in the watcher. On each poll, stat the
  file and compare st_ino against the cached value; on mismatch
  call _reset_tail_state() to discard the stale offset and header
  cache so the new header (which now carries job_id) is re-parsed
  on the next poll. Regression test seeds a mid-format CSV with a
  job-a row, lets the watcher see it under legacy fallback, then
  runs the migration helper and appends a job-b row; asserts the
  post-migration watcher's strict filter yields $0 (job-a's
  migrated row has empty job_id, job-b's row matches the wrong
  job_id).

Test infra: added an autouse fixture in test_budget_control.py that
saves/restores os.environ['PDD_JOB_ID'] around each test so the
safety net's intentional persistence does not contaminate sibling
test files that assume the env is clean.

All 578 server + budget + track_cost tests pass locally.
…mpts current

Tenth review pass. Three runtime fixes paired with regression tests, plus
the three prompt files re-synced so the next regeneration does not
erase any of this pass's fixes.

Finding 1 — safety net set PDD_JOB_ID but not PDD_OUTPUT_COST_PATH
  Legacy custom executors that do not call subprocess_env() relied on
  the safety net for env propagation. The prior safety net set only
  PDD_JOB_ID; the child subprocess had no cost-CSV path so track_cost
  wrote nothing, the watcher saw $0, and a capped run completed
  without enforcement.

  Fix: when max_concurrent==1, set BOTH PDD_JOB_ID and
  PDD_OUTPUT_COST_PATH (or explicitly remove the latter if no per-job
  CSV was resolved). Save and restore both in finally so the env is
  never leaked past this job. Regression test asserts the executor
  sees BOTH vars and that both are restored after the run.

Finding 2 — concurrent migrations race on shared .migrate.tmp
  Both migration helpers used a single hard-coded .migrate.tmp path
  with no lock. Two processes simultaneously appending to the same
  legacy/mid CSV could each read, each write their own tmp version,
  and the second writer's os.replace would clobber a row the first
  writer appended between the migration's read and replace.

  Fix:
    - per-writer unique tmp filename: <csv>.migrate.tmp.<pid>.<uuid>
      (new helper _unique_tmp_path) so two concurrent migrations
      cannot collide on the same tmp file.
    - POSIX fcntl.flock on a sidecar <csv>.migrate.lock around the
      whole migration (new _MigrationLock context manager). The
      second concurrent attempt receives a non-blocking lock failure,
      returns False, and the helper skips migration. The calling
      track_cost re-reads the header AFTER the migration call to
      detect the lock-contention skip and writes its row in whatever
      format the file is currently in — never with the new
      fieldnames against an unmigrated header.
    - graceful no-op when fcntl is unavailable (non-POSIX hosts):
      the lock returns False and we fall back to writing in the
      existing format. No corruption, slightly degraded isolation.
  Regression tests cover _unique_tmp_path uniqueness and
  in-process flock serialization (one thread acquires, the other
  is blocked).

Finding 3 — prompts diverged from runtime contracts again
  After the prior pass added the safety net and migration paths,
  the prompts still said the manager 'never mutates os.environ',
  legacy/mid CSVs 'are never rewritten', and legacy shared CSVs
  'are not isolated'. Regeneration would erase the runtime fixes.

  Fix:
    - server/jobs_python.prompt: documents the two-path env story
      — primary subprocess_env() contract (concurrency-safe) and
      the max_concurrent==1 os.environ safety net for legacy
      executors (with try/finally restore semantics).
    - track_cost_python.prompt: explains the legacy → new and
      mid → new migration paths, the unique per-writer tmp
      filename, the fcntl.flock on the sidecar .migrate.lock,
      and the lock-contention skip-and-re-read pattern.
    - cost_budget_watcher_python.prompt: explains the inode-
      change detection required to pick up post-migration
      headers and activate the strict job_id filter.

Performance note (reviewer): legacy/mid migration is O(file_size)
on first server-managed write to a pre-existing CSV. This is a
one-time cost — subsequent writes append in new format. Acceptable.

All 581 server + budget + track_cost tests pass locally.
…inked, prompt updated

Eleventh review pass. Three runtime fixes + prompt re-sync to match.

Finding 1 — concurrent append during migration loses rows
  The prior LOCK_NB skip-and-fall-back let one writer fall through
  to a legacy append while the other writer's migration os.replaced
  the file with a pre-append snapshot, silently deleting the
  contender's row. Reproduced in the wild: before_replace contained
  model_b, after_replace did not.

  Fix: replace _MigrationLock with _WriteLock that wraps the ENTIRE
  write block (read header, maybe migrate, append row) and uses a
  BLOCKING fcntl.flock(LOCK_EX). Contenders wait for the holder to
  release, then proceed with the post-migration header visible.
  The migration helpers themselves no longer take the lock — the
  caller (track_cost wrapper) holds it for the whole block. The
  helpers' docstrings call out the caller-must-hold invariant.
  Backwards-compat _MigrationLock alias retained.

Finding 2 — unlinking the sidecar lock file allowed inode reuse
  The prior __exit__ unlinked <csv>.migrate.lock as cleanup. A
  later process opening the same path created a NEW inode, while
  the previous holder was still closing the OLD inode; two
  processes ended up with exclusive locks on different inodes for
  the 'same' lock path.

  Fix: do NOT unlink the lock file. The sidecar is tiny and
  persistent; the inode stays stable across all callers as long
  as nobody removes it. Class docstring documents the trade-off.

Finding 3 — track_cost prompt contradicted itself on rewrites
  Row 5 said 'never truncated or rewritten in-place outside the
  legacy-header migration path' while row 8 described the
  migration as a rewrite, leaving the contract ambiguous to a
  regeneration pass.

  Fix: rewrite row 5 to say the file is APPEND-ONLY for ordinary
  writes and the ONLY rewrite path is the documented legacy-header
  migration; row 6 (concurrency) now explicitly describes the
  per-file flock that serialises the whole write block; row 8
  describes the blocking lock semantics, the never-unlink rule,
  and the per-writer unique tmp filename. The three rows now
  agree on a single coherent contract.

Test infra: relaxed track_cost mock-open assertions from
assert_called_once_with to assert_any_call so the additional
open() call on the sidecar lock file does not break unrelated
tests. The _WriteLock catches all Exceptions on the open/fcntl
path so mock-environment fileno() values don't crash production
code (the lock degrades to 'unenforced' which test cases tolerate).

All 582 server + budget + track_cost tests pass locally.
…t_budget, ms-precision started_at

Twelfth review pass. Four runtime fixes (Findings 1+2+3 plus a
precision bug surfaced while testing them) and three regression tests.

Finding 1 — fast-exit jobs miss the final cost row
  A subprocess that writes its final cost row and exits in <2s never
  gets observed by the daemon thread's next poll (cleanup stops the
  watcher first), so the cap is silently not enforced. Reproduced
  with a $5 row under a $1 cap completing as COMPLETED.

  Fix: Watcher.flush() now returns bool (fired or not) so callers
  can await the status flip. JobManager._execute_job runs
  _final_watcher_flush BEFORE the result-handling block: flush
  synchronously consumes any new bytes, fires on_exceeded if the
  cap is crossed, and yields to the event loop up to 50 times
  waiting for _handle_budget_exceeded to set BUDGET_EXCEEDED.
  The terminal-status guard on the success branch then preserves
  it (the COMPLETED assignment is skipped). A secondary flush in
  the finally block remains for the exception path. Regression
  test seeds the CSV from a custom executor with current
  timestamp + correct job_id and asserts status reaches
  BUDGET_EXCEEDED.

Finding 2 — /pdd settings reports $0 on uncapped runs
  Uncapped runs have no daemon watcher, so get_budget had no source
  of live spend during the run — job.cost is only set on subprocess
  exit, and the budget store snapshot is only created when a
  watcher starts.

  Fix: new module-level helper cost_budget_watcher.read_spent_now()
  does a one-shot synchronous CSV read using the same filtering
  logic as Watcher (commands, started_at, job_id). get_budget falls
  back to it when no watcher exists. /pdd settings on an active
  uncapped job now reports the actual accumulated spend.

Finding 3 — late /pdd budget on fast-exiting uncapped job
  When update_budget started a watcher for the first time, the
  daemon thread might not poll before the job exited, so existing
  rows in the uncapped window were not enforced.

  Fix: update_budget calls watcher.flush() synchronously after
  starting or updating, so any pre-update rows trigger the cap
  immediately if crossed. Regression test pre-writes $5 of spend,
  applies a $1 cap via update_budget, and asserts BUDGET_EXCEEDED
  fires.

Precision bug (surfaced by Finding 1's test)
  track_cost writes timestamps at millisecond precision (via
  isoformat(timespec='milliseconds')). JobManager records
  job.started_at with microsecond precision (datetime.now). A row
  written in the SAME millisecond as started_at has a parsed
  timestamp strictly less than started_at and was silently dropped
  by the watcher's ts < started_at filter.

  Fix: _normalize_started_at truncates to millisecond precision so
  the comparison aligns with the writer's resolution. Documented
  in the function docstring with the cross-module rationale.

All 585 server + budget + track_cost tests pass locally.
…t_budget

Thirteenth review pass. Three runtime correctness fixes around the
watcher / read paths.

Finding 1 — concurrent flush + daemon double-count the same bytes
  _consume_new_bytes was unlocked, so the daemon thread's poll and an
  inline flush() could each read from the same _byte_offset, parse
  the same rows, and each add cost to _spent. Reproduced as a single
  $5 row inflating to $10 of spend.

  Fix: lock changes to RLock (reentrant — flush already takes the
  lock after consume returns), and wrap the entire consume operation
  in self._lock via a new _consume_new_bytes_locked inner method.
  Two threads (or daemon+inline) now serialise; a row contributes to
  _spent exactly once. Regression test races two threads on
  watcher.flush() and asserts spent==5.0.

Finding 2 — read_spent_now spawned a daemon thread per call
  The prior implementation constructed a real Watcher (which starts a
  background thread) and then called _consume_new_bytes inline; the
  two paths raced and double-counted, turning a $20k CSV into a
  reported $40k. Per-call thread spawn was also pure overhead.

  Fix: rewrite read_spent_now as a pure function — direct
  csv.DictReader + filter + sum, no Watcher, no thread, no shared
  state. Filter semantics (commands set, started_at, job_id when
  column present) match Watcher's _row_matches exactly so results
  agree. Two regression tests: (a) 20 sequential reads of a $10 CSV
  must each return $10 AND create no persistent threads;
  (b) 100×$5 rows must sum to $500 (a double-count bug yields
  $1000).

Finding 3 — /pdd settings on capped run reports stale spend
  get_budget called watcher.spent() which returned the daemon's last
  cached value — up to poll_interval (2s) stale. A row written
  immediately before /pdd settings would be invisible until the next
  poll.

  Fix: get_budget now ALWAYS calls read_spent_now() for the freshest
  spend, regardless of whether a watcher is running. The watcher's
  cache is no longer consulted (it cannot be fresher than a direct
  CSV read). Capped runs and uncapped runs share the same code path.
  Regression test seeds a $3 row immediately after the watcher
  starts and asserts /pdd settings reports $3 without waiting for
  the next poll.

All 589 server + budget + track_cost tests pass locally.
…e, job.cost reconciled, perf

Fourteenth review pass. Three runtime correctness fixes + a perf
improvement (the reviewer's full-CSV-scan concern), each paired with a
regression test.

Finding 1 — update_budget can return before BUDGET_EXCEEDED is applied
  update_budget called watcher.flush() which schedules
  _handle_budget_exceeded via run_coroutine_threadsafe. The caller
  returned immediately; the executor's exit path then set COMPLETED
  before the handler ran, and the handler's status-active gate
  short-circuited the cancel — final status was COMPLETED, no event.
  Reproduced as: spent $5, cap $1, status COMPLETED.

  Fix: after flush(), if cumulative spend already crosses the new
  cap, await up to 50 × 10ms for job.status to enter
  _TERMINAL_STATUSES. The wait covers BOTH paths — the inline
  flush fired, AND the watcher daemon thread fired between
  _start_watcher_for and our flush (in which case flush returns
  False because _state.fired is already True, but the scheduled
  handler is still queued). Regression test seeds a $5 row,
  applies a $1 cap, and asserts status==BUDGET_EXCEEDED
  immediately after update_budget returns — no extra sleep.

Finding 2 — queued job's watcher had no time filter
  job.started_at is None until the executor runs. _start_watcher_for
  passed started_at=job.started_at, so the watcher had no lower
  time bound and counted historical rows in a shared legacy CSV.
  A day-old $99 row could cancel a queued job under /pdd budget 50
  before the job even started.

  Fix: _start_watcher_for uses
  job.started_at or datetime.now(timezone.utc) as the baseline. A
  queued job's watcher starts counting only from the moment the
  cap was set; once the job actually runs, started_at >= the
  baseline so no rows are missed. Regression test creates a
  queued job behind a long-running hold job, points it at a
  shared legacy CSV with a day-old $99 row under /pdd budget 50,
  and asserts the queued job is not cancelled by the historical
  row.

Finding 3 — /jobs/{job_id}.cost did not match /pdd settings spend
  job.cost was only set from the executor's returned dict — often
  0 for custom executors that spawn subprocesses they don't track
  themselves — even though track_cost rows recorded real spend.
  /jobs/{job_id} reported cost=0; /pdd settings reported real
  spend; the two endpoints disagreed.

  Fix: in _execute_job's finally block, sync job.cost to
  max(job.cost, _compute_csv_spend(job)) BEFORE emit_complete and
  watcher cleanup. _compute_csv_spend is a new helper that uses
  the watcher's flush+spent (incremental — cheap) when a watcher
  exists, falling back to read_spent_now for uncapped runs.
  Same helper is used by get_budget so the two endpoints can
  never diverge. Regression test asserts job.cost==5.0 after a
  custom executor returns {'cost': 0.0} but writes a $5 CSV row.

Perf — get_budget no longer always full-scans the CSV
  Reviewer measured ~0.26s on a 10MB / 200k-row shared CSV.
  get_budget now uses watcher.flush() + watcher.spent() when a
  watcher exists (incremental tail — only NEW bytes are parsed,
  not the whole file). Uncapped runs still use read_spent_now,
  but their per-job CSVs are bounded by one job's row count. The
  expensive full scan only happens for explicit shared-CSV
  uncapped runs.

All 592 server + budget + track_cost tests pass locally.
Five findings raised in the post-implementation code review of the
GitHub App budget surface. Each is paired with a regression test.

Finding 2 — cap-clear via REST not possible
  `POST /commands/jobs/{job_id}/budget` collapsed "field omitted"
  and "field explicitly None" onto the same "leave unchanged"
  behaviour, so a caller (slash-command webhook or programmatic)
  could never drop a previously-set budget_cap back to "no cap".
  The prompt's `BudgetStore.update` contract documents this
  distinction explicitly. Switch the route to forward only fields
  present in `request.model_fields_set`, and rewrite the
  `BudgetUpdateRequest` model_validator to enforce
  "at-least-one-field-set" via `model_fields_set` rather than
  value-is-None — so an explicit `{"budget_cap": null}` now
  passes validation and reaches `update_budget` as None (clear),
  while `{}` still 422s.

Finding 3 — parser prompt R6 silent on non-issue rejection
  The generated parser rejects `/pdd budget node N` and
  `/pdd budget max N` on non-issue active commands with a
  message redirecting to `/pdd budget N`, but the prompt's R6
  did not document this stricter behaviour. The next `pdd sync`
  could regenerate a permissive parser and pass review. Update
  R6 to spell out the rejection plus the rationale
  (`effective_cap()` ignores node_budget/max_total_cap for
  non-issue commands; accepting silently would set fields the
  cap math ignores).

Finding 4 — amount validation duplicated three places
  `validate_amount` (canonical, in `budget_settings.py`) was
  mirrored line-for-line in two pydantic field validators on
  `CommandRequest` and `BudgetUpdateRequest`, so changing the
  ceiling required touching three places. Extract a single
  module-level `_coerce_budget_amount_value` helper in
  `models.py` that both pydantic validators delegate to. The
  helper is kept in `models.py` (not imported from
  `budget_settings`) to avoid an import cycle — documented in
  the helper's docstring.

Finding 5 — `cancel_job` 200s on a BUDGET_EXCEEDED job
  The early-409 guard at the cancel route enumerated only
  `COMPLETED`/`FAILED`/`CANCELLED`, so a cancel posted after
  the watcher already terminated the job returned 200 with
  `cancelled=True` — misleading. Add `BUDGET_EXCEEDED` to the
  guard and document at `JobManager.cancel` that the manager
  still needs to handle `BUDGET_EXCEEDED` (the budget-handler
  path calls `cancel()` to actually terminate the subprocess
  AFTER flipping the status — so the manager's early-exit set
  intentionally excludes `BUDGET_EXCEEDED`).

All 145 budget-control tests, 455 server + track_cost tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three correctness/contract gaps surfaced by a stronger reviewer.
Each is paired with a regression test.

Finding A — pdd-issue budget_cap was a silent no-op
  `effective_cap("issue", ...)` IGNORES `budget_cap`; only
  `node_budget` and `max_total_cap` participate in the issue
  cap formula. A webhook handler literally forwarding
  `/pdd budget N` on a pdd-issue job as `{"budget_cap": N}`
  would therefore set a field the cap math never reads and the
  watcher would never enforce the requested limit. Reproduced
  live: `effective_cap("issue", budget_cap=30, node_budget=80,
  max_total_cap=400, node_count=1)` returns 80.0, not 30.

  Both routes (`POST /commands/execute` and
  `POST /commands/jobs/{job_id}/budget`) now re-alias a bare
  `budget_cap` to `max_total_cap` when the job's command is
  `issue`. The alias yields to an explicit `max_total_cap` so
  callers that send both keep the more-specific value. Prompt
  R3 / R6 references updated in commands_python.prompt; the
  parser already produces `budget_max_set` for this case, but
  the routes must finish the alias for callers that bypass the
  parser.

Finding B — daemon-fired race could end the job COMPLETED
  When the watcher's daemon poll observed the cap crossing but
  the scheduled `_handle_budget_exceeded` coroutine had not run
  before the executor exited, `_final_watcher_flush.flush()`
  returned False (the fire-once guard short-circuits when
  `_state.fired` is already True). The final-flush helper
  treated False as "nothing to wait for" and let the executor
  set `COMPLETED`; the handler then bailed because the job
  was no longer in QUEUED/RUNNING — final status was
  COMPLETED, no `budget_exceeded` event emitted.

  Expose a `Watcher.fired()` signal (watcher-wide "has the cap
  been crossed?" rather than flush()'s "did THIS call fire?")
  and use it as a second wait condition in `_final_watcher_flush`.
  Now the flush either fires inline (returns True) OR sees the
  daemon already fired (`fired()` True), and in both cases the
  helper waits up to 50×10ms for terminal status before
  control returns to the executor.

Finding C — runtime APIs not declared in prompt interfaces
  `Watcher.flush`, `Watcher.fired`, `read_spent_now`,
  `JobManager.subprocess_env`, `JobManager.update_node_count`,
  the `node_count` field on `BudgetUpdateRequest`, and the
  `node_count` kwarg on `update_budget` were all used at
  runtime but not declared in the prompts'
  `<pdd-interface>` blocks. A future `pdd sync` /
  `pdd generate` could legitimately drop any of them — the
  re-broken surface would re-introduce the race in Finding B
  AND silently regress the fresh-spend path on `/pdd settings`.
  Declare each in the matching prompt interface and in
  architecture.json so the conformance check enforces them
  on regeneration. Also tighten the R6 contract docs in
  `commands_python.prompt` for the alias + null-clear
  semantics (so a regenerated route preserves them).

All 152 budget-control + 455 server + 29 track_cost tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ands

A failed second tracked command sharing one `ctx.obj` with a prior
successful command used to read the prior command's
`partial_cost` and `last_model` and write them into its own CSV
row. Concretely: command A invokes the LLM,
`llm_invoke._publish_call_outcome_to_ctx` populates
`ctx.obj["partial_cost"] = 7.0` and `ctx.obj["last_model"] =
"model-a"`. Command B starts on the same `ctx.obj`, raises before
any LLM call, and `track_cost`'s exception-path fallback reads
`partial_cost` / `last_model` from `ctx.obj` — which still hold
A's values. B's row therefore claims B spent $7.0 on `model-a`,
inflating cumulative spend and potentially tripping a budget cap
on a command that itself spent nothing.

`attempted_models` was already snapshotted/cleared/restored for
exactly this reason; extend the same pattern to `partial_cost`
and `last_model`. Pop+restore (not just pop) so a nested
track_cost invocation inside a parent tracked command preserves
the parent's accumulated state on the way back up.

Also rewrites the daemon-fired race regression test to use a
custom executor that writes its cost row mid-flight (so it
passes the watcher's `started_at` filter) and sleeps long
enough for the daemon to poll and queue the handler before the
executor returns — the prior version of this test pre-seeded
the CSV before submit, which `started_at` correctly filtered
out so the race window was never exercised.

All 609 budget-control + server + track_cost tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three more contract/correctness gaps surfaced by a stronger reviewer
on the updated PR head. Each is paired with a regression test.

Finding A — non-scalar budget JSON returned HTTP 500
  `{"budget_cap": []}` reached `float([])`, which raises
  `TypeError`, which FastAPI converts to HTTP 500 — a malformed
  client request looked like a server error. The shared coercion
  helper now adds an explicit `isinstance(value, (int, float))`
  branch; non-numeric, non-string types fall to an explicit
  `ValueError`, which pydantic surfaces as HTTP 422 (the right
  outcome for invalid request JSON).

Finding B — null `budget_cap` on pdd-issue was a silent no-op
  The route's pdd-issue alias rule only fired when
  `budget_cap` was non-None, so a caller posting
  `{"budget_cap": null}` on an issue job intending "clear the
  alias" left the old `max_total_cap` active and the visible
  clear was a silent no-op. Aliasing an explicit None too is
  what makes the clear actually clear: the route now pops
  `budget_cap` (numeric OR explicit None) into the `max_total_cap`
  slot whenever the caller did not explicitly set the latter.

Finding C — no typed `budget_exceeded` event on the WebSocket
  `JobManager.callbacks.emit_budget_exceeded` fires when the
  watcher trips, and `BudgetExceededMessage` exists in the
  models module, but `create_websocket_routes` only registered
  `on_output` and `on_complete`. Subscribed clients never
  received the typed payload — only the subsequent `complete`
  message, with no way to distinguish budget abort from clean
  completion. Add `emit_job_budget_exceeded(job, spent,
  effective_cap)` and register an `on_budget_exceeded` callback
  that assembles the typed `BudgetExceededMessage` (carrying
  pdd-issue specifics: node_budget, max_total_cap, node_count)
  and broadcasts via `manager.broadcast_job_message`. The
  websocket prompt's `<pdd-interface>` requirements list is
  extended to lock the new helper + registration in for future
  regeneration. The route-test fixture's mock models module
  is updated to expose `BudgetExceededMessage` so the
  collection-time import in websocket.py resolves.

All 620 budget-control + server + track_cost tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Serhan-Asad and others added 3 commits May 23, 2026 12:31
Three runtime-correctness gaps surfaced by another deeper review
pass. Each is paired with a regression test that reproduces the
broken behaviour on the prior code.

Finding A — orphan watcher when /pdd budget is set on a queued job
  `update_budget` on a still-queued job with no active watcher
  (cap was None at submit) calls `_start_watcher_for` to spin
  one up. When `_execute_job` later runs, it calls
  `_start_watcher_for` AGAIN. The dict assignment
  `self._watchers[job.id] = new_watcher` overwrote the prior
  Watcher without calling its `.stop()` — the previous daemon
  thread kept polling the CSV, double-counting spend against
  the new sibling watcher, and could fire `on_exceeded` long
  after the job had moved past QUEUED/RUNNING (where the
  handler then silently no-ops). `_start_watcher_for` now
  calls `_stop_watcher_for(job.id)` at the top so the second
  call replaces the previous Watcher cleanly. The stop happens
  BEFORE the cap re-compute so a code path that ends up
  returning None (cap is now None) still flushes the previous
  watcher rather than leaking it.

Finding B — WebSocket reconnect hangs on BUDGET_EXCEEDED jobs
  `websocket_job_stream`'s "job already done — send result +
  close" branch enumerated `[COMPLETED, FAILED, CANCELLED]`. A
  client reconnecting after the watcher tripped the cap fell
  through to the input-loop and waited forever on
  `websocket.receive_text()`. Add `BUDGET_EXCEEDED` to the
  terminal-state set; the reconnect path now sends the typed
  `BudgetExceededMessage` FIRST (so the client renders the
  cap-trip UI without an extra REST round-trip), then the
  standard `complete` summary, then closes.

Finding C — `complete` could be emitted before `budget_exceeded`
  `_handle_budget_exceeded` awaited `cancel()` BEFORE
  `emit_budget_exceeded`. cancel() injects CancelledError into
  the executor task; the resulting finally block emits the
  `complete` callback. The event loop could schedule the
  finally-block emit ahead of the still-awaiting
  `emit_budget_exceeded`, so subscribers that close on
  `complete` (the common client pattern) missed the typed
  payload entirely. Swap the order: emit the typed callback
  first, then cancel. The typed event is now the LAST event
  observed before `complete`, which is the contract the
  WebSocket consumer needs.

All 623 budget-control + server + track_cost tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three runtime lifecycle gaps from another deep-review pass. Each is
paired with a regression test that fails on the prior code.

Finding A — watcher orphaned when /pdd budget is set on a queued
job and /pdd stop cancels before _execute_job runs
  `update_budget` starts a Watcher on a still-queued job via the
  "no watcher running yet" branch. If `/pdd stop` cancels the
  task before `_execute_job` ever runs, the cleanup site —
  `_execute_job`'s finally block — never executes and the
  watcher's daemon thread keeps polling forever. Move watcher
  cleanup into the `_on_task_done` callback registered by
  `submit`, which fires unconditionally on task completion
  (cancelled-while-queued, cancelled-while-running, or normal
  exit). `_stop_watcher_for` is already idempotent (pop()-based)
  so the original finally-block call now becomes a no-op when
  task-done already cleaned up.

Finding B — slow subscriber callback delayed subprocess
termination
  `_handle_budget_exceeded` `await`ed `emit_budget_exceeded`
  BEFORE `cancel()`. A slow `on_budget_exceeded` subscriber (e.g.
  a stalled WebSocket consumer) blocked the await chain for
  seconds, during which the subprocess kept running AND spending.
  Add a synchronous `_signal_cancel(job_id)` helper that sets
  the cancel event and SIGTERMs the subprocess WITHOUT awaiting
  anything. The handler now: (1) sets status + updates store;
  (2) signal-cancels synchronously (subprocess immediately
  starts dying); (3) awaits emit_budget_exceeded (slow client
  is fine — the process is already gone); (4) awaits full
  cancel() for task teardown + SIGKILL escalation. The typed
  event still beats `complete` to the wire because step 3
  runs before step 4 emits `complete` via the executor's
  finally block.

Finding C — reconnect payload aliased effective_cap to job.cost
  The WebSocket reconnect branch for a BUDGET_EXCEEDED job set
  `effective_cap=float(job.cost)`, so cap $400 with spend
  $401.23 reported back `effective_cap=$401.23` — collapsing
  the two values onto one number. The reconnect client had no
  way to know what the active cap was at the moment of
  crossing. Recompute the cap via
  `budget_settings.effective_cap(job.command, ...)` on the
  reconnect path; fall back to `job.cost` only when the
  budget_settings module is unavailable.

Also updates `jobs_python.prompt` and
`websocket_python.prompt`'s requirements lists to document
the three-step termination sequence, the idempotent
`_start_watcher_for`/task-done cleanup, and the
recompute-effective-cap-on-reconnect contract — so a future
`pdd sync` regeneration cannot silently regress any of them.

All 627 budget-control + server + track_cost tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three runtime-ordering gaps from another deep review pass. Each is
paired with a regression test that fails on the prior code.

Finding A — complete could still beat budget_exceeded to the wire
  Even after the prior pass added `_signal_cancel`, a slow
  `on_budget_exceeded` subscriber could let `_execute_job.finally`
  emit `complete` BEFORE `emit_budget_exceeded` finished. The
  contract that subscribers closing on `complete` still see the
  typed event was racey. Add a per-job
  `_budget_broadcast_done: Dict[str, asyncio.Event]` that
  `_handle_budget_exceeded` exposes BEFORE signalling cancel and
  sets AFTER awaiting `emit_budget_exceeded`. `_execute_job`'s
  finally now waits on this event (bounded by a timeout) before
  calling `emit_complete`, so the typed event always lands first.

  Critical follow-on: drop the prior `await self.cancel(job_id)`
  at the end of `_handle_budget_exceeded`. `cancel()` calls
  `task.cancel()`, which injects `CancelledError` into the
  executor task while its finally is awaiting the broadcast-done
  gate — the wait re-raises `CancelledError` and SKIPS
  `emit_complete`, breaking the new ordering. Subprocess
  termination is already handled by `_signal_cancel` (SIGTERM)
  plus `_escalate_kill` (SIGKILL escalation, finding B), so
  letting the executor exit naturally is both sufficient and
  necessary.

Finding B — SIGKILL escalation delayed by slow callback chain
  `_signal_cancel` only sent SIGTERM. A subprocess that ignores
  SIGTERM was only force-killed inside `cancel()`, which used to
  be awaited AFTER `emit_budget_exceeded`. A slow callback could
  leave the stubborn child running for the full callback
  duration. Add a synchronous `_signal_cancel` companion:
  `_escalate_kill(job_id)`, spawned via `loop.create_task` when
  a real subprocess is actually running. It polls the process
  for ~2s and SIGKILLs if still alive. The kill is now
  independent of any await chain, so spend stops accumulating
  immediately.

Finding C — live WebSocket stream never closes on terminal job
  `websocket_job_stream`'s input loop only `await`ed
  `receive_text()`. When the job reached a terminal state the
  route had no signal to break the receive_text wait — the
  connection stayed open indefinitely after `complete`. The
  loop now races `receive_text` against a 250ms job-status
  poll via `asyncio.wait(..., FIRST_COMPLETED)` and closes
  the WebSocket when the job becomes terminal. Subscribers
  no longer leak.

The `jobs_python.prompt` requirements list is also updated to
document the new termination sequence (5-step) and the
explicit "do not call self.cancel() from the budget handler"
invariant so the next `pdd sync` cannot regenerate the prior
behaviour.

All 175 budget-control tests + 91 route tests pass in their
canonical subsets. (One pre-existing test-isolation flake
between `tests/test_budget_control.py::TestMidFormatCsvMigration`
and `tests/server/routes/` collection — documented as a known
issue in `tests/server/conftest.py` — is unrelated to this
change and reproduces on prior commits when the same combo is
run.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add /pdd budget control comments for GitHub App runs (clean Opus rerun)

2 participants