feat: add /pdd budget control comments for GitHub App runs (#1128) by prompt-driven-github[bot] · Pull Request #1131 · promptdriven/pdd

prompt-driven-github · 2026-05-21T22:32:13Z

Summary

Adds the full GitHub App control-comment surface for budget management: startup settings comments, /pdd budget, /pdd budget node, /pdd budget max, /pdd settings, /pdd stop, with reactive active-run updates and budget enforcement for both normal commands and pdd-issue defaults. Per @Serhan-Asad's clarification, this is intentionally broader than a generic cost watcher — it covers the public prompts/specs/server helpers/interfaces in this repo that drive App/executor behavior.

Closes #1128

Changes Made

Prompts Modified

pdd/prompts/server/jobs_python.prompt — thread budget_cap / node_budget / max_total_cap / node_count through Job and JobManager.submit; add update_budget; integrate cost_budget_watcher; introduce BUDGET_EXCEEDED status; trigger existing cancel path on cap hit.
pdd/prompts/server/models_python.prompt — add BudgetSettings, BudgetUpdateRequest, BudgetExceededMessage, SlashCommandResult models and a new job-status enum value.
pdd/prompts/server/routes/commands_python.prompt — new GET / POST /commands/jobs/{job_id}/budget endpoints; defaults wiring for pdd-issue ($80 per node, $400 max).
pdd/prompts/track_cost_python.prompt — formalize the cost-CSV read contract so external watchers can rely on the schema.

Prompts Created

pdd/prompts/cost_budget_watcher_python.prompt — daemon-thread CSV poller with update_cap, idempotent .stop(), no-op when cap is None.
pdd/prompts/server/budget_settings_python.prompt — per-job settings store, effective-cap formula (min(node_budget * node_count, max_total_cap)), pdd-issue defaults, validators.
pdd/prompts/server/slash_command_parser_python.prompt — pure parser for /pdd ... commands with fenced-block / bot-comment / dedupe handling and authorization helpers.
pdd/prompts/server/budget_comments_python.prompt — pure Markdown renderers matching the issue's literal startup/settings/stop strings.

Documentation Updated

README.md — new "GitHub App control comments" subsection documenting startup comments and /pdd commands.
CHANGELOG.md — Unreleased entry referencing Add /pdd budget control comments for GitHub App runs (clean Opus rerun) #1128.
architecture.json — updated module graph for new modules.

Review Checklist

Prompt syntax is valid
PDD conventions followed
Documentation is up to date
Acceptance criteria from Add /pdd budget control comments for GitHub App runs (clean Opus rerun) #1128 are covered by the prompt specs

Next Steps After Merge

Regenerate code from modified prompts in dependency order:

./sync_order.sh

Or manually:

pdd sync cost_budget_watcher
pdd sync track_cost
pdd sync budget_comments
pdd sync budget_settings
pdd sync slash_command_parser
pdd sync cli
pdd sync executor
pdd sync auth
pdd sync commands
pdd sync files
pdd sync fix
pdd sync jobs
pdd sync utility
pdd sync app
pdd sync __init__
pdd sync checkup
pdd sync connect
pdd sync generate
pdd sync maintenance
pdd sync modify
pdd sync analysis
pdd sync models
pdd sync websocket

Run pytest -vv tests/ to verify functionality.
Wire the new prompts/models/endpoints into the private GitHub App webhook handler (out-of-repo) so the public surface is exercised end-to-end.

Created by pdd change workflow

Adds the full GitHub App control-comment surface for budget management: startup settings comments, /pdd budget, /pdd budget node, /pdd budget max, /pdd settings, /pdd stop, with reactive active-run updates and budget enforcement for both normal commands and pdd-issue defaults. Prompt changes (MODIFY): - prompts/server/jobs_python.prompt: thread budget_cap/node_budget/ max_total_cap/node_count, add update_budget, integrate watcher, add BUDGET_EXCEEDED status. - prompts/server/models_python.prompt: BudgetSettings, BudgetUpdateRequest, BudgetExceededMessage, SlashCommandResult, new enum value. - prompts/server/routes/commands_python.prompt: GET/POST /commands/jobs/{job_id}/budget; defaults wiring for pdd-issue. - prompts/track_cost_python.prompt: formalize read contract for watchers. Prompt changes (CREATE): - prompts/cost_budget_watcher_python.prompt: daemon-thread CSV poller with update_cap and idempotent stop. - prompts/server/budget_settings_python.prompt: per-job store + effective-cap formula + pdd-issue defaults + validators. - prompts/server/slash_command_parser_python.prompt: pure parser with fenced-block / bot / dedupe handling + authorization helpers. - prompts/server/budget_comments_python.prompt: pure Markdown renderers matching the issue's literal strings. Docs: - README.md: new "GitHub App control comments" subsection. - CHANGELOG.md: Unreleased entry referencing #1128. Closes #1128 Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Resolves architecture.json conflict by keeping main's updated description for agentic_checkup_orchestrator while preserving this PR's added module entries for cost_budget_watcher, budget_settings, slash_command_parser, and budget_comments.

Closes the four spec-level bugs flagged on PR #1131: Finding 2 (enforcement boundary): cost_budget_watcher and the README claimed enforcement happens "between LLM/tool calls", but track_cost only appends a row when a PDD subprocess exits. Documents enforcement as a SUBPROCESS boundary, not a mid-call boundary; explains the pdd-issue case (works well: many nested subprocesses) and the single- command case (cap can be overshot by one subprocess's spend) so readers do not over-claim. Finding 3 (pdd-issue cost attribution): the watcher used to take a single command=str filter. pdd-issue itself never writes a row with command="issue"; it spawns nested PDD subprocesses (change, sync, bug, fix, generate, test, ...) and each writes its own row. Filtering on "issue" summed to $0 and the cap would never fire. Changes the watcher to accept commands: Optional[Iterable[str]] (a set of accepted command names) and documents in jobs_python.prompt which set to pass for pdd-issue vs single-command jobs, with an explicit warning not to reintroduce the {"issue"} bug. Finding 4 (SlashCommandResult.metadata): the parser stored the validated amount in a metadata dict, but the model did not define a metadata field. Pydantic would either reject the extra field or drop the amount. Adds metadata: Dict[str, Any] (default_factory=dict) to SlashCommandResult in models_python.prompt and updates the parser doc to make the contract explicit (always a concrete dict; {"amount": <float>} for budget- mutating kinds, empty dict otherwise). Finding 5 (auth gate inconsistency): parser R5 used to gate ALL /pdd comments by is_authorized before parsing, which contradicted both the README ("other commenters can use /pdd settings") and render_unauthorized ("refer the user to /pdd settings"). Rewrites R5 to parse first, then gate by verb: budget-mutating verbs (budget_*, stop) require is_authorized; the read-only /pdd settings verb is open to anyone whose comment is parsed (subject to bot/fenced/dedupe filters). Updates the README and budget_comments R7 to match. Also reflects the watcher's command -> commands rename and the model's new metadata field in architecture.json so the next pdd sync does not flag drift. Includes a clean merge of upstream/main; resolves the architecture.json description conflict in agentic_checkup_orchestrator. Finding 1 (no generated runtime code) is intentionally NOT addressed in this commit: this PR was scoped as prompt-only with code generation in the original "Next Steps After Merge" plan. See the PR reply for the follow-up plan.

…pted-commands set The reader-contract narrative said "Watchers filter the CSV by command" (singular). The watcher's parameter is now commands (a set of accepted names), so future regenerations of track_cost could bake the singular assumption back in and silently reintroduce the pdd-issue zero-spend bug. Documents the set form, and includes an explicit anti-pattern note for the {"issue"} case.

Serhan-Asad · 2026-05-22T18:16:37Z

Thanks — all five findings hold up to primary-source verification. Findings 2–5 are now addressed in two commits on this branch (the merge of main, and the prompt-edit commit). Finding 1 is acknowledged below.

Finding 1 — prompt/docs only, no generated `.py`

Valid as scoped (intentional). The PR was scoped as a prompt/spec-level change, with code generation listed under "Next Steps After Merge". The current branch still ships only the prompt/doc/architecture.json diff; no pdd.cost_budget_watcher / pdd.server.budget_settings / slash_command_parser / budget_comments modules exist yet. Run-time behavior depends on (a) pdd sync regenerating the four new modules and updating jobs.py / models.py / commands.py, and (b) the private GitHub App webhook handler wiring the slash-command path to GitHub. (b) cannot land from this repo, so a follow-up PR scoped to (a) is the right unit to close the issue's runtime acceptance criteria; the prompts in this PR are the contract that generated code must satisfy.

The code-generation follow-up MUST also ship a real-LLM test file (per the project's prompt-only-PR convention) that asserts the generated parser preserves the Finding 4 (metadata.amount) and Finding 5 (verb-scoped auth) contracts, so a future pdd sync cannot silently regress them via LLM drift.

Finding 2 — CSV write timing vs "between LLM/tool calls"

Valid (architectural). track_cost writes the row from its finally block after the wrapped command returns (pdd/track_cost.py:60–131); a CSV-tail watcher cannot see spend inside a still-running subprocess. Fixed by being honest about the boundary: cost_budget_watcher_python.prompt, jobs_python.prompt, track_cost_python.prompt, and the README now describe enforcement as a subprocess boundary (works well for pdd-issue, whose executor spawns many nested PDD subprocesses; single long commands can overshoot the cap by exactly one subprocess's spend). The "between LLM/tool calls" claim is gone from README. Mid-call enforcement would require a separate live-cost signal (per-LLM-call flush in track_cost, or an in-process subscriber) — explicitly out of scope for this PR.

Finding 3 — `pdd-issue` cost attribution

Valid. pdd-issue never writes a row with command="issue"; nested subprocesses write rows with command in {change, sync, bug, fix, generate, test, ...} (pdd/track_cost.py:82). A command="issue" filter sums to $0. Fixed by changing the watcher signature from command: Optional[str] to commands: Optional[Iterable[str]] (a frozen set of accepted commands) and documenting in jobs_python.prompt exactly which set to pass for pdd-issue vs single-command jobs — with an explicit warning not to reintroduce the {"issue"} bug. track_cost_python.prompt's reader contract section and architecture.json updated to match.

Finding 4 — `SlashCommandResult.metadata` inconsistency

Valid. models_python.prompt did not declare a metadata field, but the parser stored the validated amount on metadata. Pydantic would have either rejected the extra field or dropped it. Fixed by adding metadata: Dict[str, Any] (with default_factory=dict) to the SlashCommandResult interface block AND the model documentation, and tightening the parser doc so it states explicitly that metadata is always a concrete dict ({"amount": <float>} for budget-mutating kinds, empty for the rest). architecture.json updated to match.

Finding 5 — auth gate vs `/pdd settings`

Valid. R5 used to gate ALL /pdd comments via is_authorized(...) before parsing, contradicting both the README ("other commenters can use /pdd settings for a read-only view") and render_unauthorized (which refers users to /pdd settings). Fixed by rewriting R5 to parse first, then gate by verb: budget-mutating verbs (budget_*, stop) require authorization; the read-only /pdd settings verb is open to anyone whose comment is parsed (subject to the existing bot / fenced-block / dedupe filters). README and render_unauthorized's R7 updated to match, and the auth gate is mentioned by name in the model doc so the design intent is visible in three places.

Process

Mergeable: CONFLICTING was caused by the divergence with main (description text for agentic_checkup_orchestrator in architecture.json). Resolved by merging main and keeping its updated description while preserving this PR's new module entries. Now MERGEABLE.
The previous auto-heal failure ran against the conflicting tree; CI is re-running against the new head.
Tests: this PR is still prompt-only by design. The full test cycle (pdd sync for each new module + the real-LLM contract test noted under Finding 1 + pytest -vv tests/) is the follow-up PR's responsibility.

… budget surface Implements the four new public modules whose contracts the earlier prompt edits defined, plus the matching updates to existing server modules and a focused test suite. Brings the PR from spec-only to a runnable surface that the private GitHub App webhook handler can wire to GitHub. New modules - pdd/cost_budget_watcher.py Daemon-thread CSV poller. Accepts a set-based commands filter (not a single name) so pdd-issue runs sum spend across nested subprocess rows (change/sync/bug/fix/generate/test/...). R1-R7 honored: fire-once, cap-update fires only on next crossing, no-cap is a no-op, missing CSV is $0, .stop() is idempotent, never kills the subprocess itself, tolerant of partial CSV rows. - pdd/server/budget_settings.py effective_cap() formula (issue: min(node_budget * max(node_count or 1, 1), max_total_cap); else: budget_cap), pdd_issue_defaults() -> (80.0, 400.0), validate_amount() with the $10000 hard ceiling, and a thread-safe BudgetStore. - pdd/server/budget_comments.py Pure Markdown renderers matching the issue's acceptance criteria verbatim (R10 money formatting; $<int> for integers, $<value:.2f> for fractions, always two decimals for 'Spent so far'). The render_unauthorized reply explicitly redirects rejected commenters to /pdd settings so the read-only verb's openness is visible. - pdd/server/slash_command_parser.py Pure parser. parse_comment never raises; classifies into budget_set / budget_node_set / budget_max_set / settings / stop / invalid / ignored. Sets metadata['amount'] on budget-mutating kinds; bare /pdd budget N on a pdd-issue job aliases to budget_max_set per R6. is_authorized is a stateless helper — authorisation is the webhook handler's job, scoped to mutating verbs only (R5). Modified modules - pdd/server/models.py Adds JobStatus.BUDGET_EXCEEDED, BudgetSettings, BudgetUpdateRequest (with field validators that coerce '$30'/'30.00' to float and enforce > 0 and <= 10000), BudgetExceededMessage (WSMessage variant emitted exactly once on cap crossing), and SlashCommandResult with a concrete metadata: Dict[str, Any] field (default_factory=dict) so Pydantic never drops the parser's amount. - pdd/server/jobs.py Job dataclass gets optional budget fields. JobManager learns the submit budget_cap/node_budget/max_total_cap kwargs, update_budget (with KeyError/RuntimeError/ValueError exception types as the public contract), get_budget (snapshot for /pdd settings), and watcher wiring keyed on resolved cost-CSV path (job.options output_cost, falling back to PDD_OUTPUT_COST_PATH). On cap crossing, _handle_budget_exceeded reuses the cancel path that /pdd stop drives, marks status BUDGET_EXCEEDED, and emits the new on_budget_exceeded callback. Watcher is stopped in finally so it never outlives the job. - pdd/server/routes/commands.py Threads budget_cap/node_budget/max_total_cap from CommandRequest to JobManager.submit; applies pdd_issue_defaults() ($80/$400) when no explicit fields are present on a pdd-issue submission. Adds GET /commands/jobs/{job_id}/budget (KeyError -> 404) and POST /commands/jobs/{job_id}/budget (KeyError -> 404, ValueError -> 400, RuntimeError 'job not active: ...' -> 409), discriminating on exception type only — never on message text. Tests - tests/test_budget_control.py: 73 unit tests covering effective_cap, validate_amount, BudgetStore, the parser (including Finding 3/4/5 regression guards: filtering on {'issue'} stays $0; metadata.amount is set on budget kinds; /pdd settings returns kind='settings' with empty metadata and is NOT gated by auth at parse time), all comment renderers, and watcher behaviors (fire-once, cap update, no-cap no-op, started_at filter, missing CSV, idempotent stop). - tests/test_budget_control_real.py: @pytest.mark.real test that regenerates the parser via pdd generate and re-asserts the Finding 4 (metadata.amount) and Finding 5 (read-only settings, R6 issue alias) contracts. Guards against LLM drift on future syncs. - tests/server/routes/test_commands.py: updated submit assertion to include the new budget kwargs (test was previously asserting the pre-budget call signature). All 73 new tests + 499 existing server tests pass locally; full server suite is clean. Watcher enforcement is documented as a subprocess boundary (not 'between LLM/tool calls'), matching the README and cost_budget_watcher / track_cost prompt updates already on this branch.

…, perf Addresses follow-up review on the runtime budget surface; each fix ships with a regression test that asserts the broken case stays broken before the fix and passes after. Finding 1 — naive vs aware timestamp comparison breaks enforcement track_cost writes timestamps via datetime.now().strftime('%Y-%m-%dT...'), which is NAIVE local time (track_cost.py wraps Click commands; the reader-contract docstring says UTC but the actual writer is naive). The watcher's job.started_at is timezone-aware UTC. Comparing the two raises TypeError, which the watcher caught and swallowed at the poll level — so every row was skipped, spend stayed at $0, and the cap never fired in real production CSVs. Fix: _parse_timestamp coerces naive CSV cells to UTC before comparison (matching what the reader contract documented all along); _normalize_- started_at does the same to the caller's started_at so naive callers also work. Regression test seeds a naive-format row that asserts spent() > 0. Finding 2 — BUDGET_EXCEEDED status lost to CANCELLED race _handle_budget_exceeded used to call cancel() FIRST and then assign BUDGET_EXCEEDED. cancel() injects asyncio.CancelledError into the running task; that lands in _execute_job's except handler which set status=CANCELLED unconditionally. Depending on scheduling, the CancelledError handler ran AFTER our BUDGET_EXCEEDED assignment and silently demoted the status. /pdd settings and the final budget-exceeded comment then lost the actual reason. Fix: set status=BUDGET_EXCEEDED (and budget store snapshot) BEFORE calling cancel(). Add a module-level _TERMINAL_STATUSES set and gate every status assignment in _execute_job, _execute_wrapper, cancel(), and _on_task_done so they never overwrite a terminal status — so even if scheduling changes in the future, BUDGET_EXCEEDED is permanent once set. Async regression test submits a job with a sleeping fake executor, manually trips _handle_budget_exceeded, waits 500ms for the cancel race to play out, and asserts status remains BUDGET_EXCEEDED. Finding 3 — POST /commands/execute rejects command='issue' ALLOWED_COMMANDS did not include 'issue', so the validation gate at the top of execute_command raised HTTP 400 before the pdd_issue_defaults branch could run. The GitHub App's documented call path (command='issue' with empty budget fields → defaults to $80/$400) was unreachable. Fix: add 'issue' to ALLOWED_COMMANDS with a description that names the defaults branch. Regression tests assert both (a) command='issue' with no budget fields applies the defaults and (b) command='issue' with an explicit node_budget skips them. Finding 4 — full CSV reread every poll is O(rows × polls) Old _read_spent walked the entire DictReader every 2s. For a long pdd-issue run with hundreds of nested-subprocess rows, this is unbounded disk I/O per watcher. Fix: rewrite as a true incremental tail. Track _byte_offset of the first unread byte and a cached _known_size; on each poll, stat the file, seek to _byte_offset, parse only the new bytes, advance the offset, and add to a running _spent total. Partial trailing rows (a row being flushed) are left for the next poll (rfind newline, leave leftover bytes). Truncation/rotation (file shrinks) resets the tail state and re-scans from zero so the watcher can never get stuck pointing past EOF. Regression test patches csv.reader to count invocations, appends 5 rows over 5 polls, and asserts the reader call count stays bounded. Finding 5 (nit) — CHANGELOG drift on enforcement boundary CHANGELOG still said "enforces the active cap between LLM calls" even after README/cost_budget_watcher/track_cost prompts had been updated to say subprocess boundary. Reworded to match. All 79 budget tests + 544 server/budget tests pass locally; no other tests touched.

…cap callback Three follow-up review findings on the runtime budget surface; each fix ships with a regression test that reproduces the broken behavior on the prior code and passes after the fix. Finding 1 — non-UTC workers may still drop current-time rows track_cost wrote naive local-time timestamps via datetime.now() and the watcher reinterpreted naive cells as UTC outright. A naive '11:30 local PDT' row was treated as '11:30 UTC' — which is 7 hours EARLIER than the job's actual UTC start, so the row was filtered out by the started_at gate and spend stayed at $0. Fix: track_cost now writes timezone-aware ISO via datetime.now(timezone.utc).isoformat(timespec='milliseconds'), so rows carry an explicit '+00:00' offset and there is nothing to reinterpret. The watcher's _parse_timestamp uses .astimezone() in both branches: aware values are converted to UTC for a uniform comparison frame; legacy naive cells (from older CSVs written before this fix) are treated as local time and converted to UTC — never reinterpreted as UTC outright. Existing track_cost regression patterns updated to allow the new '+00:00' suffix. Finding 2 — initial budget fields skipped validate_amount CommandRequest and JobManager.submit accepted any float (negatives, > $10000, NaN, inf) on initial submission. Only update_budget ran validate_amount. A submission with budget_cap=-1 stored effective_cap=-1.0 and behaved as if there were a cap, but the arithmetic was nonsense. Fix: add a field_validator on CommandRequest.{budget_cap, node_budget, max_total_cap} that applies the same > 0 and <= $10000 rule as BudgetUpdateRequest, and string-form coercion ('$30', '30.00'). JobManager.submit runs validate_amount on each non-None budget kwarg before constructing the Job, so programmatic callers and tests are guarded too — not just the REST route. Finding 3 — budget-exceeded callback reported stale cap after /pdd budget mid-run _start_watcher_for closed over the cap value computed at submit and passed it to _handle_budget_exceeded(job_id, spent, cap). When update_budget lowered the cap mid-run, the watcher fired at the new cap but the callback still reported the original. Fix: drop the closure capture. _handle_budget_exceeded now takes (job_id, spent) only and recomputes the current effective cap from the (potentially updated) job.budget fields at fire time, so the emitted BudgetExceededMessage carries the active cap a /pdd budget comment would have set. Regression test submits cap=100, lowers it to 5 via update_budget, trips _handle_budget_exceeded with spent=10, and asserts the on_budget_exceeded callback receives cap=5 (not 100).

… auto-wire CSV Third review pass on the runtime budget surface. Each fix has a regression test that reproduces the broken behaviour on the prior code. Finding 1 — /pdd budget node|max accepted for non-issue commands effective_cap() only consults node_budget / max_total_cap when command == "issue"; for any other command they are ignored. The parser nevertheless returned budget_node_set / budget_max_set for any active_command, so a user issuing "/pdd budget node 50" during a pdd-bug job would see the webhook acknowledge the change while the cap silently stayed put. Fix: _parse_budget rejects the node and max verbs with kind="invalid" and a message that points the user at "/pdd budget N" when active_command != "issue". Tests cover the rejection on bug/change/ unknown plus the still-accepted issue path. Pre-existing tests that exercised node/max without an active_command were updated to pass active_command="issue". Finding 2 — node_count not updateable through the public API pdd-issue's effective cap depends on node_count (min(node_budget * max(node_count or 1, 1), max_total_cap)). On a fresh submission node_count is None so effective_cap collapses to node_budget ($80) instead of the intended min($80 * N, $400). With no API path to push the growing node count, the run would stop too early unless the private executor mutated job.node_count out of band. Fix: BudgetUpdateRequest gains node_count: Optional[int] with its own validator (int, >= 0, <= 10000) and a model_validator that keeps the existing "at least one field set" rule covering all four fields. JobManager.update_budget accepts node_count via the _UNSET sentinel pattern (None = clear, _UNSET = leave alone). The /budget POST route threads the field through. A synchronous update_node_count helper is added for the subprocess driver thread (no awaitable round-trip needed). Tests verify the cap grows from 80 -> 240 -> 400 as node_count goes None -> 3 -> 10, and that node_count alone satisfies the "at least one" rule (the private executor pushes it standalone). Finding 3 — capped job with no cost CSV path silently bypasses enforcement _resolve_cost_csv_path returned None whenever options.output_cost and PDD_OUTPUT_COST_PATH were both unset; _start_watcher_for then early- returned without a watcher. The job kept advertising a cap that no one was enforcing. Fix: when an effective cap is active and no explicit CSV path is configured, derive project_root/.pdd/cost-<job_id>.csv, mkdir -p its parent (logging a warning if that fails), and inject the path into job.options["output_cost"] so the subprocess writes to the same file the watcher reads via --output-cost. Uncapped jobs continue to skip the derivation so we don't litter .pdd/ with empty files. Tests verify the capped/uncapped/explicit branches and that the parent directory is created.

… node_count, mkdir for explicit cost CSV Fourth review pass; each finding paired with a regression test that reproduces the broken behaviour on the prior code. Finding 1 — default executor would spawn nonexistent `pdd issue` ALLOWED_COMMANDS rightly accepts command='issue' so the route can apply pdd-issue defaults and hand the job to the private GitHub App's custom JobManager executor. But when JobManager has no custom executor (i.e. the public default-subprocess path), _run_click_command would still try to spawn `pdd issue`, which exits with "No such command 'issue'" because the public Click CLI has no `issue` subcommand. Operators reading the failure reasonably conclude the public CLI is broken, when in fact the job was misrouted. Fix: in _run_click_command, detect command=='issue' BEFORE building the subprocess args and raise RuntimeError with a message that explains custom executors are required for this command and points at the alternatives (sync/generate/bug/...). Regression: a JobManager with no executor raises the new clear error; a JobManager with a custom executor handles issue normally (the public default path is never reached). Finding 2 — node_count=3.9 silently truncated to 3 The earlier pydantic validator called int(v) directly, which truncates 3.9 to 3 without complaint. Same hole in the programmatic JobManager.update_budget / update_node_count paths. effective_cap math then ran on the truncated value without telling the caller — the same class of silent failure the prior reviewer flagged on the parser's node/max verbs. Fix: tighten both layers to reject fractional inputs. BudgetUpdateRequest._coerce_node_count rejects fractional floats (3.9 -> error) and fractional strings ('3.9' -> error) while still accepting 3.0 (unambiguously integral, useful for JSON callers) and '5' (string-int). A module-level _coerce_node_count_strict helper applies the same rules in JobManager so programmatic callers and the subprocess driver thread get identical strictness. Finding 3 — explicit options.output_cost paths lack parent dir _resolve_cost_csv_path only mkdir-p'd the parent for derived default paths (Finding 3 of the previous pass). When the client passed an explicit options.output_cost like 'custom/cost.csv' and 'custom/' did not exist, track_cost would attempt to write, catch the OSError in its own swallow-all-errors block, and the watcher would silently stay at $0. Fix: in the explicit-candidate branch of _resolve_cost_csv_path, also call path.parent.mkdir(parents=True, exist_ok=True). On OSError, log a clear warning and continue (the original behaviour) rather than refuse the path entirely — the caller may have written perms after this point, and we do not want to hard-fail a job submission over a path we optimistically created early. All 110 budget tests + 565 server/track_cost tests pass locally.

… exception Fifth review pass. Each finding paired with a regression test. Finding 1 — late /pdd budget cannot enforce on initially-uncapped runs An uncapped job started without --output-cost; the subprocess command line was fixed at spawn time. When the user later posted /pdd budget 30, the watcher started but had no CSV writer, so spend stayed at $0 forever and the documented "add a cap by commenting /pdd budget 30" path was unenforceable. Fix: _resolve_cost_csv_path is now called at submit() time regardless of whether a cap is currently set. The derived per-job path is injected into job.options["output_cost"] BEFORE the subprocess starts, so the subprocess emits --output-cost and track_cost writes rows during the uncapped window. A late update_budget then starts the watcher against an already- populated CSV. Regression test seeds a synthetic row during the uncapped window, calls update_budget(budget_cap=5), and asserts the watcher attaches and the snapshot reflects the new cap. Finding 2 — two same-command jobs sharing a CSV count each other's spend Watchers filter rows by command + timestamp only — there is no job_id column. Two pdd-bug jobs that landed in the same CSV (either via an explicit options.output_cost both clients passed or via a process-wide PDD_OUTPUT_COST_PATH the parent process set) would each see all rows and cancel early. Fix: per-job CSV isolation in two places. (a) _resolve_cost_csv_path always derives a per-job path under project_root/.pdd/cost-<job_id>.csv when no explicit per-job path is supplied on job.options, ignoring the process-wide PDD_OUTPUT_COST_PATH (which was being treated as a default that quietly forced all jobs onto one file). (b) _run_click_command's subprocess env now sets PDD_OUTPUT_COST_PATH to the per-job derived path, or explicitly DELETES the inherited env var when no per-job path was resolved. Both layers (the --output-cost arg AND the env-var fallback in track_cost) now route to the per-job file. Regression tests assert two concurrent jobs get distinct derived paths and that a parent-process PDD_OUTPUT_COST_PATH does not leak across jobs. Finding 3 — track_cost skips the row when the wrapped command raises Old code only wrote the CSV row when exception_raised was None. A subprocess that called the LLM (spending money) and then raised before returning recorded nothing — so the watcher's running spend missed those dollars and the cap could be bypassed by simply crashing after the LLM call. Fix: drop the exception_raised gate around the row write. Always attempt to emit a row in finally; source cost+model from the return tuple when available, otherwise fall back to whatever partial state llm_invoke pushed to ctx.obj['partial_cost' | 'last_model'] before the exception propagated. Documents the two new ctx.obj contract keys so future llm_invoke / executor changes can populate them. A failed command with no partial data still writes a row with cost=0 so the watcher's command-filtered counts reflect that the subprocess ran. Regression test invokes a track_cost- wrapped click command that sets partial_cost=4.25 on ctx.obj and then raises; asserts the CSV contains the 4.25 row.

…_invoke partial cost Sixth review pass. Each finding paired with a regression test that reproduces the broken behaviour on the prior code. Finding 1 — relative explicit options.output_cost resolved against server cwd (not project_root) Server cwd may differ from project_root (the subprocess cwd). A caller passing options.output_cost="custom/cost.csv" then had the watcher reading server-cwd/custom/cost.csv while the subprocess wrote project_root/custom/cost.csv. Spend stayed $0. Fix: _resolve_cost_csv_path now absolutizes relative explicit paths against project_root and writes the absolute form back to job.options["output_cost"] so the --output-cost arg the subprocess receives is the same absolute path the watcher reads. Finding 2 — two same-command jobs sharing one CSV count each other's spend Watchers filtered rows by command + timestamp only. Two pdd-bug jobs that both passed the same explicit options.output_cost (or any shared CSV) each saw the other's rows and could cancel from foreign spend. Fix: add a per-job attribution column. - track_cost reads PDD_JOB_ID from env and writes it as a new column. Three header generations are now handled by the writer: legacy (no attempted_models), mid (attempted_models only), and new (attempted_models + job_id) — existing CSVs are never rewritten, only appended to in their existing layout. - cost_budget_watcher.watch() takes an optional job_id parameter. When set, the watcher counts only rows whose job_id column matches; legacy rows missing the column are skipped (the conservative choice — never count rows we cannot attribute when attribution is required). - JobManager._run_click_command sets env['PDD_JOB_ID']=job.id; the watcher receives job_id=job.id at start. Concurrent same-command jobs sharing a CSV are now per-job-isolated. Finding 3 — track_cost's exception-path fallback was a no-op in production My prior fix made track_cost write a row even when the wrapped command raised, falling back to ctx.obj['partial_cost' | 'last_model'] for cost/model. But nothing in production populated those keys — only was published — so failed commands still recorded cost=0 and budget enforcement still missed the spend. Fix: add the producer side. New module-level _publish_call_outcome_to_ctx(cost, model) in llm_invoke; called at each LLM-call success return (cloud path, responses path, chat path). partial_cost accumulates across calls within one tracked command; last_model overwrites with the most recent. Best-effort no-op when no Click context is active. Regression test invokes the helper twice and asserts the accumulated + overwrite semantics. Test patterns updated to absorb the new trailing job_id column on new-format CSVs while staying compatible with legacy/mid CSVs that do not carry it.

…om executors, prompts in sync Seventh review pass. Each runtime fix paired with a regression test; prompts updated so the next regeneration does not erase the runtime contracts. Finding 1 — caller passes a legacy/mid-format CSV; job_id filter freezes spend at zero My prior fix made the watcher require row.get('job_id') == self._job_id unconditionally. For CSVs whose header lacks the job_id column entirely (legacy or mid-format files the caller explicitly passed via options.output_cost), every row was dropped and the watcher never saw spend. Fix: gate the job_id check on the column actually being present in the CSV header (self._fieldnames). Legacy/mid-format CSVs fall back to the command + timestamp filter (the historical enforcement behaviour), while new-format CSVs keep strict per-job isolation. Concurrent same-command jobs sharing a NEW-format CSV remain isolated; jobs sharing a LEGACY-format CSV are not (the caller opted into the shared file). Regression test seeds a mid-format CSV with one row and asserts watcher.spent() == 50. Finding 2 — custom executors bypass _run_click_command, so subprocesses they spawn never see PDD_JOB_ID The public JobManager set env['PDD_JOB_ID'] inside _run_click_command only. The private GitHub App's pdd-issue path uses a custom executor (JobManager(executor=...)) and never reaches that code, so every nested PDD subprocess wrote a row with empty job_id and the watcher (now job_id-filtered for new-format CSVs) skipped them. Fix: in _execute_job, set os.environ['PDD_JOB_ID']=job.id around the custom-executor call (try/finally to restore the prior value). Any subprocess the custom executor spawns inherits PDD_JOB_ID via process env, so track_cost writes rows with the correct attribution and the watcher counts them. Documents the max_concurrent>1 caveat in the integration contract (the os.environ mutation is best-effort under concurrency; the executor should also pass PDD_JOB_ID explicitly via subprocess env). Regression test asserts the env var is set during the custom executor's run and restored after it returns. Finding 3 — prompts diverged from runtime contracts (regeneration would erase the runtime fixes) Four prompt files updated so the next pdd sync regenerates code consistent with current behaviour: - cost_budget_watcher_python.prompt: watch() signature gains job_id parameter; Inputs and Outputs documents naive-timestamp .astimezone() interop and the header-gated job_id filter. - track_cost_python.prompt: CSV Reader Contract documents three header generations (legacy / mid / new), UTC-aware timestamp writer, the new job_id column (env-driven from PDD_JOB_ID), and the always-write-row-on-exception rule with partial_cost / last_model ctx.obj fallback. - server/jobs_python.prompt: submit resolves per-job CSV at submit time regardless of cap; watcher wiring passes job_id=Job.id; PDD_JOB_ID env propagation for both the default subprocess path and the custom-executor path is documented. - llm_invoke_python.prompt: new producer-side requirement to define _publish_call_outcome_to_ctx(cost, model) and call it at each successful LLM-call return (cloud / responses / chat), accumulating partial_cost and overwriting last_model so track_cost has real data to write on the exception path. All 573 server + budget + track_cost tests pass locally.

…nc architecture Eighth review pass. Each runtime fix paired with a regression test; the spec surface (prompts + architecture.json) updated so regeneration stays consistent. Finding 1 — os.environ['PDD_JOB_ID'] around custom executor races under max_concurrent>1 Prior code did try/finally save+restore of os.environ['PDD_JOB_ID'] around each custom-executor coroutine. With two concurrent jobs A and B, A's await yielded while B overwrote PDD_JOB_ID; when A resumed it read B's id, and B's finally restored A's leaked value. Reproduced locally with the exact pattern the reviewer described. Fix: remove the os.environ mutation entirely. Add JobManager.subprocess_env(job, *, base_env=None) -> Dict[str, str] which returns a per-job env dict carrying PDD_JOB_ID=job.id and the resolved PDD_OUTPUT_COST_PATH (or explicitly removing any inherited value). Custom executors MUST pass this dict as env= to subprocess.Popen / asyncio.create_subprocess_*; the default subprocess path uses the helper internally so both code paths share one implementation. Process-global state is never mutated, so any number of concurrent jobs are isolated by construction. Regression tests: - Two concurrent jobs get distinct PDD_JOB_ID values from subprocess_env (no contamination). - os.environ['PDD_JOB_ID'] stays None throughout a custom executor run (the prior mutation is gone). Finding 2 — shared mid-format CSVs still cannot isolate jobs track_cost preserved the existing header to honour 'never break existing files', but that meant rows written to a pre-existing mid-format CSV (no job_id column) never carried job_id even when PDD_JOB_ID was set. The watcher's header-gated job_id filter then fell back to command+timestamp for that file, and two same-command jobs sharing the file again counted each other's spend. Fix: when track_cost would append to a mid-format CSV AND PDD_JOB_ID is set, MIGRATE the file in place — rewrite header to add the job_id column, backfill empty job_id on each existing row, atomic os.replace of a temp file. The CSV reader contract permits this 'legacy-header migration path' explicitly. After migration, the watcher's strict job_id filter kicks in for that file and per-job isolation actually works. When PDD_JOB_ID is empty (CLI use outside the server), no migration runs and the legacy behaviour is preserved. Finding 3 — architecture.json had stale watcher signature and Job metadata Refresh JobManager interface block: - watch signature gains job_id=None - JobManager.update_budget signature includes node_count - new entries: JobManager.update_node_count, JobManager.subprocess_env Re-validates as JSON. Future arch-driven conformance passes will now agree with the runtime. All 575 server + budget + track_cost tests pass locally.

…cy CSV migration, watcher inode detection Ninth review pass. Each runtime fix paired with a regression test. Finding 1 — legacy custom executors that do not call subprocess_env lose job_id attribution on spawned subprocesses Removing the os.environ mutation entirely (8th pass) made concurrency safe but broke any private executor that did not yet know to call JobManager.subprocess_env(job). Child PDD commands wrote empty job_id, the watcher skipped them, and a $10 spend under a $5 cap completed normally. Fix: re-introduce os.environ['PDD_JOB_ID']=job.id as a SAFETY NET but only under max_concurrent==1 (sequential execution → no race). Save the prior value and restore in finally so the env never leaks past this job's execution. Under max_concurrent>1 the manager still does not touch os.environ; the executor MUST call subprocess_env() and pass the result as env= to each spawned subprocess. Regression tests cover both paths: the safety net is present under max_concurrent=1 and absent under max_concurrent>1. Finding 2 — legacy-format CSV (no attempted_models, no job_id) is not migrated; jobs sharing it still cross-count The 8th-pass mid → new migration handled the mid-format header case but left the oldest layout untouched. Two same-command jobs explicitly sharing a legacy CSV produced rows without job_id, the watcher fell back to command+timestamp, and one counted the other's spend ($30 visible to job-a from $10 + $20). Fix: add _migrate_legacy_to_new_header that rewrites the file in place adding BOTH attempted_models and job_id columns (existing rows get empty values for both). Triggered when track_cost would append to a legacy CSV AND PDD_JOB_ID is set. After migration the watcher's strict job_id filter activates and shared legacy files gain real per-job isolation. When PDD_JOB_ID is empty (CLI use outside the server), no migration runs and the legacy warn-and- append path is preserved. Finding 3 — watcher does not detect track_cost's in-place migration (new inode at same path) track_cost's migration uses os.replace, which yields a new inode at the same path. The watcher's cached _byte_offset and _fieldnames remained based on the pre-migration file, so it never re-parsed the new header and the job_id filter never activated after migration — job-a still saw job-b's spend on the same CSV. Fix: track _known_inode in the watcher. On each poll, stat the file and compare st_ino against the cached value; on mismatch call _reset_tail_state() to discard the stale offset and header cache so the new header (which now carries job_id) is re-parsed on the next poll. Regression test seeds a mid-format CSV with a job-a row, lets the watcher see it under legacy fallback, then runs the migration helper and appends a job-b row; asserts the post-migration watcher's strict filter yields $0 (job-a's migrated row has empty job_id, job-b's row matches the wrong job_id). Test infra: added an autouse fixture in test_budget_control.py that saves/restores os.environ['PDD_JOB_ID'] around each test so the safety net's intentional persistence does not contaminate sibling test files that assume the env is clean. All 578 server + budget + track_cost tests pass locally.

…mpts current Tenth review pass. Three runtime fixes paired with regression tests, plus the three prompt files re-synced so the next regeneration does not erase any of this pass's fixes. Finding 1 — safety net set PDD_JOB_ID but not PDD_OUTPUT_COST_PATH Legacy custom executors that do not call subprocess_env() relied on the safety net for env propagation. The prior safety net set only PDD_JOB_ID; the child subprocess had no cost-CSV path so track_cost wrote nothing, the watcher saw $0, and a capped run completed without enforcement. Fix: when max_concurrent==1, set BOTH PDD_JOB_ID and PDD_OUTPUT_COST_PATH (or explicitly remove the latter if no per-job CSV was resolved). Save and restore both in finally so the env is never leaked past this job. Regression test asserts the executor sees BOTH vars and that both are restored after the run. Finding 2 — concurrent migrations race on shared .migrate.tmp Both migration helpers used a single hard-coded .migrate.tmp path with no lock. Two processes simultaneously appending to the same legacy/mid CSV could each read, each write their own tmp version, and the second writer's os.replace would clobber a row the first writer appended between the migration's read and replace. Fix: - per-writer unique tmp filename: <csv>.migrate.tmp.<pid>.<uuid> (new helper _unique_tmp_path) so two concurrent migrations cannot collide on the same tmp file. - POSIX fcntl.flock on a sidecar <csv>.migrate.lock around the whole migration (new _MigrationLock context manager). The second concurrent attempt receives a non-blocking lock failure, returns False, and the helper skips migration. The calling track_cost re-reads the header AFTER the migration call to detect the lock-contention skip and writes its row in whatever format the file is currently in — never with the new fieldnames against an unmigrated header. - graceful no-op when fcntl is unavailable (non-POSIX hosts): the lock returns False and we fall back to writing in the existing format. No corruption, slightly degraded isolation. Regression tests cover _unique_tmp_path uniqueness and in-process flock serialization (one thread acquires, the other is blocked). Finding 3 — prompts diverged from runtime contracts again After the prior pass added the safety net and migration paths, the prompts still said the manager 'never mutates os.environ', legacy/mid CSVs 'are never rewritten', and legacy shared CSVs 'are not isolated'. Regeneration would erase the runtime fixes. Fix: - server/jobs_python.prompt: documents the two-path env story — primary subprocess_env() contract (concurrency-safe) and the max_concurrent==1 os.environ safety net for legacy executors (with try/finally restore semantics). - track_cost_python.prompt: explains the legacy → new and mid → new migration paths, the unique per-writer tmp filename, the fcntl.flock on the sidecar .migrate.lock, and the lock-contention skip-and-re-read pattern. - cost_budget_watcher_python.prompt: explains the inode- change detection required to pick up post-migration headers and activate the strict job_id filter. Performance note (reviewer): legacy/mid migration is O(file_size) on first server-managed write to a pre-existing CSV. This is a one-time cost — subsequent writes append in new format. Acceptable. All 581 server + budget + track_cost tests pass locally.

…inked, prompt updated Eleventh review pass. Three runtime fixes + prompt re-sync to match. Finding 1 — concurrent append during migration loses rows The prior LOCK_NB skip-and-fall-back let one writer fall through to a legacy append while the other writer's migration os.replaced the file with a pre-append snapshot, silently deleting the contender's row. Reproduced in the wild: before_replace contained model_b, after_replace did not. Fix: replace _MigrationLock with _WriteLock that wraps the ENTIRE write block (read header, maybe migrate, append row) and uses a BLOCKING fcntl.flock(LOCK_EX). Contenders wait for the holder to release, then proceed with the post-migration header visible. The migration helpers themselves no longer take the lock — the caller (track_cost wrapper) holds it for the whole block. The helpers' docstrings call out the caller-must-hold invariant. Backwards-compat _MigrationLock alias retained. Finding 2 — unlinking the sidecar lock file allowed inode reuse The prior __exit__ unlinked <csv>.migrate.lock as cleanup. A later process opening the same path created a NEW inode, while the previous holder was still closing the OLD inode; two processes ended up with exclusive locks on different inodes for the 'same' lock path. Fix: do NOT unlink the lock file. The sidecar is tiny and persistent; the inode stays stable across all callers as long as nobody removes it. Class docstring documents the trade-off. Finding 3 — track_cost prompt contradicted itself on rewrites Row 5 said 'never truncated or rewritten in-place outside the legacy-header migration path' while row 8 described the migration as a rewrite, leaving the contract ambiguous to a regeneration pass. Fix: rewrite row 5 to say the file is APPEND-ONLY for ordinary writes and the ONLY rewrite path is the documented legacy-header migration; row 6 (concurrency) now explicitly describes the per-file flock that serialises the whole write block; row 8 describes the blocking lock semantics, the never-unlink rule, and the per-writer unique tmp filename. The three rows now agree on a single coherent contract. Test infra: relaxed track_cost mock-open assertions from assert_called_once_with to assert_any_call so the additional open() call on the sidecar lock file does not break unrelated tests. The _WriteLock catches all Exceptions on the open/fcntl path so mock-environment fileno() values don't crash production code (the lock degrades to 'unenforced' which test cases tolerate). All 582 server + budget + track_cost tests pass locally.

…t_budget, ms-precision started_at Twelfth review pass. Four runtime fixes (Findings 1+2+3 plus a precision bug surfaced while testing them) and three regression tests. Finding 1 — fast-exit jobs miss the final cost row A subprocess that writes its final cost row and exits in <2s never gets observed by the daemon thread's next poll (cleanup stops the watcher first), so the cap is silently not enforced. Reproduced with a $5 row under a $1 cap completing as COMPLETED. Fix: Watcher.flush() now returns bool (fired or not) so callers can await the status flip. JobManager._execute_job runs _final_watcher_flush BEFORE the result-handling block: flush synchronously consumes any new bytes, fires on_exceeded if the cap is crossed, and yields to the event loop up to 50 times waiting for _handle_budget_exceeded to set BUDGET_EXCEEDED. The terminal-status guard on the success branch then preserves it (the COMPLETED assignment is skipped). A secondary flush in the finally block remains for the exception path. Regression test seeds the CSV from a custom executor with current timestamp + correct job_id and asserts status reaches BUDGET_EXCEEDED. Finding 2 — /pdd settings reports $0 on uncapped runs Uncapped runs have no daemon watcher, so get_budget had no source of live spend during the run — job.cost is only set on subprocess exit, and the budget store snapshot is only created when a watcher starts. Fix: new module-level helper cost_budget_watcher.read_spent_now() does a one-shot synchronous CSV read using the same filtering logic as Watcher (commands, started_at, job_id). get_budget falls back to it when no watcher exists. /pdd settings on an active uncapped job now reports the actual accumulated spend. Finding 3 — late /pdd budget on fast-exiting uncapped job When update_budget started a watcher for the first time, the daemon thread might not poll before the job exited, so existing rows in the uncapped window were not enforced. Fix: update_budget calls watcher.flush() synchronously after starting or updating, so any pre-update rows trigger the cap immediately if crossed. Regression test pre-writes $5 of spend, applies a $1 cap via update_budget, and asserts BUDGET_EXCEEDED fires. Precision bug (surfaced by Finding 1's test) track_cost writes timestamps at millisecond precision (via isoformat(timespec='milliseconds')). JobManager records job.started_at with microsecond precision (datetime.now). A row written in the SAME millisecond as started_at has a parsed timestamp strictly less than started_at and was silently dropped by the watcher's ts < started_at filter. Fix: _normalize_started_at truncates to millisecond precision so the comparison aligns with the writer's resolution. Documented in the function docstring with the cross-module rationale. All 585 server + budget + track_cost tests pass locally.

…t_budget Thirteenth review pass. Three runtime correctness fixes around the watcher / read paths. Finding 1 — concurrent flush + daemon double-count the same bytes _consume_new_bytes was unlocked, so the daemon thread's poll and an inline flush() could each read from the same _byte_offset, parse the same rows, and each add cost to _spent. Reproduced as a single $5 row inflating to $10 of spend. Fix: lock changes to RLock (reentrant — flush already takes the lock after consume returns), and wrap the entire consume operation in self._lock via a new _consume_new_bytes_locked inner method. Two threads (or daemon+inline) now serialise; a row contributes to _spent exactly once. Regression test races two threads on watcher.flush() and asserts spent==5.0. Finding 2 — read_spent_now spawned a daemon thread per call The prior implementation constructed a real Watcher (which starts a background thread) and then called _consume_new_bytes inline; the two paths raced and double-counted, turning a $20k CSV into a reported $40k. Per-call thread spawn was also pure overhead. Fix: rewrite read_spent_now as a pure function — direct csv.DictReader + filter + sum, no Watcher, no thread, no shared state. Filter semantics (commands set, started_at, job_id when column present) match Watcher's _row_matches exactly so results agree. Two regression tests: (a) 20 sequential reads of a $10 CSV must each return $10 AND create no persistent threads; (b) 100×$5 rows must sum to $500 (a double-count bug yields $1000). Finding 3 — /pdd settings on capped run reports stale spend get_budget called watcher.spent() which returned the daemon's last cached value — up to poll_interval (2s) stale. A row written immediately before /pdd settings would be invisible until the next poll. Fix: get_budget now ALWAYS calls read_spent_now() for the freshest spend, regardless of whether a watcher is running. The watcher's cache is no longer consulted (it cannot be fresher than a direct CSV read). Capped runs and uncapped runs share the same code path. Regression test seeds a $3 row immediately after the watcher starts and asserts /pdd settings reports $3 without waiting for the next poll. All 589 server + budget + track_cost tests pass locally.

…e, job.cost reconciled, perf Fourteenth review pass. Three runtime correctness fixes + a perf improvement (the reviewer's full-CSV-scan concern), each paired with a regression test. Finding 1 — update_budget can return before BUDGET_EXCEEDED is applied update_budget called watcher.flush() which schedules _handle_budget_exceeded via run_coroutine_threadsafe. The caller returned immediately; the executor's exit path then set COMPLETED before the handler ran, and the handler's status-active gate short-circuited the cancel — final status was COMPLETED, no event. Reproduced as: spent $5, cap $1, status COMPLETED. Fix: after flush(), if cumulative spend already crosses the new cap, await up to 50 × 10ms for job.status to enter _TERMINAL_STATUSES. The wait covers BOTH paths — the inline flush fired, AND the watcher daemon thread fired between _start_watcher_for and our flush (in which case flush returns False because _state.fired is already True, but the scheduled handler is still queued). Regression test seeds a $5 row, applies a $1 cap, and asserts status==BUDGET_EXCEEDED immediately after update_budget returns — no extra sleep. Finding 2 — queued job's watcher had no time filter job.started_at is None until the executor runs. _start_watcher_for passed started_at=job.started_at, so the watcher had no lower time bound and counted historical rows in a shared legacy CSV. A day-old $99 row could cancel a queued job under /pdd budget 50 before the job even started. Fix: _start_watcher_for uses job.started_at or datetime.now(timezone.utc) as the baseline. A queued job's watcher starts counting only from the moment the cap was set; once the job actually runs, started_at >= the baseline so no rows are missed. Regression test creates a queued job behind a long-running hold job, points it at a shared legacy CSV with a day-old $99 row under /pdd budget 50, and asserts the queued job is not cancelled by the historical row. Finding 3 — /jobs/{job_id}.cost did not match /pdd settings spend job.cost was only set from the executor's returned dict — often 0 for custom executors that spawn subprocesses they don't track themselves — even though track_cost rows recorded real spend. /jobs/{job_id} reported cost=0; /pdd settings reported real spend; the two endpoints disagreed. Fix: in _execute_job's finally block, sync job.cost to max(job.cost, _compute_csv_spend(job)) BEFORE emit_complete and watcher cleanup. _compute_csv_spend is a new helper that uses the watcher's flush+spent (incremental — cheap) when a watcher exists, falling back to read_spent_now for uncapped runs. Same helper is used by get_budget so the two endpoints can never diverge. Regression test asserts job.cost==5.0 after a custom executor returns {'cost': 0.0} but writes a $5 CSV row. Perf — get_budget no longer always full-scans the CSV Reviewer measured ~0.26s on a 10MB / 200k-row shared CSV. get_budget now uses watcher.flush() + watcher.spent() when a watcher exists (incremental tail — only NEW bytes are parsed, not the whole file). Uncapped runs still use read_spent_now, but their per-job CSVs are bounded by one job's row count. The expensive full scan only happens for explicit shared-CSV uncapped runs. All 592 server + budget + track_cost tests pass locally.

Five findings raised in the post-implementation code review of the GitHub App budget surface. Each is paired with a regression test. Finding 2 — cap-clear via REST not possible `POST /commands/jobs/{job_id}/budget` collapsed "field omitted" and "field explicitly None" onto the same "leave unchanged" behaviour, so a caller (slash-command webhook or programmatic) could never drop a previously-set budget_cap back to "no cap". The prompt's `BudgetStore.update` contract documents this distinction explicitly. Switch the route to forward only fields present in `request.model_fields_set`, and rewrite the `BudgetUpdateRequest` model_validator to enforce "at-least-one-field-set" via `model_fields_set` rather than value-is-None — so an explicit `{"budget_cap": null}` now passes validation and reaches `update_budget` as None (clear), while `{}` still 422s. Finding 3 — parser prompt R6 silent on non-issue rejection The generated parser rejects `/pdd budget node N` and `/pdd budget max N` on non-issue active commands with a message redirecting to `/pdd budget N`, but the prompt's R6 did not document this stricter behaviour. The next `pdd sync` could regenerate a permissive parser and pass review. Update R6 to spell out the rejection plus the rationale (`effective_cap()` ignores node_budget/max_total_cap for non-issue commands; accepting silently would set fields the cap math ignores). Finding 4 — amount validation duplicated three places `validate_amount` (canonical, in `budget_settings.py`) was mirrored line-for-line in two pydantic field validators on `CommandRequest` and `BudgetUpdateRequest`, so changing the ceiling required touching three places. Extract a single module-level `_coerce_budget_amount_value` helper in `models.py` that both pydantic validators delegate to. The helper is kept in `models.py` (not imported from `budget_settings`) to avoid an import cycle — documented in the helper's docstring. Finding 5 — `cancel_job` 200s on a BUDGET_EXCEEDED job The early-409 guard at the cancel route enumerated only `COMPLETED`/`FAILED`/`CANCELLED`, so a cancel posted after the watcher already terminated the job returned 200 with `cancelled=True` — misleading. Add `BUDGET_EXCEEDED` to the guard and document at `JobManager.cancel` that the manager still needs to handle `BUDGET_EXCEEDED` (the budget-handler path calls `cancel()` to actually terminate the subprocess AFTER flipping the status — so the manager's early-exit set intentionally excludes `BUDGET_EXCEEDED`). All 145 budget-control tests, 455 server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three correctness/contract gaps surfaced by a stronger reviewer. Each is paired with a regression test. Finding A — pdd-issue budget_cap was a silent no-op `effective_cap("issue", ...)` IGNORES `budget_cap`; only `node_budget` and `max_total_cap` participate in the issue cap formula. A webhook handler literally forwarding `/pdd budget N` on a pdd-issue job as `{"budget_cap": N}` would therefore set a field the cap math never reads and the watcher would never enforce the requested limit. Reproduced live: `effective_cap("issue", budget_cap=30, node_budget=80, max_total_cap=400, node_count=1)` returns 80.0, not 30. Both routes (`POST /commands/execute` and `POST /commands/jobs/{job_id}/budget`) now re-alias a bare `budget_cap` to `max_total_cap` when the job's command is `issue`. The alias yields to an explicit `max_total_cap` so callers that send both keep the more-specific value. Prompt R3 / R6 references updated in commands_python.prompt; the parser already produces `budget_max_set` for this case, but the routes must finish the alias for callers that bypass the parser. Finding B — daemon-fired race could end the job COMPLETED When the watcher's daemon poll observed the cap crossing but the scheduled `_handle_budget_exceeded` coroutine had not run before the executor exited, `_final_watcher_flush.flush()` returned False (the fire-once guard short-circuits when `_state.fired` is already True). The final-flush helper treated False as "nothing to wait for" and let the executor set `COMPLETED`; the handler then bailed because the job was no longer in QUEUED/RUNNING — final status was COMPLETED, no `budget_exceeded` event emitted. Expose a `Watcher.fired()` signal (watcher-wide "has the cap been crossed?" rather than flush()'s "did THIS call fire?") and use it as a second wait condition in `_final_watcher_flush`. Now the flush either fires inline (returns True) OR sees the daemon already fired (`fired()` True), and in both cases the helper waits up to 50×10ms for terminal status before control returns to the executor. Finding C — runtime APIs not declared in prompt interfaces `Watcher.flush`, `Watcher.fired`, `read_spent_now`, `JobManager.subprocess_env`, `JobManager.update_node_count`, the `node_count` field on `BudgetUpdateRequest`, and the `node_count` kwarg on `update_budget` were all used at runtime but not declared in the prompts' `<pdd-interface>` blocks. A future `pdd sync` / `pdd generate` could legitimately drop any of them — the re-broken surface would re-introduce the race in Finding B AND silently regress the fresh-spend path on `/pdd settings`. Declare each in the matching prompt interface and in architecture.json so the conformance check enforces them on regeneration. Also tighten the R6 contract docs in `commands_python.prompt` for the alias + null-clear semantics (so a regenerated route preserves them). All 152 budget-control + 455 server + 29 track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ands A failed second tracked command sharing one `ctx.obj` with a prior successful command used to read the prior command's `partial_cost` and `last_model` and write them into its own CSV row. Concretely: command A invokes the LLM, `llm_invoke._publish_call_outcome_to_ctx` populates `ctx.obj["partial_cost"] = 7.0` and `ctx.obj["last_model"] = "model-a"`. Command B starts on the same `ctx.obj`, raises before any LLM call, and `track_cost`'s exception-path fallback reads `partial_cost` / `last_model` from `ctx.obj` — which still hold A's values. B's row therefore claims B spent $7.0 on `model-a`, inflating cumulative spend and potentially tripping a budget cap on a command that itself spent nothing. `attempted_models` was already snapshotted/cleared/restored for exactly this reason; extend the same pattern to `partial_cost` and `last_model`. Pop+restore (not just pop) so a nested track_cost invocation inside a parent tracked command preserves the parent's accumulated state on the way back up. Also rewrites the daemon-fired race regression test to use a custom executor that writes its cost row mid-flight (so it passes the watcher's `started_at` filter) and sleeps long enough for the daemon to poll and queue the handler before the executor returns — the prior version of this test pre-seeded the CSV before submit, which `started_at` correctly filtered out so the race window was never exercised. All 609 budget-control + server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three more contract/correctness gaps surfaced by a stronger reviewer on the updated PR head. Each is paired with a regression test. Finding A — non-scalar budget JSON returned HTTP 500 `{"budget_cap": []}` reached `float([])`, which raises `TypeError`, which FastAPI converts to HTTP 500 — a malformed client request looked like a server error. The shared coercion helper now adds an explicit `isinstance(value, (int, float))` branch; non-numeric, non-string types fall to an explicit `ValueError`, which pydantic surfaces as HTTP 422 (the right outcome for invalid request JSON). Finding B — null `budget_cap` on pdd-issue was a silent no-op The route's pdd-issue alias rule only fired when `budget_cap` was non-None, so a caller posting `{"budget_cap": null}` on an issue job intending "clear the alias" left the old `max_total_cap` active and the visible clear was a silent no-op. Aliasing an explicit None too is what makes the clear actually clear: the route now pops `budget_cap` (numeric OR explicit None) into the `max_total_cap` slot whenever the caller did not explicitly set the latter. Finding C — no typed `budget_exceeded` event on the WebSocket `JobManager.callbacks.emit_budget_exceeded` fires when the watcher trips, and `BudgetExceededMessage` exists in the models module, but `create_websocket_routes` only registered `on_output` and `on_complete`. Subscribed clients never received the typed payload — only the subsequent `complete` message, with no way to distinguish budget abort from clean completion. Add `emit_job_budget_exceeded(job, spent, effective_cap)` and register an `on_budget_exceeded` callback that assembles the typed `BudgetExceededMessage` (carrying pdd-issue specifics: node_budget, max_total_cap, node_count) and broadcasts via `manager.broadcast_job_message`. The websocket prompt's `<pdd-interface>` requirements list is extended to lock the new helper + registration in for future regeneration. The route-test fixture's mock models module is updated to expose `BudgetExceededMessage` so the collection-time import in websocket.py resolves. All 620 budget-control + server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three runtime-correctness gaps surfaced by another deeper review pass. Each is paired with a regression test that reproduces the broken behaviour on the prior code. Finding A — orphan watcher when /pdd budget is set on a queued job `update_budget` on a still-queued job with no active watcher (cap was None at submit) calls `_start_watcher_for` to spin one up. When `_execute_job` later runs, it calls `_start_watcher_for` AGAIN. The dict assignment `self._watchers[job.id] = new_watcher` overwrote the prior Watcher without calling its `.stop()` — the previous daemon thread kept polling the CSV, double-counting spend against the new sibling watcher, and could fire `on_exceeded` long after the job had moved past QUEUED/RUNNING (where the handler then silently no-ops). `_start_watcher_for` now calls `_stop_watcher_for(job.id)` at the top so the second call replaces the previous Watcher cleanly. The stop happens BEFORE the cap re-compute so a code path that ends up returning None (cap is now None) still flushes the previous watcher rather than leaking it. Finding B — WebSocket reconnect hangs on BUDGET_EXCEEDED jobs `websocket_job_stream`'s "job already done — send result + close" branch enumerated `[COMPLETED, FAILED, CANCELLED]`. A client reconnecting after the watcher tripped the cap fell through to the input-loop and waited forever on `websocket.receive_text()`. Add `BUDGET_EXCEEDED` to the terminal-state set; the reconnect path now sends the typed `BudgetExceededMessage` FIRST (so the client renders the cap-trip UI without an extra REST round-trip), then the standard `complete` summary, then closes. Finding C — `complete` could be emitted before `budget_exceeded` `_handle_budget_exceeded` awaited `cancel()` BEFORE `emit_budget_exceeded`. cancel() injects CancelledError into the executor task; the resulting finally block emits the `complete` callback. The event loop could schedule the finally-block emit ahead of the still-awaiting `emit_budget_exceeded`, so subscribers that close on `complete` (the common client pattern) missed the typed payload entirely. Swap the order: emit the typed callback first, then cancel. The typed event is now the LAST event observed before `complete`, which is the contract the WebSocket consumer needs. All 623 budget-control + server + track_cost tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three runtime lifecycle gaps from another deep-review pass. Each is paired with a regression test that fails on the prior code. Finding A — watcher orphaned when /pdd budget is set on a queued job and /pdd stop cancels before _execute_job runs `update_budget` starts a Watcher on a still-queued job via the "no watcher running yet" branch. If `/pdd stop` cancels the task before `_execute_job` ever runs, the cleanup site — `_execute_job`'s finally block — never executes and the watcher's daemon thread keeps polling forever. Move watcher cleanup into the `_on_task_done` callback registered by `submit`, which fires unconditionally on task completion (cancelled-while-queued, cancelled-while-running, or normal exit). `_stop_watcher_for` is already idempotent (pop()-based) so the original finally-block call now becomes a no-op when task-done already cleaned up. Finding B — slow subscriber callback delayed subprocess termination `_handle_budget_exceeded` `await`ed `emit_budget_exceeded` BEFORE `cancel()`. A slow `on_budget_exceeded` subscriber (e.g. a stalled WebSocket consumer) blocked the await chain for seconds, during which the subprocess kept running AND spending. Add a synchronous `_signal_cancel(job_id)` helper that sets the cancel event and SIGTERMs the subprocess WITHOUT awaiting anything. The handler now: (1) sets status + updates store; (2) signal-cancels synchronously (subprocess immediately starts dying); (3) awaits emit_budget_exceeded (slow client is fine — the process is already gone); (4) awaits full cancel() for task teardown + SIGKILL escalation. The typed event still beats `complete` to the wire because step 3 runs before step 4 emits `complete` via the executor's finally block. Finding C — reconnect payload aliased effective_cap to job.cost The WebSocket reconnect branch for a BUDGET_EXCEEDED job set `effective_cap=float(job.cost)`, so cap $400 with spend $401.23 reported back `effective_cap=$401.23` — collapsing the two values onto one number. The reconnect client had no way to know what the active cap was at the moment of crossing. Recompute the cap via `budget_settings.effective_cap(job.command, ...)` on the reconnect path; fall back to `job.cost` only when the budget_settings module is unavailable. Also updates `jobs_python.prompt` and `websocket_python.prompt`'s requirements lists to document the three-step termination sequence, the idempotent `_start_watcher_for`/task-done cleanup, and the recompute-effective-cap-on-reconnect contract — so a future `pdd sync` regeneration cannot silently regress any of them. All 627 budget-control + server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three runtime-ordering gaps from another deep review pass. Each is paired with a regression test that fails on the prior code. Finding A — complete could still beat budget_exceeded to the wire Even after the prior pass added `_signal_cancel`, a slow `on_budget_exceeded` subscriber could let `_execute_job.finally` emit `complete` BEFORE `emit_budget_exceeded` finished. The contract that subscribers closing on `complete` still see the typed event was racey. Add a per-job `_budget_broadcast_done: Dict[str, asyncio.Event]` that `_handle_budget_exceeded` exposes BEFORE signalling cancel and sets AFTER awaiting `emit_budget_exceeded`. `_execute_job`'s finally now waits on this event (bounded by a timeout) before calling `emit_complete`, so the typed event always lands first. Critical follow-on: drop the prior `await self.cancel(job_id)` at the end of `_handle_budget_exceeded`. `cancel()` calls `task.cancel()`, which injects `CancelledError` into the executor task while its finally is awaiting the broadcast-done gate — the wait re-raises `CancelledError` and SKIPS `emit_complete`, breaking the new ordering. Subprocess termination is already handled by `_signal_cancel` (SIGTERM) plus `_escalate_kill` (SIGKILL escalation, finding B), so letting the executor exit naturally is both sufficient and necessary. Finding B — SIGKILL escalation delayed by slow callback chain `_signal_cancel` only sent SIGTERM. A subprocess that ignores SIGTERM was only force-killed inside `cancel()`, which used to be awaited AFTER `emit_budget_exceeded`. A slow callback could leave the stubborn child running for the full callback duration. Add a synchronous `_signal_cancel` companion: `_escalate_kill(job_id)`, spawned via `loop.create_task` when a real subprocess is actually running. It polls the process for ~2s and SIGKILLs if still alive. The kill is now independent of any await chain, so spend stops accumulating immediately. Finding C — live WebSocket stream never closes on terminal job `websocket_job_stream`'s input loop only `await`ed `receive_text()`. When the job reached a terminal state the route had no signal to break the receive_text wait — the connection stayed open indefinitely after `complete`. The loop now races `receive_text` against a 250ms job-status poll via `asyncio.wait(..., FIRST_COMPLETED)` and closes the WebSocket when the job becomes terminal. Subscribers no longer leak. The `jobs_python.prompt` requirements list is also updated to document the new termination sequence (5-step) and the explicit "do not call self.cancel() from the budget handler" invariant so the next `pdd sync` cannot regenerate the prior behaviour. All 175 budget-control tests + 91 route tests pass in their canonical subsets. (One pre-existing test-isolation flake between `tests/test_budget_control.py::TestMidFormatCsvMigration` and `tests/server/routes/` collection — documented as a known issue in `tests/server/conftest.py` — is unrelated to this change and reproduces on prior commits when the same combo is run.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed May 21, 2026

View reviewed changes

prompt-driven-github Bot mentioned this pull request May 21, 2026

Add /pdd budget control comments for GitHub App runs (clean Opus rerun) #1128

Open

Serhan-Asad added 3 commits May 22, 2026 11:10

Merge upstream/main into change/issue-1128

d3ba80f

Resolves architecture.json conflict by keeping main's updated description for agentic_checkup_orchestrator while preserving this PR's added module entries for cost_budget_watcher, budget_settings, slash_command_parser, and budget_comments.

Serhan-Asad and others added 19 commits May 22, 2026 11:36

Serhan-Asad and others added 3 commits May 23, 2026 12:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add /pdd budget control comments for GitHub App runs (#1128)#1131

feat: add /pdd budget control comments for GitHub App runs (#1128)#1131
prompt-driven-github[bot] wants to merge 26 commits into
mainfrom
change/issue-1128

prompt-driven-github Bot commented May 21, 2026

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Serhan-Asad commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

prompt-driven-github Bot commented May 21, 2026

Summary

Changes Made

Prompts Modified

Prompts Created

Documentation Updated

Review Checklist

Next Steps After Merge

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented May 22, 2026

Finding 1 — prompt/docs only, no generated .py

Finding 2 — CSV write timing vs "between LLM/tool calls"

Finding 3 — pdd-issue cost attribution

Finding 4 — SlashCommandResult.metadata inconsistency

Finding 5 — auth gate vs /pdd settings

Process

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Finding 1 — prompt/docs only, no generated `.py`

Finding 3 — `pdd-issue` cost attribution

Finding 4 — `SlashCommandResult.metadata` inconsistency

Finding 5 — auth gate vs `/pdd settings`