feat: add /pdd budget control comments for GitHub App runs (#1128)#1131
feat: add /pdd budget control comments for GitHub App runs (#1128)#1131prompt-driven-github[bot] wants to merge 26 commits into
Conversation
Adds the full GitHub App control-comment surface for budget management:
startup settings comments, /pdd budget, /pdd budget node, /pdd budget max,
/pdd settings, /pdd stop, with reactive active-run updates and budget
enforcement for both normal commands and pdd-issue defaults.
Prompt changes (MODIFY):
- prompts/server/jobs_python.prompt: thread budget_cap/node_budget/
max_total_cap/node_count, add update_budget, integrate watcher, add
BUDGET_EXCEEDED status.
- prompts/server/models_python.prompt: BudgetSettings,
BudgetUpdateRequest, BudgetExceededMessage, SlashCommandResult, new
enum value.
- prompts/server/routes/commands_python.prompt: GET/POST
/commands/jobs/{job_id}/budget; defaults wiring for pdd-issue.
- prompts/track_cost_python.prompt: formalize read contract for watchers.
Prompt changes (CREATE):
- prompts/cost_budget_watcher_python.prompt: daemon-thread CSV poller
with update_cap and idempotent stop.
- prompts/server/budget_settings_python.prompt: per-job store +
effective-cap formula + pdd-issue defaults + validators.
- prompts/server/slash_command_parser_python.prompt: pure parser with
fenced-block / bot / dedupe handling + authorization helpers.
- prompts/server/budget_comments_python.prompt: pure Markdown renderers
matching the issue's literal strings.
Docs:
- README.md: new "GitHub App control comments" subsection.
- CHANGELOG.md: Unreleased entry referencing #1128.
Closes #1128
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Resolves architecture.json conflict by keeping main's updated description for agentic_checkup_orchestrator while preserving this PR's added module entries for cost_budget_watcher, budget_settings, slash_command_parser, and budget_comments.
Closes the four spec-level bugs flagged on PR #1131: Finding 2 (enforcement boundary): cost_budget_watcher and the README claimed enforcement happens "between LLM/tool calls", but track_cost only appends a row when a PDD subprocess exits. Documents enforcement as a SUBPROCESS boundary, not a mid-call boundary; explains the pdd-issue case (works well: many nested subprocesses) and the single- command case (cap can be overshot by one subprocess's spend) so readers do not over-claim. Finding 3 (pdd-issue cost attribution): the watcher used to take a single command=str filter. pdd-issue itself never writes a row with command="issue"; it spawns nested PDD subprocesses (change, sync, bug, fix, generate, test, ...) and each writes its own row. Filtering on "issue" summed to $0 and the cap would never fire. Changes the watcher to accept commands: Optional[Iterable[str]] (a set of accepted command names) and documents in jobs_python.prompt which set to pass for pdd-issue vs single-command jobs, with an explicit warning not to reintroduce the {"issue"} bug. Finding 4 (SlashCommandResult.metadata): the parser stored the validated amount in a metadata dict, but the model did not define a metadata field. Pydantic would either reject the extra field or drop the amount. Adds metadata: Dict[str, Any] (default_factory=dict) to SlashCommandResult in models_python.prompt and updates the parser doc to make the contract explicit (always a concrete dict; {"amount": <float>} for budget- mutating kinds, empty dict otherwise). Finding 5 (auth gate inconsistency): parser R5 used to gate ALL /pdd comments by is_authorized before parsing, which contradicted both the README ("other commenters can use /pdd settings") and render_unauthorized ("refer the user to /pdd settings"). Rewrites R5 to parse first, then gate by verb: budget-mutating verbs (budget_*, stop) require is_authorized; the read-only /pdd settings verb is open to anyone whose comment is parsed (subject to bot/fenced/dedupe filters). Updates the README and budget_comments R7 to match. Also reflects the watcher's command -> commands rename and the model's new metadata field in architecture.json so the next pdd sync does not flag drift. Includes a clean merge of upstream/main; resolves the architecture.json description conflict in agentic_checkup_orchestrator. Finding 1 (no generated runtime code) is intentionally NOT addressed in this commit: this PR was scoped as prompt-only with code generation in the original "Next Steps After Merge" plan. See the PR reply for the follow-up plan.
…pted-commands set
The reader-contract narrative said "Watchers filter the CSV by command"
(singular). The watcher's parameter is now commands (a set of accepted
names), so future regenerations of track_cost could bake the singular
assumption back in and silently reintroduce the pdd-issue zero-spend
bug. Documents the set form, and includes an explicit anti-pattern note
for the {"issue"} case.
|
Thanks — all five findings hold up to primary-source verification. Findings 2–5 are now addressed in two commits on this branch (the merge of Finding 1 — prompt/docs only, no generated
|
… budget surface
Implements the four new public modules whose contracts the earlier prompt
edits defined, plus the matching updates to existing server modules and a
focused test suite. Brings the PR from spec-only to a runnable surface that
the private GitHub App webhook handler can wire to GitHub.
New modules
- pdd/cost_budget_watcher.py
Daemon-thread CSV poller. Accepts a set-based commands filter (not a
single name) so pdd-issue runs sum spend across nested subprocess
rows (change/sync/bug/fix/generate/test/...). R1-R7 honored:
fire-once, cap-update fires only on next crossing, no-cap is a
no-op, missing CSV is $0, .stop() is idempotent, never kills the
subprocess itself, tolerant of partial CSV rows.
- pdd/server/budget_settings.py
effective_cap() formula (issue: min(node_budget * max(node_count or
1, 1), max_total_cap); else: budget_cap), pdd_issue_defaults() ->
(80.0, 400.0), validate_amount() with the $10000 hard ceiling, and
a thread-safe BudgetStore.
- pdd/server/budget_comments.py
Pure Markdown renderers matching the issue's acceptance criteria
verbatim (R10 money formatting; $<int> for integers, $<value:.2f>
for fractions, always two decimals for 'Spent so far'). The
render_unauthorized reply explicitly redirects rejected commenters
to /pdd settings so the read-only verb's openness is visible.
- pdd/server/slash_command_parser.py
Pure parser. parse_comment never raises; classifies into
budget_set / budget_node_set / budget_max_set / settings / stop /
invalid / ignored. Sets metadata['amount'] on budget-mutating
kinds; bare /pdd budget N on a pdd-issue job aliases to
budget_max_set per R6. is_authorized is a stateless helper —
authorisation is the webhook handler's job, scoped to mutating
verbs only (R5).
Modified modules
- pdd/server/models.py
Adds JobStatus.BUDGET_EXCEEDED, BudgetSettings, BudgetUpdateRequest
(with field validators that coerce '$30'/'30.00' to float and
enforce > 0 and <= 10000), BudgetExceededMessage (WSMessage variant
emitted exactly once on cap crossing), and SlashCommandResult with
a concrete metadata: Dict[str, Any] field (default_factory=dict) so
Pydantic never drops the parser's amount.
- pdd/server/jobs.py
Job dataclass gets optional budget fields. JobManager learns the
submit budget_cap/node_budget/max_total_cap kwargs, update_budget
(with KeyError/RuntimeError/ValueError exception types as the
public contract), get_budget (snapshot for /pdd settings), and
watcher wiring keyed on resolved cost-CSV path (job.options
output_cost, falling back to PDD_OUTPUT_COST_PATH). On cap
crossing, _handle_budget_exceeded reuses the cancel path that
/pdd stop drives, marks status BUDGET_EXCEEDED, and emits the new
on_budget_exceeded callback. Watcher is stopped in finally so it
never outlives the job.
- pdd/server/routes/commands.py
Threads budget_cap/node_budget/max_total_cap from CommandRequest to
JobManager.submit; applies pdd_issue_defaults() ($80/$400) when
no explicit fields are present on a pdd-issue submission. Adds
GET /commands/jobs/{job_id}/budget (KeyError -> 404) and
POST /commands/jobs/{job_id}/budget (KeyError -> 404,
ValueError -> 400, RuntimeError 'job not active: ...' -> 409),
discriminating on exception type only — never on message text.
Tests
- tests/test_budget_control.py: 73 unit tests covering effective_cap,
validate_amount, BudgetStore, the parser (including Finding 3/4/5
regression guards: filtering on {'issue'} stays $0; metadata.amount
is set on budget kinds; /pdd settings returns kind='settings' with
empty metadata and is NOT gated by auth at parse time), all comment
renderers, and watcher behaviors (fire-once, cap update, no-cap
no-op, started_at filter, missing CSV, idempotent stop).
- tests/test_budget_control_real.py: @pytest.mark.real test that
regenerates the parser via pdd generate and re-asserts the Finding
4 (metadata.amount) and Finding 5 (read-only settings, R6 issue
alias) contracts. Guards against LLM drift on future syncs.
- tests/server/routes/test_commands.py: updated submit assertion to
include the new budget kwargs (test was previously asserting the
pre-budget call signature).
All 73 new tests + 499 existing server tests pass locally; full server
suite is clean. Watcher enforcement is documented as a subprocess
boundary (not 'between LLM/tool calls'), matching the README and
cost_budget_watcher / track_cost prompt updates already on this branch.
…, perf
Addresses follow-up review on the runtime budget surface; each fix ships
with a regression test that asserts the broken case stays broken before
the fix and passes after.
Finding 1 — naive vs aware timestamp comparison breaks enforcement
track_cost writes timestamps via datetime.now().strftime('%Y-%m-%dT...'),
which is NAIVE local time (track_cost.py wraps Click commands; the
reader-contract docstring says UTC but the actual writer is naive). The
watcher's job.started_at is timezone-aware UTC. Comparing the two
raises TypeError, which the watcher caught and swallowed at the poll
level — so every row was skipped, spend stayed at $0, and the cap
never fired in real production CSVs.
Fix: _parse_timestamp coerces naive CSV cells to UTC before comparison
(matching what the reader contract documented all along); _normalize_-
started_at does the same to the caller's started_at so naive callers
also work. Regression test seeds a naive-format row that asserts
spent() > 0.
Finding 2 — BUDGET_EXCEEDED status lost to CANCELLED race
_handle_budget_exceeded used to call cancel() FIRST and then assign
BUDGET_EXCEEDED. cancel() injects asyncio.CancelledError into the
running task; that lands in _execute_job's except handler which set
status=CANCELLED unconditionally. Depending on scheduling, the
CancelledError handler ran AFTER our BUDGET_EXCEEDED assignment and
silently demoted the status. /pdd settings and the final
budget-exceeded comment then lost the actual reason.
Fix: set status=BUDGET_EXCEEDED (and budget store snapshot) BEFORE
calling cancel(). Add a module-level _TERMINAL_STATUSES set and gate
every status assignment in _execute_job, _execute_wrapper, cancel(),
and _on_task_done so they never overwrite a terminal status — so even
if scheduling changes in the future, BUDGET_EXCEEDED is permanent
once set. Async regression test submits a job with a sleeping fake
executor, manually trips _handle_budget_exceeded, waits 500ms for
the cancel race to play out, and asserts status remains
BUDGET_EXCEEDED.
Finding 3 — POST /commands/execute rejects command='issue'
ALLOWED_COMMANDS did not include 'issue', so the validation gate at
the top of execute_command raised HTTP 400 before the
pdd_issue_defaults branch could run. The GitHub App's documented
call path (command='issue' with empty budget fields → defaults to
$80/$400) was unreachable.
Fix: add 'issue' to ALLOWED_COMMANDS with a description that names
the defaults branch. Regression tests assert both (a) command='issue'
with no budget fields applies the defaults and (b) command='issue'
with an explicit node_budget skips them.
Finding 4 — full CSV reread every poll is O(rows × polls)
Old _read_spent walked the entire DictReader every 2s. For a long
pdd-issue run with hundreds of nested-subprocess rows, this is
unbounded disk I/O per watcher.
Fix: rewrite as a true incremental tail. Track _byte_offset of the
first unread byte and a cached _known_size; on each poll, stat the
file, seek to _byte_offset, parse only the new bytes, advance the
offset, and add to a running _spent total. Partial trailing rows
(a row being flushed) are left for the next poll (rfind newline,
leave leftover bytes). Truncation/rotation (file shrinks) resets the
tail state and re-scans from zero so the watcher can never get
stuck pointing past EOF. Regression test patches csv.reader to
count invocations, appends 5 rows over 5 polls, and asserts the
reader call count stays bounded.
Finding 5 (nit) — CHANGELOG drift on enforcement boundary
CHANGELOG still said "enforces the active cap between LLM calls"
even after README/cost_budget_watcher/track_cost prompts had been
updated to say subprocess boundary. Reworded to match.
All 79 budget tests + 544 server/budget tests pass locally; no other
tests touched.
…cap callback
Three follow-up review findings on the runtime budget surface; each fix
ships with a regression test that reproduces the broken behavior on the
prior code and passes after the fix.
Finding 1 — non-UTC workers may still drop current-time rows
track_cost wrote naive local-time timestamps via datetime.now() and
the watcher reinterpreted naive cells as UTC outright. A naive
'11:30 local PDT' row was treated as '11:30 UTC' — which is 7 hours
EARLIER than the job's actual UTC start, so the row was filtered out
by the started_at gate and spend stayed at $0.
Fix: track_cost now writes timezone-aware ISO via
datetime.now(timezone.utc).isoformat(timespec='milliseconds'), so
rows carry an explicit '+00:00' offset and there is nothing to
reinterpret. The watcher's _parse_timestamp uses .astimezone() in
both branches: aware values are converted to UTC for a uniform
comparison frame; legacy naive cells (from older CSVs written before
this fix) are treated as local time and converted to UTC — never
reinterpreted as UTC outright. Existing track_cost regression
patterns updated to allow the new '+00:00' suffix.
Finding 2 — initial budget fields skipped validate_amount
CommandRequest and JobManager.submit accepted any float (negatives,
> $10000, NaN, inf) on initial submission. Only update_budget ran
validate_amount. A submission with budget_cap=-1 stored
effective_cap=-1.0 and behaved as if there were a cap, but the
arithmetic was nonsense.
Fix: add a field_validator on CommandRequest.{budget_cap, node_budget,
max_total_cap} that applies the same > 0 and <= $10000 rule as
BudgetUpdateRequest, and string-form coercion ('$30', '30.00').
JobManager.submit runs validate_amount on each non-None budget kwarg
before constructing the Job, so programmatic callers and tests are
guarded too — not just the REST route.
Finding 3 — budget-exceeded callback reported stale cap after /pdd
budget mid-run
_start_watcher_for closed over the cap value computed at submit and
passed it to _handle_budget_exceeded(job_id, spent, cap). When
update_budget lowered the cap mid-run, the watcher fired at the new
cap but the callback still reported the original.
Fix: drop the closure capture. _handle_budget_exceeded now takes
(job_id, spent) only and recomputes the current effective cap from
the (potentially updated) job.budget fields at fire time, so the
emitted BudgetExceededMessage carries the active cap a /pdd budget
comment would have set. Regression test submits cap=100, lowers it
to 5 via update_budget, trips _handle_budget_exceeded with
spent=10, and asserts the on_budget_exceeded callback receives
cap=5 (not 100).
… auto-wire CSV Third review pass on the runtime budget surface. Each fix has a regression test that reproduces the broken behaviour on the prior code. Finding 1 — /pdd budget node|max accepted for non-issue commands effective_cap() only consults node_budget / max_total_cap when command == "issue"; for any other command they are ignored. The parser nevertheless returned budget_node_set / budget_max_set for any active_command, so a user issuing "/pdd budget node 50" during a pdd-bug job would see the webhook acknowledge the change while the cap silently stayed put. Fix: _parse_budget rejects the node and max verbs with kind="invalid" and a message that points the user at "/pdd budget N" when active_command != "issue". Tests cover the rejection on bug/change/ unknown plus the still-accepted issue path. Pre-existing tests that exercised node/max without an active_command were updated to pass active_command="issue". Finding 2 — node_count not updateable through the public API pdd-issue's effective cap depends on node_count (min(node_budget * max(node_count or 1, 1), max_total_cap)). On a fresh submission node_count is None so effective_cap collapses to node_budget ($80) instead of the intended min($80 * N, $400). With no API path to push the growing node count, the run would stop too early unless the private executor mutated job.node_count out of band. Fix: BudgetUpdateRequest gains node_count: Optional[int] with its own validator (int, >= 0, <= 10000) and a model_validator that keeps the existing "at least one field set" rule covering all four fields. JobManager.update_budget accepts node_count via the _UNSET sentinel pattern (None = clear, _UNSET = leave alone). The /budget POST route threads the field through. A synchronous update_node_count helper is added for the subprocess driver thread (no awaitable round-trip needed). Tests verify the cap grows from 80 -> 240 -> 400 as node_count goes None -> 3 -> 10, and that node_count alone satisfies the "at least one" rule (the private executor pushes it standalone). Finding 3 — capped job with no cost CSV path silently bypasses enforcement _resolve_cost_csv_path returned None whenever options.output_cost and PDD_OUTPUT_COST_PATH were both unset; _start_watcher_for then early- returned without a watcher. The job kept advertising a cap that no one was enforcing. Fix: when an effective cap is active and no explicit CSV path is configured, derive project_root/.pdd/cost-<job_id>.csv, mkdir -p its parent (logging a warning if that fails), and inject the path into job.options["output_cost"] so the subprocess writes to the same file the watcher reads via --output-cost. Uncapped jobs continue to skip the derivation so we don't litter .pdd/ with empty files. Tests verify the capped/uncapped/explicit branches and that the parent directory is created.
… node_count, mkdir for explicit cost CSV
Fourth review pass; each finding paired with a regression test that
reproduces the broken behaviour on the prior code.
Finding 1 — default executor would spawn nonexistent `pdd issue`
ALLOWED_COMMANDS rightly accepts command='issue' so the route can
apply pdd-issue defaults and hand the job to the private GitHub
App's custom JobManager executor. But when JobManager has no
custom executor (i.e. the public default-subprocess path),
_run_click_command would still try to spawn `pdd issue`, which
exits with "No such command 'issue'" because the public Click
CLI has no `issue` subcommand. Operators reading the failure
reasonably conclude the public CLI is broken, when in fact the
job was misrouted.
Fix: in _run_click_command, detect command=='issue' BEFORE
building the subprocess args and raise RuntimeError with a
message that explains custom executors are required for this
command and points at the alternatives (sync/generate/bug/...).
Regression: a JobManager with no executor raises the new clear
error; a JobManager with a custom executor handles issue normally
(the public default path is never reached).
Finding 2 — node_count=3.9 silently truncated to 3
The earlier pydantic validator called int(v) directly, which
truncates 3.9 to 3 without complaint. Same hole in the
programmatic JobManager.update_budget / update_node_count
paths. effective_cap math then ran on the truncated value
without telling the caller — the same class of silent failure
the prior reviewer flagged on the parser's node/max verbs.
Fix: tighten both layers to reject fractional inputs.
BudgetUpdateRequest._coerce_node_count rejects fractional floats
(3.9 -> error) and fractional strings ('3.9' -> error) while
still accepting 3.0 (unambiguously integral, useful for JSON
callers) and '5' (string-int). A module-level
_coerce_node_count_strict helper applies the same rules in
JobManager so programmatic callers and the subprocess driver
thread get identical strictness.
Finding 3 — explicit options.output_cost paths lack parent dir
_resolve_cost_csv_path only mkdir-p'd the parent for derived
default paths (Finding 3 of the previous pass). When the client
passed an explicit options.output_cost like 'custom/cost.csv'
and 'custom/' did not exist, track_cost would attempt to write,
catch the OSError in its own swallow-all-errors block, and the
watcher would silently stay at $0.
Fix: in the explicit-candidate branch of
_resolve_cost_csv_path, also call path.parent.mkdir(parents=True,
exist_ok=True). On OSError, log a clear warning and continue
(the original behaviour) rather than refuse the path entirely
— the caller may have written perms after this point, and we
do not want to hard-fail a job submission over a path we
optimistically created early.
All 110 budget tests + 565 server/track_cost tests pass locally.
… exception
Fifth review pass. Each finding paired with a regression test.
Finding 1 — late /pdd budget cannot enforce on initially-uncapped runs
An uncapped job started without --output-cost; the subprocess
command line was fixed at spawn time. When the user later posted
/pdd budget 30, the watcher started but had no CSV writer, so
spend stayed at $0 forever and the documented "add a cap by
commenting /pdd budget 30" path was unenforceable.
Fix: _resolve_cost_csv_path is now called at submit() time
regardless of whether a cap is currently set. The derived per-job
path is injected into job.options["output_cost"] BEFORE the
subprocess starts, so the subprocess emits --output-cost and
track_cost writes rows during the uncapped window. A late
update_budget then starts the watcher against an already-
populated CSV. Regression test seeds a synthetic row during the
uncapped window, calls update_budget(budget_cap=5), and asserts
the watcher attaches and the snapshot reflects the new cap.
Finding 2 — two same-command jobs sharing a CSV count each other's spend
Watchers filter rows by command + timestamp only — there is no
job_id column. Two pdd-bug jobs that landed in the same CSV
(either via an explicit options.output_cost both clients passed
or via a process-wide PDD_OUTPUT_COST_PATH the parent process
set) would each see all rows and cancel early.
Fix: per-job CSV isolation in two places.
(a) _resolve_cost_csv_path always derives a per-job path under
project_root/.pdd/cost-<job_id>.csv when no explicit per-job
path is supplied on job.options, ignoring the process-wide
PDD_OUTPUT_COST_PATH (which was being treated as a default
that quietly forced all jobs onto one file).
(b) _run_click_command's subprocess env now sets
PDD_OUTPUT_COST_PATH to the per-job derived path, or
explicitly DELETES the inherited env var when no per-job
path was resolved. Both layers (the --output-cost arg AND
the env-var fallback in track_cost) now route to the
per-job file. Regression tests assert two concurrent jobs
get distinct derived paths and that a parent-process
PDD_OUTPUT_COST_PATH does not leak across jobs.
Finding 3 — track_cost skips the row when the wrapped command raises
Old code only wrote the CSV row when exception_raised was None.
A subprocess that called the LLM (spending money) and then
raised before returning recorded nothing — so the watcher's
running spend missed those dollars and the cap could be bypassed
by simply crashing after the LLM call.
Fix: drop the exception_raised gate around the row write. Always
attempt to emit a row in finally; source cost+model from the
return tuple when available, otherwise fall back to whatever
partial state llm_invoke pushed to
ctx.obj['partial_cost' | 'last_model'] before the exception
propagated. Documents the two new ctx.obj contract keys so
future llm_invoke / executor changes can populate them. A
failed command with no partial data still writes a row with
cost=0 so the watcher's command-filtered counts reflect that
the subprocess ran. Regression test invokes a track_cost-
wrapped click command that sets partial_cost=4.25 on ctx.obj
and then raises; asserts the CSV contains the 4.25 row.
…_invoke partial cost
Sixth review pass. Each finding paired with a regression test that
reproduces the broken behaviour on the prior code.
Finding 1 — relative explicit options.output_cost resolved against
server cwd (not project_root)
Server cwd may differ from project_root (the subprocess cwd). A
caller passing options.output_cost="custom/cost.csv" then had the
watcher reading server-cwd/custom/cost.csv while the subprocess
wrote project_root/custom/cost.csv. Spend stayed $0.
Fix: _resolve_cost_csv_path now absolutizes relative explicit paths
against project_root and writes the absolute form back to
job.options["output_cost"] so the --output-cost arg the subprocess
receives is the same absolute path the watcher reads.
Finding 2 — two same-command jobs sharing one CSV count each other's
spend
Watchers filtered rows by command + timestamp only. Two pdd-bug jobs
that both passed the same explicit options.output_cost (or any
shared CSV) each saw the other's rows and could cancel from foreign
spend.
Fix: add a per-job attribution column.
- track_cost reads PDD_JOB_ID from env and writes it as a new
column. Three header generations are now handled by
the writer: legacy (no attempted_models), mid (attempted_models
only), and new (attempted_models + job_id) — existing CSVs are
never rewritten, only appended to in their existing layout.
- cost_budget_watcher.watch() takes an optional job_id parameter.
When set, the watcher counts only rows whose job_id column
matches; legacy rows missing the column are skipped (the
conservative choice — never count rows we cannot attribute when
attribution is required).
- JobManager._run_click_command sets env['PDD_JOB_ID']=job.id; the
watcher receives job_id=job.id at start. Concurrent same-command
jobs sharing a CSV are now per-job-isolated.
Finding 3 — track_cost's exception-path fallback was a no-op in
production
My prior fix made track_cost write a row even when the wrapped
command raised, falling back to ctx.obj['partial_cost' | 'last_model']
for cost/model. But nothing in production populated those keys —
only was published — so failed commands still
recorded cost=0 and budget enforcement still missed the spend.
Fix: add the producer side. New module-level
_publish_call_outcome_to_ctx(cost, model) in llm_invoke; called at
each LLM-call success return (cloud path, responses path, chat path).
partial_cost accumulates across calls within one tracked command;
last_model overwrites with the most recent. Best-effort no-op when
no Click context is active. Regression test invokes the helper
twice and asserts the accumulated + overwrite semantics.
Test patterns updated to absorb the new trailing job_id column on
new-format CSVs while staying compatible with legacy/mid CSVs that do
not carry it.
…om executors, prompts in sync
Seventh review pass. Each runtime fix paired with a regression test;
prompts updated so the next regeneration does not erase the runtime
contracts.
Finding 1 — caller passes a legacy/mid-format CSV; job_id filter
freezes spend at zero
My prior fix made the watcher require row.get('job_id') == self._job_id
unconditionally. For CSVs whose header lacks the job_id column
entirely (legacy or mid-format files the caller explicitly passed via
options.output_cost), every row was dropped and the watcher never
saw spend.
Fix: gate the job_id check on the column actually being present in
the CSV header (self._fieldnames). Legacy/mid-format CSVs fall back
to the command + timestamp filter (the historical enforcement
behaviour), while new-format CSVs keep strict per-job isolation.
Concurrent same-command jobs sharing a NEW-format CSV remain
isolated; jobs sharing a LEGACY-format CSV are not (the caller
opted into the shared file). Regression test seeds a mid-format
CSV with one row and asserts watcher.spent() == 50.
Finding 2 — custom executors bypass _run_click_command, so subprocesses
they spawn never see PDD_JOB_ID
The public JobManager set env['PDD_JOB_ID'] inside _run_click_command
only. The private GitHub App's pdd-issue path uses a custom executor
(JobManager(executor=...)) and never reaches that code, so every
nested PDD subprocess wrote a row with empty job_id and the
watcher (now job_id-filtered for new-format CSVs) skipped them.
Fix: in _execute_job, set os.environ['PDD_JOB_ID']=job.id around the
custom-executor call (try/finally to restore the prior value). Any
subprocess the custom executor spawns inherits PDD_JOB_ID via
process env, so track_cost writes rows with the correct
attribution and the watcher counts them. Documents the
max_concurrent>1 caveat in the integration contract (the os.environ
mutation is best-effort under concurrency; the executor should
also pass PDD_JOB_ID explicitly via subprocess env). Regression
test asserts the env var is set during the custom executor's run
and restored after it returns.
Finding 3 — prompts diverged from runtime contracts (regeneration
would erase the runtime fixes)
Four prompt files updated so the next pdd sync regenerates code
consistent with current behaviour:
- cost_budget_watcher_python.prompt: watch() signature gains
job_id parameter; Inputs and Outputs documents naive-timestamp
.astimezone() interop and the header-gated job_id filter.
- track_cost_python.prompt: CSV Reader Contract documents three
header generations (legacy / mid / new), UTC-aware timestamp
writer, the new job_id column (env-driven from PDD_JOB_ID), and
the always-write-row-on-exception rule with partial_cost /
last_model ctx.obj fallback.
- server/jobs_python.prompt: submit resolves per-job CSV at
submit time regardless of cap; watcher wiring passes
job_id=Job.id; PDD_JOB_ID env propagation for both the default
subprocess path and the custom-executor path is documented.
- llm_invoke_python.prompt: new producer-side requirement to
define _publish_call_outcome_to_ctx(cost, model) and call it
at each successful LLM-call return (cloud / responses / chat),
accumulating partial_cost and overwriting last_model so
track_cost has real data to write on the exception path.
All 573 server + budget + track_cost tests pass locally.
…nc architecture
Eighth review pass. Each runtime fix paired with a regression test; the
spec surface (prompts + architecture.json) updated so regeneration
stays consistent.
Finding 1 — os.environ['PDD_JOB_ID'] around custom executor races
under max_concurrent>1
Prior code did try/finally save+restore of os.environ['PDD_JOB_ID']
around each custom-executor coroutine. With two concurrent jobs A
and B, A's await yielded while B overwrote PDD_JOB_ID; when A
resumed it read B's id, and B's finally restored A's leaked value.
Reproduced locally with the exact pattern the reviewer described.
Fix: remove the os.environ mutation entirely. Add
JobManager.subprocess_env(job, *, base_env=None) -> Dict[str, str]
which returns a per-job env dict carrying PDD_JOB_ID=job.id and the
resolved PDD_OUTPUT_COST_PATH (or explicitly removing any inherited
value). Custom executors MUST pass this dict as env= to
subprocess.Popen / asyncio.create_subprocess_*; the default
subprocess path uses the helper internally so both code paths
share one implementation. Process-global state is never mutated,
so any number of concurrent jobs are isolated by construction.
Regression tests:
- Two concurrent jobs get distinct PDD_JOB_ID values from
subprocess_env (no contamination).
- os.environ['PDD_JOB_ID'] stays None throughout a custom
executor run (the prior mutation is gone).
Finding 2 — shared mid-format CSVs still cannot isolate jobs
track_cost preserved the existing header to honour 'never break
existing files', but that meant rows written to a pre-existing
mid-format CSV (no job_id column) never carried job_id even when
PDD_JOB_ID was set. The watcher's header-gated job_id filter then
fell back to command+timestamp for that file, and two same-command
jobs sharing the file again counted each other's spend.
Fix: when track_cost would append to a mid-format CSV AND
PDD_JOB_ID is set, MIGRATE the file in place — rewrite header to
add the job_id column, backfill empty job_id on each existing row,
atomic os.replace of a temp file. The CSV reader contract permits
this 'legacy-header migration path' explicitly. After migration,
the watcher's strict job_id filter kicks in for that file and
per-job isolation actually works. When PDD_JOB_ID is empty (CLI
use outside the server), no migration runs and the legacy
behaviour is preserved.
Finding 3 — architecture.json had stale watcher signature and
Job metadata
Refresh JobManager interface block:
- watch signature gains job_id=None
- JobManager.update_budget signature includes node_count
- new entries: JobManager.update_node_count,
JobManager.subprocess_env
Re-validates as JSON. Future arch-driven conformance passes will
now agree with the runtime.
All 575 server + budget + track_cost tests pass locally.
…cy CSV migration, watcher inode detection Ninth review pass. Each runtime fix paired with a regression test. Finding 1 — legacy custom executors that do not call subprocess_env lose job_id attribution on spawned subprocesses Removing the os.environ mutation entirely (8th pass) made concurrency safe but broke any private executor that did not yet know to call JobManager.subprocess_env(job). Child PDD commands wrote empty job_id, the watcher skipped them, and a $10 spend under a $5 cap completed normally. Fix: re-introduce os.environ['PDD_JOB_ID']=job.id as a SAFETY NET but only under max_concurrent==1 (sequential execution → no race). Save the prior value and restore in finally so the env never leaks past this job's execution. Under max_concurrent>1 the manager still does not touch os.environ; the executor MUST call subprocess_env() and pass the result as env= to each spawned subprocess. Regression tests cover both paths: the safety net is present under max_concurrent=1 and absent under max_concurrent>1. Finding 2 — legacy-format CSV (no attempted_models, no job_id) is not migrated; jobs sharing it still cross-count The 8th-pass mid → new migration handled the mid-format header case but left the oldest layout untouched. Two same-command jobs explicitly sharing a legacy CSV produced rows without job_id, the watcher fell back to command+timestamp, and one counted the other's spend ($30 visible to job-a from $10 + $20). Fix: add _migrate_legacy_to_new_header that rewrites the file in place adding BOTH attempted_models and job_id columns (existing rows get empty values for both). Triggered when track_cost would append to a legacy CSV AND PDD_JOB_ID is set. After migration the watcher's strict job_id filter activates and shared legacy files gain real per-job isolation. When PDD_JOB_ID is empty (CLI use outside the server), no migration runs and the legacy warn-and- append path is preserved. Finding 3 — watcher does not detect track_cost's in-place migration (new inode at same path) track_cost's migration uses os.replace, which yields a new inode at the same path. The watcher's cached _byte_offset and _fieldnames remained based on the pre-migration file, so it never re-parsed the new header and the job_id filter never activated after migration — job-a still saw job-b's spend on the same CSV. Fix: track _known_inode in the watcher. On each poll, stat the file and compare st_ino against the cached value; on mismatch call _reset_tail_state() to discard the stale offset and header cache so the new header (which now carries job_id) is re-parsed on the next poll. Regression test seeds a mid-format CSV with a job-a row, lets the watcher see it under legacy fallback, then runs the migration helper and appends a job-b row; asserts the post-migration watcher's strict filter yields $0 (job-a's migrated row has empty job_id, job-b's row matches the wrong job_id). Test infra: added an autouse fixture in test_budget_control.py that saves/restores os.environ['PDD_JOB_ID'] around each test so the safety net's intentional persistence does not contaminate sibling test files that assume the env is clean. All 578 server + budget + track_cost tests pass locally.
…mpts current
Tenth review pass. Three runtime fixes paired with regression tests, plus
the three prompt files re-synced so the next regeneration does not
erase any of this pass's fixes.
Finding 1 — safety net set PDD_JOB_ID but not PDD_OUTPUT_COST_PATH
Legacy custom executors that do not call subprocess_env() relied on
the safety net for env propagation. The prior safety net set only
PDD_JOB_ID; the child subprocess had no cost-CSV path so track_cost
wrote nothing, the watcher saw $0, and a capped run completed
without enforcement.
Fix: when max_concurrent==1, set BOTH PDD_JOB_ID and
PDD_OUTPUT_COST_PATH (or explicitly remove the latter if no per-job
CSV was resolved). Save and restore both in finally so the env is
never leaked past this job. Regression test asserts the executor
sees BOTH vars and that both are restored after the run.
Finding 2 — concurrent migrations race on shared .migrate.tmp
Both migration helpers used a single hard-coded .migrate.tmp path
with no lock. Two processes simultaneously appending to the same
legacy/mid CSV could each read, each write their own tmp version,
and the second writer's os.replace would clobber a row the first
writer appended between the migration's read and replace.
Fix:
- per-writer unique tmp filename: <csv>.migrate.tmp.<pid>.<uuid>
(new helper _unique_tmp_path) so two concurrent migrations
cannot collide on the same tmp file.
- POSIX fcntl.flock on a sidecar <csv>.migrate.lock around the
whole migration (new _MigrationLock context manager). The
second concurrent attempt receives a non-blocking lock failure,
returns False, and the helper skips migration. The calling
track_cost re-reads the header AFTER the migration call to
detect the lock-contention skip and writes its row in whatever
format the file is currently in — never with the new
fieldnames against an unmigrated header.
- graceful no-op when fcntl is unavailable (non-POSIX hosts):
the lock returns False and we fall back to writing in the
existing format. No corruption, slightly degraded isolation.
Regression tests cover _unique_tmp_path uniqueness and
in-process flock serialization (one thread acquires, the other
is blocked).
Finding 3 — prompts diverged from runtime contracts again
After the prior pass added the safety net and migration paths,
the prompts still said the manager 'never mutates os.environ',
legacy/mid CSVs 'are never rewritten', and legacy shared CSVs
'are not isolated'. Regeneration would erase the runtime fixes.
Fix:
- server/jobs_python.prompt: documents the two-path env story
— primary subprocess_env() contract (concurrency-safe) and
the max_concurrent==1 os.environ safety net for legacy
executors (with try/finally restore semantics).
- track_cost_python.prompt: explains the legacy → new and
mid → new migration paths, the unique per-writer tmp
filename, the fcntl.flock on the sidecar .migrate.lock,
and the lock-contention skip-and-re-read pattern.
- cost_budget_watcher_python.prompt: explains the inode-
change detection required to pick up post-migration
headers and activate the strict job_id filter.
Performance note (reviewer): legacy/mid migration is O(file_size)
on first server-managed write to a pre-existing CSV. This is a
one-time cost — subsequent writes append in new format. Acceptable.
All 581 server + budget + track_cost tests pass locally.
…inked, prompt updated Eleventh review pass. Three runtime fixes + prompt re-sync to match. Finding 1 — concurrent append during migration loses rows The prior LOCK_NB skip-and-fall-back let one writer fall through to a legacy append while the other writer's migration os.replaced the file with a pre-append snapshot, silently deleting the contender's row. Reproduced in the wild: before_replace contained model_b, after_replace did not. Fix: replace _MigrationLock with _WriteLock that wraps the ENTIRE write block (read header, maybe migrate, append row) and uses a BLOCKING fcntl.flock(LOCK_EX). Contenders wait for the holder to release, then proceed with the post-migration header visible. The migration helpers themselves no longer take the lock — the caller (track_cost wrapper) holds it for the whole block. The helpers' docstrings call out the caller-must-hold invariant. Backwards-compat _MigrationLock alias retained. Finding 2 — unlinking the sidecar lock file allowed inode reuse The prior __exit__ unlinked <csv>.migrate.lock as cleanup. A later process opening the same path created a NEW inode, while the previous holder was still closing the OLD inode; two processes ended up with exclusive locks on different inodes for the 'same' lock path. Fix: do NOT unlink the lock file. The sidecar is tiny and persistent; the inode stays stable across all callers as long as nobody removes it. Class docstring documents the trade-off. Finding 3 — track_cost prompt contradicted itself on rewrites Row 5 said 'never truncated or rewritten in-place outside the legacy-header migration path' while row 8 described the migration as a rewrite, leaving the contract ambiguous to a regeneration pass. Fix: rewrite row 5 to say the file is APPEND-ONLY for ordinary writes and the ONLY rewrite path is the documented legacy-header migration; row 6 (concurrency) now explicitly describes the per-file flock that serialises the whole write block; row 8 describes the blocking lock semantics, the never-unlink rule, and the per-writer unique tmp filename. The three rows now agree on a single coherent contract. Test infra: relaxed track_cost mock-open assertions from assert_called_once_with to assert_any_call so the additional open() call on the sidecar lock file does not break unrelated tests. The _WriteLock catches all Exceptions on the open/fcntl path so mock-environment fileno() values don't crash production code (the lock degrades to 'unenforced' which test cases tolerate). All 582 server + budget + track_cost tests pass locally.
…t_budget, ms-precision started_at Twelfth review pass. Four runtime fixes (Findings 1+2+3 plus a precision bug surfaced while testing them) and three regression tests. Finding 1 — fast-exit jobs miss the final cost row A subprocess that writes its final cost row and exits in <2s never gets observed by the daemon thread's next poll (cleanup stops the watcher first), so the cap is silently not enforced. Reproduced with a $5 row under a $1 cap completing as COMPLETED. Fix: Watcher.flush() now returns bool (fired or not) so callers can await the status flip. JobManager._execute_job runs _final_watcher_flush BEFORE the result-handling block: flush synchronously consumes any new bytes, fires on_exceeded if the cap is crossed, and yields to the event loop up to 50 times waiting for _handle_budget_exceeded to set BUDGET_EXCEEDED. The terminal-status guard on the success branch then preserves it (the COMPLETED assignment is skipped). A secondary flush in the finally block remains for the exception path. Regression test seeds the CSV from a custom executor with current timestamp + correct job_id and asserts status reaches BUDGET_EXCEEDED. Finding 2 — /pdd settings reports $0 on uncapped runs Uncapped runs have no daemon watcher, so get_budget had no source of live spend during the run — job.cost is only set on subprocess exit, and the budget store snapshot is only created when a watcher starts. Fix: new module-level helper cost_budget_watcher.read_spent_now() does a one-shot synchronous CSV read using the same filtering logic as Watcher (commands, started_at, job_id). get_budget falls back to it when no watcher exists. /pdd settings on an active uncapped job now reports the actual accumulated spend. Finding 3 — late /pdd budget on fast-exiting uncapped job When update_budget started a watcher for the first time, the daemon thread might not poll before the job exited, so existing rows in the uncapped window were not enforced. Fix: update_budget calls watcher.flush() synchronously after starting or updating, so any pre-update rows trigger the cap immediately if crossed. Regression test pre-writes $5 of spend, applies a $1 cap via update_budget, and asserts BUDGET_EXCEEDED fires. Precision bug (surfaced by Finding 1's test) track_cost writes timestamps at millisecond precision (via isoformat(timespec='milliseconds')). JobManager records job.started_at with microsecond precision (datetime.now). A row written in the SAME millisecond as started_at has a parsed timestamp strictly less than started_at and was silently dropped by the watcher's ts < started_at filter. Fix: _normalize_started_at truncates to millisecond precision so the comparison aligns with the writer's resolution. Documented in the function docstring with the cross-module rationale. All 585 server + budget + track_cost tests pass locally.
…t_budget Thirteenth review pass. Three runtime correctness fixes around the watcher / read paths. Finding 1 — concurrent flush + daemon double-count the same bytes _consume_new_bytes was unlocked, so the daemon thread's poll and an inline flush() could each read from the same _byte_offset, parse the same rows, and each add cost to _spent. Reproduced as a single $5 row inflating to $10 of spend. Fix: lock changes to RLock (reentrant — flush already takes the lock after consume returns), and wrap the entire consume operation in self._lock via a new _consume_new_bytes_locked inner method. Two threads (or daemon+inline) now serialise; a row contributes to _spent exactly once. Regression test races two threads on watcher.flush() and asserts spent==5.0. Finding 2 — read_spent_now spawned a daemon thread per call The prior implementation constructed a real Watcher (which starts a background thread) and then called _consume_new_bytes inline; the two paths raced and double-counted, turning a $20k CSV into a reported $40k. Per-call thread spawn was also pure overhead. Fix: rewrite read_spent_now as a pure function — direct csv.DictReader + filter + sum, no Watcher, no thread, no shared state. Filter semantics (commands set, started_at, job_id when column present) match Watcher's _row_matches exactly so results agree. Two regression tests: (a) 20 sequential reads of a $10 CSV must each return $10 AND create no persistent threads; (b) 100×$5 rows must sum to $500 (a double-count bug yields $1000). Finding 3 — /pdd settings on capped run reports stale spend get_budget called watcher.spent() which returned the daemon's last cached value — up to poll_interval (2s) stale. A row written immediately before /pdd settings would be invisible until the next poll. Fix: get_budget now ALWAYS calls read_spent_now() for the freshest spend, regardless of whether a watcher is running. The watcher's cache is no longer consulted (it cannot be fresher than a direct CSV read). Capped runs and uncapped runs share the same code path. Regression test seeds a $3 row immediately after the watcher starts and asserts /pdd settings reports $3 without waiting for the next poll. All 589 server + budget + track_cost tests pass locally.
…e, job.cost reconciled, perf
Fourteenth review pass. Three runtime correctness fixes + a perf
improvement (the reviewer's full-CSV-scan concern), each paired with a
regression test.
Finding 1 — update_budget can return before BUDGET_EXCEEDED is applied
update_budget called watcher.flush() which schedules
_handle_budget_exceeded via run_coroutine_threadsafe. The caller
returned immediately; the executor's exit path then set COMPLETED
before the handler ran, and the handler's status-active gate
short-circuited the cancel — final status was COMPLETED, no event.
Reproduced as: spent $5, cap $1, status COMPLETED.
Fix: after flush(), if cumulative spend already crosses the new
cap, await up to 50 × 10ms for job.status to enter
_TERMINAL_STATUSES. The wait covers BOTH paths — the inline
flush fired, AND the watcher daemon thread fired between
_start_watcher_for and our flush (in which case flush returns
False because _state.fired is already True, but the scheduled
handler is still queued). Regression test seeds a $5 row,
applies a $1 cap, and asserts status==BUDGET_EXCEEDED
immediately after update_budget returns — no extra sleep.
Finding 2 — queued job's watcher had no time filter
job.started_at is None until the executor runs. _start_watcher_for
passed started_at=job.started_at, so the watcher had no lower
time bound and counted historical rows in a shared legacy CSV.
A day-old $99 row could cancel a queued job under /pdd budget 50
before the job even started.
Fix: _start_watcher_for uses
job.started_at or datetime.now(timezone.utc) as the baseline. A
queued job's watcher starts counting only from the moment the
cap was set; once the job actually runs, started_at >= the
baseline so no rows are missed. Regression test creates a
queued job behind a long-running hold job, points it at a
shared legacy CSV with a day-old $99 row under /pdd budget 50,
and asserts the queued job is not cancelled by the historical
row.
Finding 3 — /jobs/{job_id}.cost did not match /pdd settings spend
job.cost was only set from the executor's returned dict — often
0 for custom executors that spawn subprocesses they don't track
themselves — even though track_cost rows recorded real spend.
/jobs/{job_id} reported cost=0; /pdd settings reported real
spend; the two endpoints disagreed.
Fix: in _execute_job's finally block, sync job.cost to
max(job.cost, _compute_csv_spend(job)) BEFORE emit_complete and
watcher cleanup. _compute_csv_spend is a new helper that uses
the watcher's flush+spent (incremental — cheap) when a watcher
exists, falling back to read_spent_now for uncapped runs.
Same helper is used by get_budget so the two endpoints can
never diverge. Regression test asserts job.cost==5.0 after a
custom executor returns {'cost': 0.0} but writes a $5 CSV row.
Perf — get_budget no longer always full-scans the CSV
Reviewer measured ~0.26s on a 10MB / 200k-row shared CSV.
get_budget now uses watcher.flush() + watcher.spent() when a
watcher exists (incremental tail — only NEW bytes are parsed,
not the whole file). Uncapped runs still use read_spent_now,
but their per-job CSVs are bounded by one job's row count. The
expensive full scan only happens for explicit shared-CSV
uncapped runs.
All 592 server + budget + track_cost tests pass locally.
Five findings raised in the post-implementation code review of the
GitHub App budget surface. Each is paired with a regression test.
Finding 2 — cap-clear via REST not possible
`POST /commands/jobs/{job_id}/budget` collapsed "field omitted"
and "field explicitly None" onto the same "leave unchanged"
behaviour, so a caller (slash-command webhook or programmatic)
could never drop a previously-set budget_cap back to "no cap".
The prompt's `BudgetStore.update` contract documents this
distinction explicitly. Switch the route to forward only fields
present in `request.model_fields_set`, and rewrite the
`BudgetUpdateRequest` model_validator to enforce
"at-least-one-field-set" via `model_fields_set` rather than
value-is-None — so an explicit `{"budget_cap": null}` now
passes validation and reaches `update_budget` as None (clear),
while `{}` still 422s.
Finding 3 — parser prompt R6 silent on non-issue rejection
The generated parser rejects `/pdd budget node N` and
`/pdd budget max N` on non-issue active commands with a
message redirecting to `/pdd budget N`, but the prompt's R6
did not document this stricter behaviour. The next `pdd sync`
could regenerate a permissive parser and pass review. Update
R6 to spell out the rejection plus the rationale
(`effective_cap()` ignores node_budget/max_total_cap for
non-issue commands; accepting silently would set fields the
cap math ignores).
Finding 4 — amount validation duplicated three places
`validate_amount` (canonical, in `budget_settings.py`) was
mirrored line-for-line in two pydantic field validators on
`CommandRequest` and `BudgetUpdateRequest`, so changing the
ceiling required touching three places. Extract a single
module-level `_coerce_budget_amount_value` helper in
`models.py` that both pydantic validators delegate to. The
helper is kept in `models.py` (not imported from
`budget_settings`) to avoid an import cycle — documented in
the helper's docstring.
Finding 5 — `cancel_job` 200s on a BUDGET_EXCEEDED job
The early-409 guard at the cancel route enumerated only
`COMPLETED`/`FAILED`/`CANCELLED`, so a cancel posted after
the watcher already terminated the job returned 200 with
`cancelled=True` — misleading. Add `BUDGET_EXCEEDED` to the
guard and document at `JobManager.cancel` that the manager
still needs to handle `BUDGET_EXCEEDED` (the budget-handler
path calls `cancel()` to actually terminate the subprocess
AFTER flipping the status — so the manager's early-exit set
intentionally excludes `BUDGET_EXCEEDED`).
All 145 budget-control tests, 455 server + track_cost tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three correctness/contract gaps surfaced by a stronger reviewer.
Each is paired with a regression test.
Finding A — pdd-issue budget_cap was a silent no-op
`effective_cap("issue", ...)` IGNORES `budget_cap`; only
`node_budget` and `max_total_cap` participate in the issue
cap formula. A webhook handler literally forwarding
`/pdd budget N` on a pdd-issue job as `{"budget_cap": N}`
would therefore set a field the cap math never reads and the
watcher would never enforce the requested limit. Reproduced
live: `effective_cap("issue", budget_cap=30, node_budget=80,
max_total_cap=400, node_count=1)` returns 80.0, not 30.
Both routes (`POST /commands/execute` and
`POST /commands/jobs/{job_id}/budget`) now re-alias a bare
`budget_cap` to `max_total_cap` when the job's command is
`issue`. The alias yields to an explicit `max_total_cap` so
callers that send both keep the more-specific value. Prompt
R3 / R6 references updated in commands_python.prompt; the
parser already produces `budget_max_set` for this case, but
the routes must finish the alias for callers that bypass the
parser.
Finding B — daemon-fired race could end the job COMPLETED
When the watcher's daemon poll observed the cap crossing but
the scheduled `_handle_budget_exceeded` coroutine had not run
before the executor exited, `_final_watcher_flush.flush()`
returned False (the fire-once guard short-circuits when
`_state.fired` is already True). The final-flush helper
treated False as "nothing to wait for" and let the executor
set `COMPLETED`; the handler then bailed because the job
was no longer in QUEUED/RUNNING — final status was
COMPLETED, no `budget_exceeded` event emitted.
Expose a `Watcher.fired()` signal (watcher-wide "has the cap
been crossed?" rather than flush()'s "did THIS call fire?")
and use it as a second wait condition in `_final_watcher_flush`.
Now the flush either fires inline (returns True) OR sees the
daemon already fired (`fired()` True), and in both cases the
helper waits up to 50×10ms for terminal status before
control returns to the executor.
Finding C — runtime APIs not declared in prompt interfaces
`Watcher.flush`, `Watcher.fired`, `read_spent_now`,
`JobManager.subprocess_env`, `JobManager.update_node_count`,
the `node_count` field on `BudgetUpdateRequest`, and the
`node_count` kwarg on `update_budget` were all used at
runtime but not declared in the prompts'
`<pdd-interface>` blocks. A future `pdd sync` /
`pdd generate` could legitimately drop any of them — the
re-broken surface would re-introduce the race in Finding B
AND silently regress the fresh-spend path on `/pdd settings`.
Declare each in the matching prompt interface and in
architecture.json so the conformance check enforces them
on regeneration. Also tighten the R6 contract docs in
`commands_python.prompt` for the alias + null-clear
semantics (so a regenerated route preserves them).
All 152 budget-control + 455 server + 29 track_cost tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ands A failed second tracked command sharing one `ctx.obj` with a prior successful command used to read the prior command's `partial_cost` and `last_model` and write them into its own CSV row. Concretely: command A invokes the LLM, `llm_invoke._publish_call_outcome_to_ctx` populates `ctx.obj["partial_cost"] = 7.0` and `ctx.obj["last_model"] = "model-a"`. Command B starts on the same `ctx.obj`, raises before any LLM call, and `track_cost`'s exception-path fallback reads `partial_cost` / `last_model` from `ctx.obj` — which still hold A's values. B's row therefore claims B spent $7.0 on `model-a`, inflating cumulative spend and potentially tripping a budget cap on a command that itself spent nothing. `attempted_models` was already snapshotted/cleared/restored for exactly this reason; extend the same pattern to `partial_cost` and `last_model`. Pop+restore (not just pop) so a nested track_cost invocation inside a parent tracked command preserves the parent's accumulated state on the way back up. Also rewrites the daemon-fired race regression test to use a custom executor that writes its cost row mid-flight (so it passes the watcher's `started_at` filter) and sleeps long enough for the daemon to poll and queue the handler before the executor returns — the prior version of this test pre-seeded the CSV before submit, which `started_at` correctly filtered out so the race window was never exercised. All 609 budget-control + server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three more contract/correctness gaps surfaced by a stronger reviewer
on the updated PR head. Each is paired with a regression test.
Finding A — non-scalar budget JSON returned HTTP 500
`{"budget_cap": []}` reached `float([])`, which raises
`TypeError`, which FastAPI converts to HTTP 500 — a malformed
client request looked like a server error. The shared coercion
helper now adds an explicit `isinstance(value, (int, float))`
branch; non-numeric, non-string types fall to an explicit
`ValueError`, which pydantic surfaces as HTTP 422 (the right
outcome for invalid request JSON).
Finding B — null `budget_cap` on pdd-issue was a silent no-op
The route's pdd-issue alias rule only fired when
`budget_cap` was non-None, so a caller posting
`{"budget_cap": null}` on an issue job intending "clear the
alias" left the old `max_total_cap` active and the visible
clear was a silent no-op. Aliasing an explicit None too is
what makes the clear actually clear: the route now pops
`budget_cap` (numeric OR explicit None) into the `max_total_cap`
slot whenever the caller did not explicitly set the latter.
Finding C — no typed `budget_exceeded` event on the WebSocket
`JobManager.callbacks.emit_budget_exceeded` fires when the
watcher trips, and `BudgetExceededMessage` exists in the
models module, but `create_websocket_routes` only registered
`on_output` and `on_complete`. Subscribed clients never
received the typed payload — only the subsequent `complete`
message, with no way to distinguish budget abort from clean
completion. Add `emit_job_budget_exceeded(job, spent,
effective_cap)` and register an `on_budget_exceeded` callback
that assembles the typed `BudgetExceededMessage` (carrying
pdd-issue specifics: node_budget, max_total_cap, node_count)
and broadcasts via `manager.broadcast_job_message`. The
websocket prompt's `<pdd-interface>` requirements list is
extended to lock the new helper + registration in for future
regeneration. The route-test fixture's mock models module
is updated to expose `BudgetExceededMessage` so the
collection-time import in websocket.py resolves.
All 620 budget-control + server + track_cost tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three runtime-correctness gaps surfaced by another deeper review pass. Each is paired with a regression test that reproduces the broken behaviour on the prior code. Finding A — orphan watcher when /pdd budget is set on a queued job `update_budget` on a still-queued job with no active watcher (cap was None at submit) calls `_start_watcher_for` to spin one up. When `_execute_job` later runs, it calls `_start_watcher_for` AGAIN. The dict assignment `self._watchers[job.id] = new_watcher` overwrote the prior Watcher without calling its `.stop()` — the previous daemon thread kept polling the CSV, double-counting spend against the new sibling watcher, and could fire `on_exceeded` long after the job had moved past QUEUED/RUNNING (where the handler then silently no-ops). `_start_watcher_for` now calls `_stop_watcher_for(job.id)` at the top so the second call replaces the previous Watcher cleanly. The stop happens BEFORE the cap re-compute so a code path that ends up returning None (cap is now None) still flushes the previous watcher rather than leaking it. Finding B — WebSocket reconnect hangs on BUDGET_EXCEEDED jobs `websocket_job_stream`'s "job already done — send result + close" branch enumerated `[COMPLETED, FAILED, CANCELLED]`. A client reconnecting after the watcher tripped the cap fell through to the input-loop and waited forever on `websocket.receive_text()`. Add `BUDGET_EXCEEDED` to the terminal-state set; the reconnect path now sends the typed `BudgetExceededMessage` FIRST (so the client renders the cap-trip UI without an extra REST round-trip), then the standard `complete` summary, then closes. Finding C — `complete` could be emitted before `budget_exceeded` `_handle_budget_exceeded` awaited `cancel()` BEFORE `emit_budget_exceeded`. cancel() injects CancelledError into the executor task; the resulting finally block emits the `complete` callback. The event loop could schedule the finally-block emit ahead of the still-awaiting `emit_budget_exceeded`, so subscribers that close on `complete` (the common client pattern) missed the typed payload entirely. Swap the order: emit the typed callback first, then cancel. The typed event is now the LAST event observed before `complete`, which is the contract the WebSocket consumer needs. All 623 budget-control + server + track_cost tests pass locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three runtime lifecycle gaps from another deep-review pass. Each is paired with a regression test that fails on the prior code. Finding A — watcher orphaned when /pdd budget is set on a queued job and /pdd stop cancels before _execute_job runs `update_budget` starts a Watcher on a still-queued job via the "no watcher running yet" branch. If `/pdd stop` cancels the task before `_execute_job` ever runs, the cleanup site — `_execute_job`'s finally block — never executes and the watcher's daemon thread keeps polling forever. Move watcher cleanup into the `_on_task_done` callback registered by `submit`, which fires unconditionally on task completion (cancelled-while-queued, cancelled-while-running, or normal exit). `_stop_watcher_for` is already idempotent (pop()-based) so the original finally-block call now becomes a no-op when task-done already cleaned up. Finding B — slow subscriber callback delayed subprocess termination `_handle_budget_exceeded` `await`ed `emit_budget_exceeded` BEFORE `cancel()`. A slow `on_budget_exceeded` subscriber (e.g. a stalled WebSocket consumer) blocked the await chain for seconds, during which the subprocess kept running AND spending. Add a synchronous `_signal_cancel(job_id)` helper that sets the cancel event and SIGTERMs the subprocess WITHOUT awaiting anything. The handler now: (1) sets status + updates store; (2) signal-cancels synchronously (subprocess immediately starts dying); (3) awaits emit_budget_exceeded (slow client is fine — the process is already gone); (4) awaits full cancel() for task teardown + SIGKILL escalation. The typed event still beats `complete` to the wire because step 3 runs before step 4 emits `complete` via the executor's finally block. Finding C — reconnect payload aliased effective_cap to job.cost The WebSocket reconnect branch for a BUDGET_EXCEEDED job set `effective_cap=float(job.cost)`, so cap $400 with spend $401.23 reported back `effective_cap=$401.23` — collapsing the two values onto one number. The reconnect client had no way to know what the active cap was at the moment of crossing. Recompute the cap via `budget_settings.effective_cap(job.command, ...)` on the reconnect path; fall back to `job.cost` only when the budget_settings module is unavailable. Also updates `jobs_python.prompt` and `websocket_python.prompt`'s requirements lists to document the three-step termination sequence, the idempotent `_start_watcher_for`/task-done cleanup, and the recompute-effective-cap-on-reconnect contract — so a future `pdd sync` regeneration cannot silently regress any of them. All 627 budget-control + server + track_cost tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three runtime-ordering gaps from another deep review pass. Each is paired with a regression test that fails on the prior code. Finding A — complete could still beat budget_exceeded to the wire Even after the prior pass added `_signal_cancel`, a slow `on_budget_exceeded` subscriber could let `_execute_job.finally` emit `complete` BEFORE `emit_budget_exceeded` finished. The contract that subscribers closing on `complete` still see the typed event was racey. Add a per-job `_budget_broadcast_done: Dict[str, asyncio.Event]` that `_handle_budget_exceeded` exposes BEFORE signalling cancel and sets AFTER awaiting `emit_budget_exceeded`. `_execute_job`'s finally now waits on this event (bounded by a timeout) before calling `emit_complete`, so the typed event always lands first. Critical follow-on: drop the prior `await self.cancel(job_id)` at the end of `_handle_budget_exceeded`. `cancel()` calls `task.cancel()`, which injects `CancelledError` into the executor task while its finally is awaiting the broadcast-done gate — the wait re-raises `CancelledError` and SKIPS `emit_complete`, breaking the new ordering. Subprocess termination is already handled by `_signal_cancel` (SIGTERM) plus `_escalate_kill` (SIGKILL escalation, finding B), so letting the executor exit naturally is both sufficient and necessary. Finding B — SIGKILL escalation delayed by slow callback chain `_signal_cancel` only sent SIGTERM. A subprocess that ignores SIGTERM was only force-killed inside `cancel()`, which used to be awaited AFTER `emit_budget_exceeded`. A slow callback could leave the stubborn child running for the full callback duration. Add a synchronous `_signal_cancel` companion: `_escalate_kill(job_id)`, spawned via `loop.create_task` when a real subprocess is actually running. It polls the process for ~2s and SIGKILLs if still alive. The kill is now independent of any await chain, so spend stops accumulating immediately. Finding C — live WebSocket stream never closes on terminal job `websocket_job_stream`'s input loop only `await`ed `receive_text()`. When the job reached a terminal state the route had no signal to break the receive_text wait — the connection stayed open indefinitely after `complete`. The loop now races `receive_text` against a 250ms job-status poll via `asyncio.wait(..., FIRST_COMPLETED)` and closes the WebSocket when the job becomes terminal. Subscribers no longer leak. The `jobs_python.prompt` requirements list is also updated to document the new termination sequence (5-step) and the explicit "do not call self.cancel() from the budget handler" invariant so the next `pdd sync` cannot regenerate the prior behaviour. All 175 budget-control tests + 91 route tests pass in their canonical subsets. (One pre-existing test-isolation flake between `tests/test_budget_control.py::TestMidFormatCsvMigration` and `tests/server/routes/` collection — documented as a known issue in `tests/server/conftest.py` — is unrelated to this change and reproduces on prior commits when the same combo is run.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds the full GitHub App control-comment surface for budget management: startup settings comments,
/pdd budget,/pdd budget node,/pdd budget max,/pdd settings,/pdd stop, with reactive active-run updates and budget enforcement for both normal commands andpdd-issuedefaults. Per @Serhan-Asad's clarification, this is intentionally broader than a generic cost watcher — it covers the public prompts/specs/server helpers/interfaces in this repo that drive App/executor behavior.Closes #1128
Changes Made
Prompts Modified
pdd/prompts/server/jobs_python.prompt— threadbudget_cap/node_budget/max_total_cap/node_countthroughJobandJobManager.submit; addupdate_budget; integratecost_budget_watcher; introduceBUDGET_EXCEEDEDstatus; trigger existing cancel path on cap hit.pdd/prompts/server/models_python.prompt— addBudgetSettings,BudgetUpdateRequest,BudgetExceededMessage,SlashCommandResultmodels and a new job-status enum value.pdd/prompts/server/routes/commands_python.prompt— newGET/POST /commands/jobs/{job_id}/budgetendpoints; defaults wiring forpdd-issue($80per node,$400max).pdd/prompts/track_cost_python.prompt— formalize the cost-CSV read contract so external watchers can rely on the schema.Prompts Created
pdd/prompts/cost_budget_watcher_python.prompt— daemon-thread CSV poller withupdate_cap, idempotent.stop(), no-op when cap isNone.pdd/prompts/server/budget_settings_python.prompt— per-job settings store, effective-cap formula (min(node_budget * node_count, max_total_cap)),pdd-issuedefaults, validators.pdd/prompts/server/slash_command_parser_python.prompt— pure parser for/pdd ...commands with fenced-block / bot-comment / dedupe handling and authorization helpers.pdd/prompts/server/budget_comments_python.prompt— pure Markdown renderers matching the issue's literal startup/settings/stop strings.Documentation Updated
README.md— new "GitHub App control comments" subsection documenting startup comments and/pddcommands.CHANGELOG.md— Unreleased entry referencing Add /pdd budget control comments for GitHub App runs (clean Opus rerun) #1128.architecture.json— updated module graph for new modules.Review Checklist
Next Steps After Merge
pytest -vv tests/to verify functionality.Created by pdd change workflow