fix(test-runner): cap bun heap with --smol, env-gate scope:all, process-group kill on timeout #962

Merged
zaxbysauce merged 3 commits into main from claude/trusting-goldberg-a7543f on May 22, 2026

Conversation

@zaxbysauce
Owner

Summary

  • bun --smol on every test run: both the dispatch path (src/lang/default-backend.ts) and the legacy path (src/tools/test-runner.ts) now emit ['bun', '--smol', 'test', ...]. This caps bun's heap growth and eliminates OOM-driven OpenCode session crashes during broad test runs, matching the per-file CI pattern already used throughout the repo.

  • scope: 'all' is now environment-gated: the agent-settable allow_full_suite argument has been removed from TestRunnerArgs and the Zod schema. The only unlock is SWARM_ALLOW_FULL_SUITE=1 in the environment (CI/maintainer sessions only, not accessible via an LLM tool call). The blocked-response message deliberately omits the env var name so that an agent cannot learn the bypass from the error text.

  • Process-group kill on timeout: bunSpawn gains an opt-in killProcessTree: true flag. When set (only by the test-runner), the Node spawn path uses detached: true and kills via process.kill(-pid) on POSIX or taskkill /PID /T /F on Windows, reaping jest/vitest worker-pool descendants. The ~30 other bunSpawn callers are unaffected (opt-in only).

  • Prose docs aligned: AGENTS.md §6, docs/engineering-invariants.md, and all agent-facing skill files updated to reflect the new gate. Agent-facing docs do not reveal the env var name.
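
The process-group kill in the third bullet can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/utils/bun-compat.ts: the names planTreeKill, spawnDetached, and killTree are hypothetical, and SIGKILL is an assumed signal (the PR only states process.kill(-pid)).

```typescript
import { spawn, spawnSync, type ChildProcess } from "node:child_process";

// Hypothetical names throughout; the PR's real logic lives in src/utils/bun-compat.ts.
type KillPlan =
  | { kind: "signal"; target: number; signal: NodeJS.Signals }
  | { kind: "taskkill"; args: string[] };

// Pure decision step, so it can be checked without spawning anything.
export function planTreeKill(pid: number, platform: NodeJS.Platform): KillPlan {
  if (platform === "win32") {
    // /T includes child processes, /F forces termination.
    return { kind: "taskkill", args: ["/PID", String(pid), "/T", "/F"] };
  }
  // A negative pid addresses the whole process group led by `pid`.
  // SIGKILL is an assumption; the PR does not name the signal.
  return { kind: "signal", target: -pid, signal: "SIGKILL" };
}

// Spawn in array form, with the child as leader of its own process group.
export function spawnDetached(cmd: string[], cwd: string): ChildProcess {
  const [file, ...args] = cmd;
  if (file === undefined) throw new Error("empty command");
  return spawn(file, args, {
    cwd,
    detached: true,
    windowsHide: true,
    stdio: ["ignore", "pipe", "pipe"],
  });
}

export function killTree(child: ChildProcess): void {
  if (child.pid === undefined) return; // spawn never produced a process
  const plan = planTreeKill(child.pid, process.platform);
  if (plan.kind === "taskkill") {
    spawnSync("taskkill", plan.args, { windowsHide: true });
    return;
  }
  try {
    process.kill(plan.target, plan.signal);
  } catch (err) {
    // ESRCH: the group has already exited, which is fine on a timeout path.
    if ((err as NodeJS.ErrnoException).code !== "ESRCH") throw err;
  }
}
```

A negative pid only addresses a group when the child is that group's leader, which is what detached: true arranges on POSIX. Callers that spawn without it would need a plain child.kill() instead.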

Root cause

The tool crashed OpenCode sessions via OOM because (a) bun ran without --smol, so heap grew unbounded across a 50-file batch; (b) allow_full_suite: true was documented directly in the schema, teaching LLMs how to bypass the scope block; and (c) safety rested on prose warnings in AGENTS.md, which failed repeatedly because nothing in the tool itself enforced them.
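
Moving the gate from a tool argument to the environment, as described for (b) and (c), might look like the sketch below. It is an illustration under assumptions: checkScopeGate, GateResult, and the message wording are hypothetical; only the SWARM_ALLOW_FULL_SUITE=1 unlock and the rule that the message must not name the variable come from the PR.

```typescript
// Hypothetical shape; the real check lives in src/tools/test-runner.ts.
interface GateResult {
  allowed: boolean;
  message?: string;
}

export function checkScopeGate(
  scope: string,
  env: Record<string, string | undefined>,
): GateResult {
  // Narrower scopes are never gated by this check.
  if (scope !== "all") return { allowed: true };

  // The unlock comes from the process environment only. It is not a tool
  // argument, so an LLM tool call has no way to set it.
  if (env["SWARM_ALLOW_FULL_SUITE"] === "1") return { allowed: true };

  return {
    allowed: false,
    // Assumed wording. It deliberately does not name the env var.
    message: "scope 'all' is disabled in this session; use a narrower scope.",
  };
}
```

A caller would pass process.env as the second argument. Taking the environment as a parameter keeps the check testable without mutating global state.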

Invariant audit

  • 1 (plugin init): not touched. No changes to the init path, startup subprocess, or withTimeout wrapping.
  • 2 (runtime portability): not touched. No new bun: imports outside bun-compat.ts; dist/index.js passes the node --input-type=module import check; bundle-portability.test.ts / bundle-plugin-shape.test.ts unaffected.
  • 3 (subprocesses): touched. src/utils/bun-compat.ts: added killProcessTree opt-in; Node path uses detached: true + process.kill(-pid) (POSIX) / taskkill /T (Windows); array-form spawn only; explicit cwd; stdin: 'ignore' inherited; timeout unchanged; all other ~30 callers unaffected (opt-in). node scripts/repro-704.mjs passed all 3 timing assertions. dist import OK verified.
  • 4 (.swarm containment): not touched. No new process.cwd() callers; no .swarm/ write path changes.
  • 5 (plan durability): not touched. No ledger, projection, or schema changes.
  • 6 (test_runner safety): touched. allow_full_suite arg removed; SWARM_ALLOW_FULL_SUITE env gate added; --smol added to both bun command builders; killProcessTree: true added to test-runner spawn. Tests (pass/fail): test-runner.test.ts 115/0, test-runner-scope-cap.test.ts 14/2 (2 pre-existing), test-runner-dispatch-parity.test.ts 19/0, test-runner-history.test.ts 20/0, bun-compat.test.ts 8/0.
  • 7 (test writing): touched. All modified test files use bun:test; no mock.module() calls in changed tests; os.tmpdir() + path.join() used throughout; env save/restore via try/finally in every new env-mutation test.
  • 8 (session state): not touched. No module-level globals changed.
  • 9 (guardrails/retry): not touched. No transient-error or circuit-breaker changes.
  • 10 (chat/system msg): not touched. No message hook changes.
  • 11 (tool registration): not touched. No new tools; test_runner schema narrowed (arg removed, not added).
  • 12 (release/cache): not touched. Version files untouched; release fragment at docs/releases/pending/test-runner-smol-env-gate-process-kill.md.
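
The env save/restore pattern named under invariant 7 can be sketched as a small helper. This is an assumed shape, not code from the PR: withEnv is a hypothetical name, and the PR's tests may inline the try/finally instead.

```typescript
// Hypothetical helper: run `fn` with one environment variable set (or unset),
// then restore the previous state even if `fn` throws.
export function withEnv<T>(
  name: string,
  value: string | undefined,
  fn: () => T,
): T {
  const saved = process.env[name]; // undefined means "was not set"
  try {
    if (value === undefined) delete process.env[name];
    else process.env[name] = value;
    return fn();
  } finally {
    // Restore by deleting when the variable was absent; assigning undefined
    // would store the string "undefined" instead.
    if (saved === undefined) delete process.env[name];
    else process.env[name] = saved;
  }
}
```

This version is synchronous. For an async test body the restore would have to wait on the returned promise (for example with an async wrapper and await fn() inside the try), otherwise the variable is restored before the body finishes.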

Test plan

  • tsc --noEmit - exit 0, no type errors
  • biome ci . - 0 errors (17 pre-existing warnings in src/hooks/shell-write-detect.ts, verified on main)
  • bun run build - exit 0; dist import OK (node ESM import)
  • node scripts/repro-704.mjs - T1/T2/T3 all OK
  • test-runner.test.ts - 115 pass / 4 skip / 0 fail
  • test-runner-scope-cap.test.ts - 14 pass / 2 fail (pre-existing; baseline has 3 fail)
  • test-runner-dispatch-parity.test.ts - 19 pass / 0 fail
  • test-runner-history.test.ts - 20 pass / 0 fail
  • bun-compat.test.ts - 8 pass / 0 fail
  • tests/unit/services/** - 0 fail across all 34 service test files
  • tests/unit/agents/** + tests/unit/hooks/** - 0 fail
  • tests/unit/config - 1244 pass / 5 fail (5 pre-existing, identical count on origin/main)
  • tests/unit/cli + tests/unit/commands - 1040 pass / 471 fail (identical on origin/main; pre-existing)

Pre-existing failures (not regressions)

All confirmed on origin/main baseline:

  • test-runner-scope-cap.test.ts tests 9 and 11: expect 'scope_exceeded' but get 'error'; predate MAX_SAFE_SOURCE_FILES=1. My branch reduced pre-existing failures from 3 → 2.
  • tests/unit/cli / tests/unit/commands: 471 failures matching origin/main exactly.
  • tests/unit/config: 5 failures matching origin/main exactly.
  • diagnostic-gating, doc-scan, phase-complete-lean-turbo-*, sast-and-cochanger-tools, tool-registration-conformance, update-task-status.gate-fix, pre-check-batch, registration-smoke, repo-graph-codex-adversarial, repo-graph.adversarial: all in untouched files, all matching baseline.

codex and others added 2 commits May 22, 2026 14:02
…ss-group kill on timeout

- Add `--smol` to both bun command builders (dispatch default path in
  `src/lang/default-backend.ts` and legacy path in `src/tools/test-runner.ts`)
  to cap heap growth and prevent OOM-driven session crashes.

- Remove agent-settable `allow_full_suite` arg from `TestRunnerArgs` and
  Zod schema; gate `scope:'all'` behind `SWARM_ALLOW_FULL_SUITE` env var
  (CI/maintainer only, not accessible via LLM tool call). Error message
  deliberately omits the env var name.

- Add opt-in `killProcessTree: true` to `BunCompatSpawnOptions`; test-runner
  spawn sets this flag. Node path spawns `detached: true` and kills via
  `process.kill(-pid)` (POSIX) or `taskkill /T` (Windows). Default behaviour
  of all other ~30 `bunSpawn` callers is unchanged.

- Update `AGENTS.md` §6, `docs/engineering-invariants.md`, and all
  agent-facing skill files to reflect the new gate. Agent-facing docs do not
  reveal the env var name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…leanup, windowsHide)

- Add stdin: 'ignore' to bunSpawn call in runTests() to prevent Windows
  stdin-pipe block on child exit (AGENTS.md invariant 3, v7.3.3 regression)

- Capture and clear setTimeout handle in timeoutPromise so the timer is
  cancelled when the process exits normally, matching the bun-compat.ts
  pattern (lines 546-548)

- Add windowsHide: true to bunSpawn Node spawn path for consistency with
  bunSpawnSync and to prevent console flash on Windows with detached: true

- Add negative assertion: blocked-scope error must not contain
  'SWARM_ALLOW_FULL_SUITE' env var name, hardening the security guard
  against future accidental leakage

- Fix dispatch-parity test expectation for vitest command (rebase artifact:
  main added --reporter=json --outputFile args; test now matches actual code)

- Fix parsed.error type safety in MAX_SAFE_SOURCE_FILES test (use ?? '')
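
The timer cleanup described in this commit (capture the setTimeout handle and clear it when the process exits normally) can be sketched as below. This is an assumed shape: raceWithTimeout and its parameters are hypothetical names, and the real timeoutPromise in runTests() may be structured differently.

```typescript
// Hypothetical helper: race `work` against a timer, and always cancel the
// timer afterwards so it cannot keep the event loop alive or fire late.
export async function raceWithTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout: () => void,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      onTimeout(); // e.g. kill the child's process tree
      reject(new Error(`timed out after ${ms}ms`));
    }, ms);
  });

  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on both outcomes. Clearing an already-fired timer is a no-op.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Without the clearTimeout, a test run that finishes in a few seconds would still leave a pending timer for the full timeout window, and its callback would later try to kill a process that has already exited.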
@zaxbysauce force-pushed the claude/trusting-goldberg-a7543f branch from 7a255f7 to 40f1b72 on May 22, 2026 21:15
zaxbysauce added a commit that referenced this pull request May 22, 2026
…te refs, add subprocess patterns

- swarm-pr-review: add STALE BRANCH CHECK block before explorer dispatch.
  A branch behind main makes origin/main..HEAD show all of main's changes,
  inflating apparent scope (22 files appeared as 152 in PR #962). Documents
  git cat-file parent-SHA comparison and the correct git diff <parent>..<head>
  command for true PR scope.

- running-tests: fix duplicate 'all' scope row; update gate from removed
  allow_full_suite: true to SWARM_ALLOW_FULL_SUITE=1 env var (PR #962)

- engineering-conventions: update scope:'all' gate description from
  allow_full_suite arg to SWARM_ALLOW_FULL_SUITE env var; add windowsHide
  and clearTimeout patterns to the subprocess invariant description
@zaxbysauce
Owner Author

@copilot resolve the merge conflicts in this pull request

Co-authored-by: zaxbysauce <10211642+zaxbysauce@users.noreply.github.com>
Contributor

Copilot AI commented May 22, 2026

@copilot resolve the merge conflicts in this pull request

Resolved and pushed in commit bf1e76f: merged origin/main, fixed the two skill-doc conflicts, and completed the merge commit so the PR is no longer in a conflicted state.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • https://api.github.com/graphql
    • Triggering command: /usr/bin/gh gh pr view 962 --json number,headRefName,baseRefName,mergeStateStatus,isDraft,headRefOid,url (http block)

@zaxbysauce merged commit 2ea310a into main on May 22, 2026
12 checks passed
@zaxbysauce deleted the claude/trusting-goldberg-a7543f branch on May 22, 2026 22:06

3 participants