Skip to content

Reliability improvements: leak fixes, two-tier cleanup, error diagnostics#12

Open
vsumner wants to merge 8 commits into
tintinweb:masterfrom
vsumner:feat/reliability-improvements
Open

Reliability improvements: leak fixes, two-tier cleanup, error diagnostics#12
vsumner wants to merge 8 commits into
tintinweb:masterfrom
vsumner:feat/reliability-improvements

Conversation

@vsumner

@vsumner vsumner commented Mar 18, 2026

Copy link
Copy Markdown

Summary

Hardens the subagent lifecycle with memory leak fixes, bounded resource cleanup, and improved error reporting.

  • Fix 11 memory leaks: session disposal, output file subscriptions, abort controllers, and event listener cleanup across
    agent-manager, agent-runner, and worktree modules
  • Two-tier cleanup: Tier 1 (1x retention) disposes sessions; Tier 2 (3x retention) fully removes stale records from the Map
    to prevent unbounded growth in long sessions
  • Async/sync flush race guard: asyncFlush now checks if syncFlush advanced writtenCount during the await, preventing
    duplicate JSONL entries in output files
  • Bounded output file reads: parseOutputFileResult reads only the last 256KB instead of the entire file
  • Smarter error tips: "Check provider status" tip only shown for likely API/model errors, not user-initiated stops or
    timeouts
  • Error message cleanup: Model info removed from record.error string (already carried in record.modelName)
  • Type safety: Removed any types from session_switch handler
  • Debug logging: Added process.env.DEBUG logging for captureStats failures instead of silent swallow
  • 35 new tests covering cleanup tiers, flush races, output file edge cases, group-join, and agent-runner lifecycle

Depends on

Merge #8 first, then rebase this PR onto master before merging.

Test plan

  • npx vitest run — 242 tests passing (4 pre-existing custom-agents.test.ts failures unrelated to this branch)
  • Tier 1 cleanup: session disposed, record retained in Map
  • Tier 2 cleanup: record fully removed after 3x retention
  • Flush race guard: no duplicate JSONL entries when syncFlush runs during async await
  • Error tip suppressed for abort/timeout/stopped errors
  • record.error no longer includes [model: ...] suffix

vsumner added 4 commits March 18, 2026 15:01
Resource leaks fixed:
- Session shutdown now disposes groupJoin, widget, agentActivity, batch timer
- outputCleanup called in dispose() and removeRecord() before session disposal
- AbortController nulled after completion/error to prevent retention
- DefaultResourceLoader disposed in agent-runner finally block
- execFileSync maxBuffer (10MB) added to all git calls in worktree.ts
- finishedTurnAge entries pruned when exceeding ERROR_LINGER_TURNS
- Stale typeListText cache replaced with inline buildTypeListText()
- session_switch now updates currentCtx when provided
- waitForAll gets timeout parameter (default 5min) with Promise.race
- output-file rewritten to use async writes with batching
- resetDefaults() added to agent-runner for global state cleanup

Test coverage added:
- agent-runner: resetDefaults, get/set for maxTurns and graceTurns
- group-join: registration, completion flow, timeout, stragglers, dispose
- output-file: JSONL format, async/sync flush, no-duplication, toolResult
- agent-manager: outputCleanup, abortController lifecycle, waitForAll
P1 fixes:
- Extract releaseRecordResources() helper to deduplicate 4x cleanup pattern
- Replace dynamic require("node:fs") with static appendFileSync import
- Clear timeout timer in waitForAll when Promise.allSettled wins the race

P2 fixes:
- Fix asyncFlush/syncFlush race: advance writtenCount only after successful write
- Narrow (loader as any) cast to { dispose?(): void }
- Replace flaky 50ms sleeps in tests with deterministic file-size polling
- Expand inline batchFinalizeTimer cleanup to multi-line

P3 fixes:
- Remove redundant _INITIAL suffix from constants (DEFAULT_MAX_TURNS, GRACE_TURNS)
- Shorten over-verbose widget prune comment to inline
- Use queueMicrotask for asyncFlush re-entry instead of direct recursion
- Two-tier cleanup: Tier 1 disposes session but keeps record metadata
  (stats, modelName, result) so get_subagent_result works after 10+ min.
  Tier 2 (clearCompleted) does full removal on session switch.
- Output file fallback: get_subagent_result recovers from on-disk JSONL
  transcript when record is fully cleaned up.
- Error diagnostics: error messages include model name; get_subagent_result
  shows actionable tip on errors.
- Configurable cleanup retention via /agents Settings menu.
- Model name threaded through spawn → notifications → renderer.
- Path traversal guard on agent_id in fallback path.
- Extracted captureStats() helper (DRY: 3 identical blocks → 1 method).
- 20 new tests covering cleanup tiers, stats capture, error diagnostics,
  parseOutputFileResult, and modelName threading.
…stics

- Guard asyncFlush against duplicate JSONL entries when syncFlush runs
  during the await (output-file.ts)
- Bound parseOutputFileResult to last 256KB instead of reading whole file
- Stop appending model info to error string (record.modelName suffices)
- Show "check provider status" tip only for likely API errors, not
  user-initiated stops or timeouts
- Add Tier 2 cleanup (3x retention) to fully remove stale records and
  prevent unbounded Map growth
- Remove `any` types from session_switch handler
- Add debug-level logging for captureStats catch block
- Update tests: tier1/tier2 cutoff coverage, model-name assertion,
  race-condition test stabilization
@vsumner vsumner force-pushed the feat/reliability-improvements branch from 3c563a0 to dd79de0 Compare March 18, 2026 19:10
vsumner added 4 commits March 18, 2026 15:34
Ejecting a default agent with no explicit tool list wrote
"tools: all" in frontmatter. On reload, "all" was parsed
literally as ["all"], which mapped to zero built-in tools
since "all" is not in TOOL_FACTORIES.

Two-sided fix:
- csvList() now treats "all" as equivalent to omitted,
  returning all defaults
- ejectAgent() now writes explicit tool names instead of
  "all" when builtinToolNames is empty/undefined

Test isolation: mock node:os homedir() in custom-agents
tests so global agent files in ~/.pi/agent/agents/ do not
pollute results. Tests now pass without HOME workaround.
Address Node.js review findings:

- csvList: inline parsing logic instead of delegating to
  parseCsvField, eliminating redundant String(val).trim()
- ejectAgent: preserve tools: none round-trip instead of
  silently expanding to all tools when builtinToolNames
  is an empty array
- loadFromDir: add DEBUG-gated console.debug for catch
  blocks so silent failures are diagnosable
- Tests: add global-dir mock wiring test proving agents
  load from mocked homedir with source: global, and
  project-overrides-global test for same-name agents
The csvList "all" fix (returning empty defaults) caused
inheritField to silently disable extensions/skills when
the frontmatter value was the string "all". csvList returns
[] for "all" with empty defaults, which inheritField mapped
to false.

Fix: check for "all" in inheritField before delegating to
csvList, returning true (inherit all) to match the intent.
README:
- Remove duplicate isolation row in frontmatter table
- Document "all" as valid value for tools, extensions,
  and skills fields

Code:
- Add DEBUG-gated logging to bare catch blocks in
  output-file.ts (parseOutputFileResult), memory.ts
  (safeReadFile), and agent-manager.ts (worktree
  cleanup on error) for diagnosability
@tintinweb

Copy link
Copy Markdown
Owner

thanks @vsumner! I'll look into your prs after merging the next feats ♥️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants