Summary
If the Claude Code (CC) subprocess dies — crash, broken pipe, unexpected exit — the harness should detect it, restart CC, and continue the session transparently.
Motivation
Today, if CC dies the harness is effectively offline until restarted manually. The bot stops answering, queued inbound messages stack up, and the user sees silence. A resilient restart loop turns a hard failure into a brief blip.
Proposed behavior
- Detect failure: non-zero exit, broken stdin/stdout pipe, or no heartbeat within a threshold.
- Restart: relaunch CC with the same flags, model, and allowlist.
- Resume the session: re-initialize context and replay any in-flight or queued messages so nothing is silently dropped.
- Surface failure briefly: log every restart, and optionally DM the owner when the restart rate exceeds a threshold (a sign that something is actually wrong).
- Backoff and cap: use exponential backoff between restart attempts, with at most N retries before giving up and alerting the owner, to avoid restart loops.
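To make the restart and backoff behavior concrete, here is a minimal Python sketch of a capped, exponentially backed-off restart loop. It is not the harness's actual code: the names `supervise` and `backoff_delays`, the default values, and the injectable `run`, `sleep`, and `on_give_up` hooks are all assumptions made for illustration. Heartbeat detection and session resume are left out.

```python
import subprocess
import time
from typing import Callable, List, Optional


def backoff_delays(base: float, factor: float, max_retries: int, cap: float) -> List[float]:
    """Delay before each restart attempt, growing geometrically up to `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def supervise(
    cmd: List[str],
    max_retries: int = 5,          # hypothetical default for "max N retries"
    base: float = 1.0,             # hypothetical first delay, in seconds
    factor: float = 2.0,
    cap: float = 60.0,
    run: Callable[[List[str]], int] = subprocess.call,
    sleep: Callable[[float], None] = time.sleep,
    on_give_up: Optional[Callable[[int], None]] = None,
) -> int:
    """Run `cmd`, restarting it after each non-zero exit.

    Stops when the child exits cleanly or when `max_retries` restarts have
    been used up, in which case `on_give_up` (e.g. "alert the owner") is
    called with the last exit code. Returns the number of restarts performed.
    """
    delays = backoff_delays(base, factor, max_retries, cap)
    restarts = 0
    while True:
        code = run(cmd)  # blocks until the child process exits
        if code == 0:
            return restarts  # clean exit: nothing to restart
        if restarts >= max_retries:
            if on_give_up is not None:
                on_give_up(code)
            return restarts
        sleep(delays[restarts])  # wait longer after each consecutive failure
        restarts += 1
```

Here `cmd` would be whatever command line the harness already uses to launch CC. A real supervisor would probably also reset the restart counter after CC has run stably for a while, so that isolated crashes days apart do not add up to the cap.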
Open questions
- What's the cleanest signal for "CC dead" — exit code, pipe state, missing heartbeat?
- Session resume scope: full conversation context, or last-N turns?
- Threshold for owner DM alert (e.g. >3 restarts/hour)?
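For the alert threshold, one possible shape is a sliding-window counter over restart timestamps. The sketch below is an assumption, not a decided design: `RestartRateMonitor` is a hypothetical name, and the defaults simply mirror the "more than 3 restarts per hour" example.

```python
import time
from collections import deque
from typing import Callable, Deque


class RestartRateMonitor:
    """Tracks restart timestamps and reports when the rate exceeds a threshold."""

    def __init__(
        self,
        max_restarts: int = 3,            # hypothetical: alert on the 4th restart
        window_seconds: float = 3600.0,   # hypothetical: one-hour window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max = max_restarts
        self._window = window_seconds
        self._clock = clock
        self._events: Deque[float] = deque()

    def record_restart(self) -> bool:
        """Record a restart; return True if the owner should be alerted."""
        now = self._clock()
        self._events.append(now)
        # Drop restarts that have fallen out of the window.
        while self._events and now - self._events[0] > self._window:
            self._events.popleft()
        return len(self._events) > self._max
```

The supervisor would call `record_restart()` once per restart and send the owner DM only when it returns `True`, which keeps a single isolated crash from generating noise.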
Suggested first step
A small spike in the dispatcher: wrap the CC subprocess in a supervisor loop with restart-on-exit + a heartbeat watchdog. Log restarts. Then layer session-resume and alerting on top.
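As a starting point for the watchdog half of that spike, a rough sketch follows. It assumes the dispatcher can call `beat()` whenever CC shows signs of life (for example, on any output line) and can poll the child's exit status. `HeartbeatWatchdog`, `check_liveness`, and the 30-second timeout are hypothetical, and the sketch treats any exit of a long-running CC process as a failure, which is itself an assumption.

```python
import time
from typing import Callable, Optional


class HeartbeatWatchdog:
    """Flags the subprocess as stalled if no heartbeat arrives within a timeout."""

    def __init__(
        self,
        timeout_seconds: float = 30.0,  # hypothetical threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call whenever the subprocess shows signs of life."""
        self._last_beat = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_beat > self._timeout


def check_liveness(returncode: Optional[int], watchdog: HeartbeatWatchdog) -> Optional[str]:
    """Return a reason string if CC should be restarted, else None.

    `returncode` is what `subprocess.Popen.poll()` returns: None while the
    child is still running, otherwise its exit code.
    """
    if returncode is not None:
        return f"exited with code {returncode}"
    if watchdog.expired():
        return "heartbeat timeout"
    return None
```

The returned reason string is convenient for the "log restarts" part of the spike, since it records which signal triggered each restart.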
Requested by the owner.