Skip to content

Auto-restart Claude Code subprocess on failure and resume the session #47

@Rustam-Z

Description

@Rustam-Z

Summary

If the Claude Code (CC) subprocess dies — crash, broken pipe, unexpected exit — the harness should detect it, restart CC, and continue the session transparently.

Motivation

Today, if CC dies the harness is effectively offline until restarted manually. The bot stops answering, queued inbound messages stack up, and the user sees silence. A resilient restart loop turns a hard failure into a brief blip.

Proposed behavior

  • Detect failure — non-zero exit, broken stdout/stdin pipe, or no heartbeat within a threshold.
  • Restart CC with the same flags / model / allowlist.
  • Resume the session — re-init context, replay any in-flight or queued messages so nothing's silently dropped.
  • Surface failure briefly — log it, and optionally notify the owner in DM if restarts happen above some rate (signal that something's actually wrong).
  • Backoff + cap — exponential backoff between restart attempts, max N retries before giving up and alerting the owner. Avoid restart loops.

Open questions

  • What's the cleanest signal for "CC dead" — exit code, pipe state, missing heartbeat?
  • Session resume scope: full conversation context, or last-N turns?
  • Threshold for owner DM alert (e.g. >3 restarts/hour)?

Suggested first step

A small spike in the dispatcher: wrap the CC subprocess in a supervisor loop with restart-on-exit + a heartbeat watchdog. Log restarts. Then layer session-resume and alerting on top.

Requested by the owner.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    Status
    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions