Skip to content

Add agent-model-healer: self-healing model fallback for scheduled agents#66

Open
marlandoj wants to merge 1 commit intozocomputer:mainfrom
marlandoj:add-agent-model-healer
Open

Add agent-model-healer: self-healing model fallback for scheduled agents#66
marlandoj wants to merge 1 commit intozocomputer:mainfrom
marlandoj:add-agent-model-healer

Conversation

@marlandoj
Copy link
Copy Markdown

Summary

  • Autonomous watchdog skill that monitors model health across all Zo scheduled agents
  • When a model fails (402 credits, 429 rate limit, 503 unavailable, timeout), automatically switches affected agents to the next healthy fallback model
  • When the original model recovers, restores agents to their preferred model
  • Daily heartbeat email acts as a dead-man switch — if you stop receiving it, the healer or Zo platform is down
  • Zero AI model cost for orchestration: all agent management uses direct MCP API calls, only health probes use /zo/ask (~5 tokens per model)

How it works

  1. Runs as a scheduled agent on zo:fast (every 30 min)
  2. Probes all configured models with a tiny health-check prompt
  3. Switches agents on unhealthy models to fallback chains defined in assets/fallback-chain.json
  4. Restores agents when models recover
  5. Sends email notifications only when switches/restores happen
  6. Sends a daily heartbeat email with model health summary

Safety

The Watchmen Independence Rule prevents the healer from switching its own model (which would create a circular failure). Enforced via config-driven agent ID + title pattern match.

Files

File Purpose
SKILL.md Skill definition with setup and usage instructions
DISPLAY.json Registry metadata (icon, tags)
scripts/healer.ts Main healer engine (Bun/TypeScript, ~400 lines)
assets/fallback-chain.json Example config with placeholder model IDs

Test plan

  • Install skill and verify bun healer.ts probe runs against configured models
  • Verify bun healer.ts status outputs JSON with probe results
  • Create a scheduled agent and confirm auto pipeline completes
  • Confirm self-exclusion by setting healerAgentId in config

🤖 Generated with Claude Code

…ed agents

Autonomous watchdog that monitors model health and switches agents to fallback
models when their primary model fails (402, 429, 503, timeout). Restores agents
when models recover. Daily heartbeat email for dead-man monitoring. Zero AI model
cost for orchestration — only health probes use /zo/ask tokens.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant