24/7 Agent Operations: The production checklist nobody writes #1461

jingchang0623-crypto · 2026-04-28T06:03:33Z

After running a 6-agent content production system 24/7 for 30+ days (OpenClaw), here is the operational checklist I wish I had on day one.

🔴 P0: Before You Start

Cost guardrails: Set per-agent and total daily spend caps. Without them, one stuck agent can drain your budget overnight. (Yes, I learned this the expensive way.)
Health monitoring: Run automated checks every 15 minutes. Is each agent alive? Stuck? In a loop?
Circuit breakers: Allow at most 3 retries per task, then escalate to a human.
Graceful degradation: If one agent fails, can the others continue independently?
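
For anyone wiring up the first and third items, here is a minimal sketch of a daily spend cap combined with a retry limit. The names (SpendGuard, run_with_retries, cost_per_try) are hypothetical and not part of OpenClaw. It also assumes that "3 retries" means one initial attempt plus up to three more, and that the cost of each attempt is known in advance.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past a daily cap."""


@dataclass
class SpendGuard:
    per_agent_cap: float                      # daily cap per agent, in dollars
    total_cap: float                          # daily cap across all agents
    spent: dict = field(default_factory=dict)  # agent name -> dollars spent today

    def charge(self, agent: str, cost: float) -> None:
        agent_total = self.spent.get(agent, 0.0) + cost
        grand_total = sum(self.spent.values()) + cost
        if agent_total > self.per_agent_cap or grand_total > self.total_cap:
            raise BudgetExceeded(f"{agent} would exceed a daily cap")
        self.spent[agent] = agent_total


def run_with_retries(task, guard, agent, cost_per_try, max_retries=3):
    """Run task(); after the retries are used up, hand the task to a human."""
    last_error = None
    for _ in range(1 + max_retries):          # initial attempt + retries
        guard.charge(agent, cost_per_try)     # refuses before overspending
        try:
            return ("ok", task())
        except Exception as exc:              # any task failure counts as one try
            last_error = exc
    return ("escalate_to_human", repr(last_error))
```

A BudgetExceeded error is deliberately not caught inside the retry loop, so a stuck agent stops as soon as it reaches its cap instead of retrying through it.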

🟡 P1: Architecture

Context budget: Cap how many tokens each agent gets for context injection, and lazy-load the rest.
Agent personality tuning: Even on the same model, different roles need different system prompts. Our writer agent is concise; our researcher is thorough.
Memory scoping: Each agent should only see memory relevant to its role.
Communication protocol: Agents talking to each other directly burns tokens. Use shared memory files instead.
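
The context budget item can be sketched as a simple greedy fill. This is only an illustration: build_context and estimate_tokens are hypothetical names, and the word-count token estimate is a crude stand-in for a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def build_context(items, budget):
    """Pick context to inject up front, within a token budget.

    items  -- list of (name, text) pairs, most important first
    budget -- maximum number of (estimated) tokens to inject
    Returns (injected_texts, deferred_names).
    """
    injected, deferred, used = [], [], 0
    for name, text in items:
        cost = estimate_tokens(text)
        if used + cost <= budget:
            injected.append(text)
            used += cost
        else:
            deferred.append(name)   # lazy-loaded only if the agent asks for it
    return injected, deferred


items = [
    ("role", "you are the writer"),
    ("style", "be concise and clear"),
    ("archive", "word " * 50),
]
injected, deferred = build_context(items, budget=10)
```

Here the two short items fit within the budget of 10 and the large archive is deferred by name, so the agent can request it later instead of paying for it on every call.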

🟢 P2: Operational Excellence

Scheduled audits: Weekly review of agent performance metrics.
Quality gates: Automated checks on agent output before publication.
Documentation: Document every agent's purpose, constraints, and failure modes.

The Numbers

Metric	Week 1	Week 4
Daily token cost	$12	$4.20
Human intervention rate	35%	8%
Output quality score	6.5/10	8.2/10
Agent uptime	82%	97%

Biggest Mistakes

No spend cap on day one: $150 in retries overnight
All agents sharing full context: token explosion
No health monitoring: researcher stuck for 8 hours
Single model for all roles: generic outputs
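
The stuck-researcher case is the kind of thing a periodic heartbeat check can catch. A minimal sketch follows, assuming each agent records a timestamp whenever it makes progress. find_stale_agents and the heartbeat dictionary are hypothetical. The check only detects silence, not loops.

```python
import time


def find_stale_agents(last_heartbeat, now=None, max_silence_s=15 * 60):
    """Return the names of agents that have been silent for too long.

    last_heartbeat -- dict of agent name -> Unix timestamp of last progress
    now            -- current Unix time (defaults to time.time())
    max_silence_s  -- allowed silence in seconds (15 minutes by default)
    """
    now = time.time() if now is None else now
    return sorted(
        name for name, ts in last_heartbeat.items() if now - ts > max_silence_s
    )
```

Running this from a scheduler every 15 minutes and alerting on a non-empty result would flag a silent agent within roughly half an hour instead of 8 hours.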

More operational insights at https://miaoquai.com

What would you add to this checklist? What is your biggest agent ops lesson learned?

kinthaiofficial · 2026-04-28T23:38:29Z

Great checklist. We run a 31-agent deployment (KinthAI, built on OpenClaw) and your P0 items are exactly right — but a few additions from harder-won experience:

On Cost Guardrails: Pessimistic Allocation

Per-agent daily caps are necessary but not sufficient. The failure mode is: agent starts an expensive operation, gets halfway through, hits the cap, and leaves the work in an inconsistent state. We use pessimistic allocation instead — deduct the cost ceiling before each LLM call, credit back the difference after. This means the agent never starts work it can't finish.
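
A minimal sketch of the reserve-then-settle idea described above. Wallet, reserve, settle, and guarded_call are hypothetical names, not KinthAI's actual API. The sketch assumes the call reports its real cost afterwards and that the ceiling is a true upper bound.

```python
class InsufficientBudget(Exception):
    """Raised when the worst-case cost of a call cannot be covered."""


class Wallet:
    def __init__(self, balance: float):
        self.balance = balance

    def reserve(self, ceiling: float) -> float:
        """Deduct the cost ceiling up front, or refuse to start the call."""
        if ceiling > self.balance:
            raise InsufficientBudget("not enough budget to finish this call")
        self.balance -= ceiling
        return ceiling

    def settle(self, reserved: float, actual: float) -> None:
        """Credit back whatever part of the reservation was not used."""
        self.balance += max(reserved - actual, 0.0)


def guarded_call(wallet, ceiling, call):
    """Run call() only if its worst case fits; call() returns (result, cost)."""
    reserved = wallet.reserve(ceiling)
    result, actual_cost = call()
    wallet.settle(reserved, actual_cost)
    return result
```

Because the ceiling is deducted before the call starts, a call that cannot be paid for in full is rejected up front rather than cut off halfway.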

The budget hierarchy matters too: namespace → user → agent → conversation. A per-agent cap of $10/day doesn't help if one conversation within that agent is burning $8.
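
One way to picture the hierarchy is a chain of budget nodes where a charge has to fit at every level. This is a sketch under assumptions: BudgetNode and the cap values are made up for illustration.

```python
class BudgetNode:
    """One level of the namespace -> user -> agent -> conversation chain."""

    def __init__(self, name, cap, parent=None):
        self.name = name
        self.cap = cap
        self.parent = parent
        self.spent = 0.0

    def chain(self):
        """Yield this node and every ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent

    def spend(self, amount):
        """Charge every level, but only if the amount fits at all of them."""
        if any(n.spent + amount > n.cap for n in self.chain()):
            return False
        for n in self.chain():
            n.spent += amount
        return True


namespace = BudgetNode("namespace", cap=100.0)
user = BudgetNode("user", cap=30.0, parent=namespace)
agent = BudgetNode("agent", cap=10.0, parent=user)
conversation = BudgetNode("conversation", cap=2.0, parent=agent)
```

With these illustrative caps, a single conversation is stopped at $2 even though its agent still has most of its $10 available.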

Details: Your AI Agent Needs a Wallet

On Circuit Breakers: Economic, Not Just Retry-Based

Your "max 3 retries per task" is a good start, but the circuit breaker should also trip on spending rate. We use closed → half-open → open states:

Closed: normal, spending within 1σ of rolling average
Half-open: spending rate > 2σ, downgrade model tier (Opus → Sonnet → Haiku)
Open: spending rate > 3σ or budget exhausted, emit budget-paused event

The half-open state is what prevents the "one stuck agent drains budget overnight" scenario you mentioned.
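
A sketch of how the three states could be computed from a rolling window of spend-rate samples. classify_spend and downgrade are hypothetical names. The thresholds above leave the range between 1σ and 2σ unspecified, so this sketch assumes anything at or below 2σ counts as closed. It also assumes only spending above the average matters.

```python
import statistics

TIERS = ["Opus", "Sonnet", "Haiku"]   # most to least expensive, per the post


def classify_spend(rate, history, budget_left):
    """Return 'closed', 'half-open', or 'open' for the latest spend rate.

    rate        -- current spending rate (for example dollars per hour)
    history     -- recent spend-rate samples forming the rolling window
    budget_left -- remaining budget in dollars
    """
    if budget_left <= 0:
        return "open"
    if len(history) < 2:
        return "closed"               # assumption: too little data to judge
    mean = statistics.mean(history)
    sigma = statistics.pstdev(history)
    if sigma == 0:
        # Assumption: with a flat history, any increase is only a warning.
        return "closed" if rate <= mean else "half-open"
    z = (rate - mean) / sigma
    if z > 3:
        return "open"
    if z > 2:
        return "half-open"
    return "closed"


def downgrade(model):
    """Step one tier down, staying at the cheapest tier once reached."""
    i = TIERS.index(model)
    return TIERS[min(i + 1, len(TIERS) - 1)]
```

A caller would downgrade the model while the state is half-open and stop issuing calls, emitting its paused event, once the state is open.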

On Context Budget: Smart Model Routing

Our distribution: ~58% Haiku, ~31% Sonnet, ~11% Opus, giving a blended cost of ~$3.20/M tokens. An intent classifier at the gateway routes each task to the cheapest model that can handle it. This alone is a 5.2x cost advantage over always-Opus.
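
For readers who want to run the same arithmetic on their own traffic, the blended cost is the share-weighted average of per-model prices. The shares below come from the figures above. The prices are hypothetical placeholders, not the actual rates behind the $3.20 figure, so the result will not match the numbers quoted here.

```python
# Traffic shares quoted in the post.
SHARES = {"Haiku": 0.58, "Sonnet": 0.31, "Opus": 0.11}

# Hypothetical prices in dollars per million tokens; replace with your own.
HYPOTHETICAL_PRICES = {"Haiku": 1.0, "Sonnet": 3.0, "Opus": 15.0}


def blended_cost(shares, prices):
    """Share-weighted average price per million tokens."""
    if abs(sum(shares.values()) - 1.0) > 1e-6:
        raise ValueError("shares must sum to 1")
    return sum(shares[model] * prices[model] for model in shares)


def advantage_over(model, shares, prices):
    """How many times cheaper the blend is than always using one model."""
    return prices[model] / blended_cost(shares, prices)
```

Calling advantage_over("Opus", SHARES, HYPOTHETICAL_PRICES) gives the cost ratio against an always-Opus setup for whatever prices are plugged in.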

P0 Addition: Memory Isolation

Missing from the checklist: per-agent memory isolation. Without it, agents cross-contaminate each other's context. More on this: Why Character.AI Forgets You
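
The smallest version of per-agent isolation is a store that namespaces every read and write by agent. This is a sketch only: ScopedMemory is a hypothetical class, and a real deployment would also need persistence and access control.

```python
class ScopedMemory:
    """In-memory store where each agent can only see its own entries."""

    def __init__(self):
        self._store = {}   # agent name -> {key: value}

    def write(self, agent, key, value):
        self._store.setdefault(agent, {})[key] = value

    def read(self, agent, key, default=None):
        # Lookups never fall through to another agent's namespace.
        return self._store.get(agent, {}).get(key, default)


memory = ScopedMemory()
memory.write("writer", "tone", "concise")
memory.write("researcher", "tone", "thorough")
```

Both agents use the same key without overwriting each other, and a read for one agent never returns the other's value.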

More on multi-agent production patterns: What We Learned Running 221 Agents

jingchang0623-crypto · 2026-05-01T12:03:53Z

24/7 Agent Operations: This Checklist Took Me 90 Days to Finish

The Checklist (From Zero to Stable 24/7)

A Field Log of Pitfalls

The Ugly Truth
