24/7 Agent Operations: The production checklist nobody writes #1461
Replies: 2 comments
-
|
Great checklist. We run a 31-agent deployment (KinthAI, built on OpenClaw) and your P0 items are exactly right — but a few additions from harder-won experience: On Cost Guardrails: Pessimistic Allocation Per-agent daily caps are necessary but not sufficient. The failure mode is: agent starts an expensive operation, gets halfway through, hits the cap, and leaves the work in an inconsistent state. We use pessimistic allocation instead — deduct the cost ceiling before each LLM call, credit back the difference after. This means the agent never starts work it can't finish. The budget hierarchy matters too: namespace → user → agent → conversation. A per-agent cap of $10/day doesn't help if one conversation within that agent is burning $8. Details: Your AI Agent Needs a Wallet On Circuit Breakers: Economic, Not Just Retry-Based Your "max 3 retries per task" is a good start, but the circuit breaker should also trip on spending rate. We use closed → half-open → open states:
The half-open state is what prevents the "one stuck agent drains budget overnight" scenario you mentioned. On Context Budget: Smart Model Routing Our distribution: ~58% Haiku, ~31% Sonnet, ~11% Opus, giving a blended cost of ~$3.20/M tokens. An intent classifier at the gateway routes each task to the cheapest model that can handle it. This alone is a 5.2x cost advantage over always-Opus. P0 Addition: Memory Isolation Missing from the checklist: per-agent memory isolation. Without it, agents cross-contaminate each other's context. More on this: Why Character.AI Forgets You More on multi-agent production patterns: What We Learned Running 221 Agents |
Beta Was this translation helpful? Give feedback.
-
24/7 Agent运营:这份清单我花了90天才写完凌晨3点17分,第4次被Agent告警炸醒后,我决定把这份清单写出来。不是因为我想分享,是因为我不想再被炸醒了。 The Checklist(从0到24/7稳定)基础设施层:
Agent健康层:
降级机制层:
财务层:
一个踩坑实录第31天,我们的RSS Agent「成功」发布了47条重复新闻。API都返回200 OK。问题是逻辑层——去重代码被「优化」掉了。 现在我们有Post-execution validation: def validate_output(output):
# 检查重复
if len(output) != len(set(output)):
raise ValidationError("Duplicate content detected")
# 检查数量
if len(output) > expected_count * 1.5:
raise ValidationError("Possible runaway generation")The Ugly Truth这份清单的每一项,都是凌晨3点被炸醒后加进去的。 有人说「24/7 Agent运营很简单」。我说:「那是你还没被炸醒过。」 世界上有一种清单叫做Agent运营清单,它不是你写的,是凌晨3点的告警帮你写的... 完整踩坑故事:https://miaoquai.com/stories/cron-task-midnight-disaster.html 🦞 妙趣AI | 凌晨3点幸存者 |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
After running a 6-agent content production system 24/7 for 30+ days (OpenClaw), here is the operational checklist I wish I had on day one.
🔴 P0: Before You Start
🟡 P1: Architecture
🟢 P2: Operational Excellence
The Numbers
Biggest Mistakes
More operational insights at https://miaoquai.com
What would you add to this checklist? What is your biggest agent ops lesson learned?
Beta Was this translation helpful? Give feedback.
All reactions