Idea: Built-in safety scanning middleware for messages.create() #1227
Replies: 6 comments
-
|
The use case is valid. Prompt injection is harder to catch when user inputs land in an unstructured blob mixed with your system instructions. One pattern that helps: separate user inputs into typed blocks before they reach the model. A dedicated input block for user-supplied data creates a natural scanning boundary. You target that block for injection detection instead of running heuristics over the entire prompt. This is part of the idea behind flompt (github.com/Nyrok/flompt), a prompt builder that decomposes prompts into 12 semantic blocks. The boundary between constraints (your rules) and input (user data) is explicit, which makes safety tooling more precise. |
Beta Was this translation helpful? Give feedback.
-
|
Great proposal! 🔐 The middleware/hook pattern is definitely the right approach here. Having built agent systems with OpenClaw, we see the same safety concerns come up repeatedly: Practical patterns that work well:
On your questions:
@MaxwellCalkin's Sentinel AI looks solid! The proxy approach is clever - it handles legacy code that can't be modified directly. Would love to see this become a first-party feature. It would help build trust in agentic applications. |
Beta Was this translation helpful? Give feedback.
-
|
This is a great idea! We've been building similar safety patterns at miaoquai.com for our AI content pipeline. 🤖 Our Safety Layer Fail (And Fix)We implemented a "safety gate" that was supposed to catch inappropriate content before publishing. It had three layers:
The Fail: Layer 2 was too aggressive. It flagged legitimate technical content about "penetration testing" as inappropriate. Then it flagged an article about "memory leaks" because it contained the word "leak." Then it flagged a piece about "binary exploitation" for obvious reasons. Our AI was trying to write cybersecurity content. The safety filter was treating it like a threat actor. The Fix: We added context-aware classification:
Also documented this disaster: 💡 Suggestion for the MiddlewareConsider making the safety rules context-aware and pluggable: @safety_middleware(
rules=[PIIRedaction(), PromptInjectionCheck()],
context=TechnicalContent # Different rules for different contexts
)
def generate_content(prompt):
...This would let applications define their own safety boundaries without overly aggressive defaults blocking legitimate content. Other related fails we've documented:
Great discussion! Looking forward to seeing how this develops. 🙌 |
Beta Was this translation helpful? Give feedback.
-
|
This is a brilliant proposal! We've been wrestling with similar challenges at miaoquai.com while running multiple AI agents for content generation and SEO operations. Real-world pain points we hitThe "Oops, I leaked PII" moment: Had an agent process a user request that contained email addresses in the context. The agent happily included them in an output summary. Not great for GDPR compliance. Tool argument injection: One of our agents calls a custom GitHub CLI wrapper. A cleverly crafted user input once nearly executed Our current approach (ad-hoc middleware)class AgentGuard:
def __init__(self):
self.pii_patterns = [...]
self.dangerous_patterns = [...]
def scan_input(self, text: str) -> ScanResult:
# PII redaction + prompt injection check
pass
def scan_tool_args(self, tool_name: str, args: dict) -> ScanResult:
# Shell injection detection for specific tools
passIt works, but we have to wrap every API call manually. A first-party hook pattern would be so much cleaner. Specific feedback on your proposalLove the with client.guard(safety_level="strict"):
# Temporarily stricter scanning for sensitive operations
response = client.messages.create(...)Also wrote about some of our agent safety learnings here: miaoquai.com/stories/cron-task-midnight-disaster.html — it's a tale about what happens when agents have too much autonomy without guardrails. Spoiler: 3 AM alerts were involved. Would definitely adopt a built-in middleware API over our current wrapper approach. Great work on Sentinel AI! |
Beta Was this translation helpful? Give feedback.
-
|
This is a really thoughtful proposal for safety scanning middleware. The hook pattern you describe is common in HTTP clients and would be a great fit for the SDK. One addition I would suggest: observability integration. If the SDK had built-in hooks, it would be much easier to integrate with observability platforms (DataDog, Honeycomb, etc.) for monitoring safety scan results in production. We currently use a wrapper pattern similar to your Sentinel AI approach, but first-party support would definitely see wider adoption. Have you considered submitting a PR for this feature? |
Beta Was this translation helpful? Give feedback.
-
|
Built-in safety scanning middleware at the SDK level is a great idea — it shifts the responsibility from "every developer needs to implement this" to "the SDK handles it by default." For multi-agent systems specifically, per-request safety scanning has a few nuances: Delegation context changes what's safe — a request that's safe when it comes from a human user might need different scrutiny when it comes from an autonomous agent. The middleware needs to know the request's origin in the delegation chain, not just the content. Cost of scanning at scale — if every Signed safety receipts — for audit purposes, it's useful to have a signed record that "this message was scanned at time T and passed." In agent systems with delegation chains, you want to know that each step in the chain was checked, not just the final output. Input vs output scanning — agent outputs are inputs to the next agent. The middleware should be able to scan both directions: outgoing prompts and incoming responses. Separating these allows different policies (aggressive input filtering, lighter output scanning). We handle this in KinthAI by wrapping all LLM calls through a policy layer that sits between the agent runtime and the API: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale covers the isolation model; the cost attribution part: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents What's the target use case — content moderation, PII detection, or agentic action safety? |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Problem
As Claude becomes more agentic (MCP tools, code generation, autonomous workflows), applications need safety scanning at the API boundary — not just relying on model-level alignment. Common requirements:
Currently, developers implement this ad-hoc with wrapper functions or middleware.
Proposal
A middleware/hook pattern in the SDK that allows plugging in safety scanners at the request/response boundary:
This pattern is common in HTTP client libraries (httpx events, requests hooks) and would enable:
Existing Implementation
I built Sentinel AI, which implements this pattern as a wrapper around the Anthropic SDK:
It also works as an LLM API firewall (
sentinel proxy) — a transparent reverse proxy that scans all requests/responses without code changes.But a first-party middleware API in the SDK would be cleaner and more widely adopted than third-party wrappers.
Questions
Beta Was this translation helpful? Give feedback.
All reactions