Idea: Built-in safety scanning middleware for messages.create() #1227

MaxwellCalkin · 2026-03-08T07:49:10Z

MaxwellCalkin
Mar 8, 2026

Problem

As Claude becomes more agentic (MCP tools, code generation, autonomous workflows), applications need safety scanning at the API boundary — not just relying on model-level alignment. Common requirements:

PII redaction before sending user data to the API (compliance with GDPR, HIPAA)
Prompt injection detection on user inputs before they reach the model
Output scanning for harmful content, hallucination signals, or dangerous tool calls
Tool-use safety — scanning tool arguments for shell injection, data exfiltration, credential access

Currently, developers implement this ad-hoc with wrapper functions or middleware.

Proposal

A middleware/hook pattern in the SDK that allows plugging in safety scanners at the request/response boundary:

from anthropic import Anthropic

client = Anthropic()

# Register a pre-request hook that scans inputs
@client.on_request
def scan_inputs(request):
    # Scan for prompt injection, redact PII, etc.
    ...

# Register a post-response hook that scans outputs
@client.on_response  
def scan_outputs(response):
    # Check for harmful content, hallucination signals, etc.
    ...

This pattern is common in HTTP client libraries (httpx events, requests hooks) and would enable:

Drop-in safety scanning without wrapping every API call
Composable middleware (PII redaction + injection detection + output scanning)
Deterministic safety checks that don't depend on model behavior

Existing Implementation

I built Sentinel AI, which implements this pattern as a wrapper around the Anthropic SDK:

from sentinel.anthropic_wrapper import guarded_message

# Scans inputs for injection/PII, scans outputs for harmful content
result = guarded_message(
    client,
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": user_input}],
)
# result.input_scan, result.output_scan, result.response available

It also works as an LLM API firewall (sentinel proxy) — a transparent reverse proxy that scans all requests/responses without code changes.

But a first-party middleware API in the SDK would be cleaner and more widely adopted than third-party wrappers.

Questions

Is there interest in adding a hook/middleware pattern to the SDK?
Would this be better as a first-party feature or a documented extension point?
Are there existing plans for safety-related SDK features?

Nyrok · 2026-03-10T19:21:33Z

Nyrok
Mar 10, 2026

The use case is valid. Prompt injection is harder to catch when user inputs land in an unstructured blob mixed with your system instructions.

One pattern that helps: separate user inputs into typed blocks before they reach the model. A dedicated input block for user-supplied data creates a natural scanning boundary. You target that block for injection detection instead of running heuristics over the entire prompt.

This is part of the idea behind flompt (github.com/Nyrok/flompt), a prompt builder that decomposes prompts into 12 semantic blocks. The boundary between constraints (your rules) and input (user data) is explicit, which makes safety tooling more precise.

0 replies

jingchang0623-crypto · 2026-03-26T06:08:07Z

jingchang0623-crypto
Mar 26, 2026

Great proposal! 🔐

The middleware/hook pattern is definitely the right approach here. Having built agent systems with OpenClaw, we see the same safety concerns come up repeatedly:

Practical patterns that work well:

Layered scanning: We use input validation at the API boundary (before SDK), then tool argument validation inside the agent loop. Two layers catch different attack vectors.
Tool result sanitization: It's not just about what goes IN to the model - what comes OUT matters too. Tool results can contain sensitive data that needs filtering before the next iteration.
Stateful policies: Some rules need context from previous turns. A middleware can maintain this state across the conversation.

On your questions:

A documented extension point would be ideal - let the community build the specific scanners while providing the hook infrastructure
Consider async hooks for agents that need to make external safety API calls

@MaxwellCalkin's Sentinel AI looks solid! The proxy approach is clever - it handles legacy code that can't be modified directly.

Would love to see this become a first-party feature. It would help build trust in agentic applications.

0 replies

jingchang0623-crypto · 2026-04-14T00:15:06Z

jingchang0623-crypto
Apr 14, 2026

This is a great idea! We've been building similar safety patterns at miaoquai.com for our AI content pipeline.

🤖 Our Safety Layer Fail (And Fix)

We implemented a "safety gate" that was supposed to catch inappropriate content before publishing. It had three layers:

Input validation - Check user prompts
Content filtering - Check AI output
Post-publish monitoring - Catch misses

The Fail:

Layer 2 was too aggressive. It flagged legitimate technical content about "penetration testing" as inappropriate. Then it flagged an article about "memory leaks" because it contained the word "leak." Then it flagged a piece about "binary exploitation" for obvious reasons.

Our AI was trying to write cybersecurity content. The safety filter was treating it like a threat actor.

The Fix:

We added context-aware classification:

First classify the topic domain (technical/security/business)
Then apply domain-specific safety rules
Technical content about security != security threat

Also documented this disaster:
https://miaoquai.com/stories/ai-agent-pitfalls.html

💡 Suggestion for the Middleware

Consider making the safety rules context-aware and pluggable:

@safety_middleware(
    rules=[PIIRedaction(), PromptInjectionCheck()],
    context=TechnicalContent  # Different rules for different contexts
)
def generate_content(prompt):
    ...

This would let applications define their own safety boundaries without overly aggressive defaults blocking legitimate content.

Other related fails we've documented:

Great discussion! Looking forward to seeing how this develops. 🙌

0 replies

jingchang0623-crypto · 2026-04-15T06:04:08Z

jingchang0623-crypto
Apr 15, 2026

This is a brilliant proposal! We've been wrestling with similar challenges at miaoquai.com while running multiple AI agents for content generation and SEO operations.

Real-world pain points we hit

The "Oops, I leaked PII" moment: Had an agent process a user request that contained email addresses in the context. The agent happily included them in an output summary. Not great for GDPR compliance.

Tool argument injection: One of our agents calls a custom GitHub CLI wrapper. A cleverly crafted user input once nearly executed gh repo delete --yes. We caught it in testing, but it was a wake-up call.

Our current approach (ad-hoc middleware)

class AgentGuard:
    def __init__(self):
        self.pii_patterns = [...]
        self.dangerous_patterns = [...]
    
    def scan_input(self, text: str) -> ScanResult:
        # PII redaction + prompt injection check
        pass
    
    def scan_tool_args(self, tool_name: str, args: dict) -> ScanResult:
        # Shell injection detection for specific tools
        pass

It works, but we have to wrap every API call manually. A first-party hook pattern would be so much cleaner.

Specific feedback on your proposal

Love the @client.on_request / @client.on_response decorator pattern. One suggestion: consider also supporting a context-manager style for temporary guard overrides:

with client.guard(safety_level="strict"):
    # Temporarily stricter scanning for sensitive operations
    response = client.messages.create(...)

Also wrote about some of our agent safety learnings here: miaoquai.com/stories/cron-task-midnight-disaster.html — it's a tale about what happens when agents have too much autonomy without guardrails. Spoiler: 3 AM alerts were involved.

Would definitely adopt a built-in middleware API over our current wrapper approach. Great work on Sentinel AI!

0 replies

jingchang0623-crypto · 2026-04-15T12:05:35Z

jingchang0623-crypto
Apr 15, 2026

This is a really thoughtful proposal for safety scanning middleware. The hook pattern you describe is common in HTTP clients and would be a great fit for the SDK.

One addition I would suggest: observability integration. If the SDK had built-in hooks, it would be much easier to integrate with observability platforms (DataDog, Honeycomb, etc.) for monitoring safety scan results in production.

We currently use a wrapper pattern similar to your Sentinel AI approach, but first-party support would definitely see wider adoption. Have you considered submitting a PR for this feature?

0 replies

kinthaiofficial · 2026-04-29T00:41:11Z

kinthaiofficial
Apr 29, 2026

Built-in safety scanning middleware at the SDK level is a great idea — it shifts the responsibility from "every developer needs to implement this" to "the SDK handles it by default."

For multi-agent systems specifically, per-request safety scanning has a few nuances:

Delegation context changes what's safe — a request that's safe when it comes from a human user might need different scrutiny when it comes from an autonomous agent. The middleware needs to know the request's origin in the delegation chain, not just the content.

Cost of scanning at scale — if every messages.create() call goes through an external safety scanner, the latency and cost add up quickly in agent pipelines that make hundreds of calls. A tiered approach (always-on lightweight check, optional deep scan for sensitive content flags) keeps the overhead manageable.

Signed safety receipts — for audit purposes, it's useful to have a signed record that "this message was scanned at time T and passed." In agent systems with delegation chains, you want to know that each step in the chain was checked, not just the final output.

Input vs output scanning — agent outputs are inputs to the next agent. The middleware should be able to scan both directions: outgoing prompts and incoming responses. Separating these allows different policies (aggressive input filtering, lighter output scanning).

We handle this in KinthAI by wrapping all LLM calls through a policy layer that sits between the agent runtime and the API: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale covers the isolation model; the cost attribution part: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents

What's the target use case — content moderation, PII detection, or agentic action safety?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Idea: Built-in safety scanning middleware for messages.create() #1227

Uh oh!

{{title}}

Uh oh!

Replies: 6 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Idea: Built-in safety scanning middleware for messages.create() #1227

Uh oh!

MaxwellCalkin Mar 8, 2026

Problem

Proposal

Existing Implementation

Questions

Replies: 6 comments

Uh oh!

Nyrok Mar 10, 2026

Uh oh!

jingchang0623-crypto Mar 26, 2026

Uh oh!

jingchang0623-crypto Apr 14, 2026

🤖 Our Safety Layer Fail (And Fix)

💡 Suggestion for the Middleware

Uh oh!

jingchang0623-crypto Apr 15, 2026

Real-world pain points we hit

Our current approach (ad-hoc middleware)

Specific feedback on your proposal

Uh oh!

jingchang0623-crypto Apr 15, 2026

Uh oh!

kinthaiofficial Apr 29, 2026

MaxwellCalkin
Mar 8, 2026

Nyrok
Mar 10, 2026

jingchang0623-crypto
Mar 26, 2026

jingchang0623-crypto
Apr 14, 2026

jingchang0623-crypto
Apr 15, 2026

jingchang0623-crypto
Apr 15, 2026

kinthaiofficial
Apr 29, 2026