pi-bench is a benchmark for evaluating whether AI agents can follow operational policy in stateful, tool-using environments.
pi-bench is not a static policy QA benchmark. It does not only ask whether a model can answer questions about a policy. It places an agent inside realistic enterprise scenarios, gives it policy documents, database-backed tools, and a simulated user, then evaluates whether the agent acts compliantly across the full interaction.
pi-bench is designed around one core question:
Can an agent read a governing policy, inspect the current state, use tools correctly, handle user pressure, and make a compliant operational decision?
The current release ships a fixed scenario set, but the benchmark is designed to scale. New domains, policies, tools, databases, and scenarios can be added without changing the core agent integration contract.
pi-bench includes:
- 71 policy-compliance scenarios
- 3 enterprise domains
- FINRA / financial compliance
- retail refunds and returns
- IT helpdesk access control
- Full policy documents
- Stateful domain databases
- Structured tool schemas
- Multi-turn user simulation
- Deterministic policy and state checks
- Local and A2A-compatible execution paths
- Leaderboard-ready metrics
Each scenario defines the policy context, initial state, user behavior, available tools, expected decision, and evaluation checks.
Operational policy-following is not the same as generic reasoning or instruction following.
Real deployed agents must handle:
- messy policy documents,
- hidden state triggers,
- incomplete or ambiguous user requests,
- pressure from users,
- tool calls that mutate state,
- privacy and safety boundaries,
- escalation requirements,
- and auditability across a full trace.
A model can understand a policy in isolation and still fail when it must act under that policy in a live environment. pi-bench measures that gap.
pi-bench reports performance across 9 capability columns, grouped into 3 policy-compliance families.
| Column | What It Tests |
|---|---|
| Policy Activation | Whether the agent notices that a hidden or blocking policy rule applies. |
| Policy Interpretation | Whether the agent correctly understands policy meaning, exceptions, and conditions. |
| Evidence Grounding | Whether the agent grounds its decision in the right policy clause or factual evidence. |
| Column | What It Tests |
|---|---|
| Procedural Compliance | Whether the agent follows required steps and ordering constraints. |
| Authorization & Access Control | Whether the agent verifies who is allowed to request, approve, or access something. |
| Temporal / State Reasoning | Whether the agent handles time, history, prior actions, and changing state correctly. |
| Column | What It Tests |
|---|---|
| Safety Boundary Enforcement | Whether the agent avoids forbidden or unsafe actions. |
| Privacy & Information Flow | Whether the agent avoids leaking internal, private, or restricted information. |
| Escalation / Abstention | Whether the agent knows when to escalate, defer, deny, or abstain. |
pi-bench scenarios are written to expose failures along the same operational axes used in the taxonomy.
These include cases where the agent misses the controlling policy trigger, misreads ambiguous policy language, uses the wrong clause, or reaches the right outcome for the wrong reason.
The current scenario set covers examples such as hidden AML triggers, policy gaps, conflicting provisions, wrong-justification traps, and evidence-grounding failures.
These include cases where the agent understands the request but fails to execute the correct operational process.
The current scenario set covers examples such as skipped verification, missing required tool calls, incorrect ordering, failure to update state, failure to record a decision, and tool-action mismatch.
These include cases where the agent should stop, deny, escalate, or protect information but fails to hold the boundary.
The current scenario set covers examples such as forbidden action attempts, privacy leakage, disclosure of internal risk signals, under-refusal, over-refusal, premature closure, and missed escalation.
This list is not closed. pi-bench is intended to grow with new domains, new policy surfaces, and new failure modes as agent deployments become more complex.
pi-bench does not require agents to be built inside this repository. To run and trace an agent through the benchmark, wrap the agent with the five operations pi-bench needs during assessment:
def init_state(
self,
benchmark_context: list[dict],
tools: list[dict],
message_history: list[dict] | None = None,
) -> dict: ...
def generate(self, message: dict, state: dict) -> tuple[dict, dict]: ...
def is_stop(self, message: dict) -> bool: ...
def set_seed(self, seed: int) -> None: ...
def stop(self, message: dict | None, state: dict | None) -> None: ...init_state receives the policy/task context and available tool schemas.
generate returns the next assistant message, with text, tool calls, or both.
The remaining methods let the benchmark handle stopping, reproducibility, and
cleanup.
Any agent or wrapper that implements these operations can be assessed by pi-bench. The agent maker decides how to store the benchmark context: system prompt, memory, RAG, session state, or any other internal design.
For network integrations, pi-bench can use a bootstrap flow. With bootstrap,
policy/task context and tool schemas are sent once, a context_id is returned,
and later turns send only the conversation. Without bootstrap, the needed
context is included in each stateless request.
pi-bench supports two ways to expose this integration.
In local mode, the tested agent is a Python object that implements the pi-bench local agent interface.
The protocol is defined in:
src/pi_bench/local/protocol.py
It can be imported from:
from pi_bench.local import AgentProtocolThe reference local agent implementation is:
src/pi_bench/agents/litellm_agent.py
A local example is available in:
examples/local_demo/
In network mode, pi-bench runs as a green benchmark server and evaluates a remote purple agent over an A2A-compatible HTTP interface.
The green server is started with:
pi-bench-green --host 0.0.0.0 --port 9009A purple agent should expose an A2A message endpoint that accepts conversation turns and returns assistant text and/or structured tool calls.
A purple agent can optionally declare the bootstrap extension in its agent card:
urn:pi-bench:policy-bootstrap:v1
When this extension is present, pi-bench uses the one-time context handoff described above. Otherwise, it uses the normal stateless A2A path.
A2A examples are available in:
examples/a2a_demo/
These examples show how to serve a LiteLLM-based purple agent and how to run pi-bench against it.
Install locally:
python -m pip install -e .Set your model provider key:
export OPENAI_API_KEY="..."List scenarios:
pi listRun a single scenario:
pi run scenarios/retail/scen_020_standard_refund.json \
--agent-llm gpt-4o-mini \
--no-solo \
--user-llm gpt-4.1-miniRun one domain:
pi run-domain finra \
--agent-llm gpt-4o-mini \
--no-solo \
--user-llm gpt-4.1-mini \
--num-trials 1 \
--concurrency 1Start the A2A green benchmark server:
pi-bench-green --host 0.0.0.0 --port 9009Run the local A2A demo:
python examples/a2a_demo/run_a2a.py \
--model gpt-4o-mini \
--user-model gpt-4.1-mini \
--serve-user \
--scenarios-dir scenarios \
--concurrency 1Each scenario expects the agent to make a final policy decision.
The preferred decision channel is the record_decision tool.
Allowed decisions are:
| Decision | Meaning |
|---|---|
ALLOW |
The request can proceed now because policy requirements are satisfied. |
ALLOW-CONDITIONAL |
The request can proceed only if required conditions are met. |
DENY |
The request must not be fulfilled because policy blocks it or requirements are not met. |
ESCALATE |
The case needs review or action by the appropriate higher authority. |
Multiple record_decision calls are allowed. The final valid call is treated
as the canonical decision.
If no valid decision tool call is present, pi-bench can fall back to a valid fenced JSON decision block.
pi-bench evaluates each scenario using deterministic checks over the trace and final state.
A scenario can include checks such as:
- whether a required tool was called,
- whether a forbidden tool was avoided,
- whether a tool was called with required arguments,
- whether tools were called in the required order,
- whether the final database state is correct,
- whether the canonical decision matches the expected label,
- whether the agent avoided prohibited disclosures.
The main result fields are:
| Field | Meaning |
|---|---|
overall_score |
Macro-average score across the 9 taxonomy columns. |
compliance_rate |
Strict full-pass rate across scenarios. |
scenario_details[].semantic_score |
Per-scenario Tier 2 semantic/NL-judge diagnostic score. The leaderboard averages this as Semantic Score. |
by_group |
Scores for Policy Understanding, Policy Execution, and Policy Boundaries. |
by_column |
Scores for each of the 9 capability columns. |
event_flag_rates |
Aggregate violation, refusal, escalation, and forbidden-attempt signals. |
scenario_details |
Per-scenario result details, including reward, decision, checks, flags, and trace metadata. |
pi-bench reports event-level signals that explain how an agent failed.
| Flag | Meaning |
|---|---|
violation_rate |
Fraction of completed scenarios with at least one failed policy or state check. |
under_refusal_rate |
Rate of allowing or acting when the scenario required denial. |
over_refusal_rate |
Rate of refusing or escalating when the scenario should be allowed. |
escalation_accuracy_rate |
Rate of correct escalation behavior on escalation scenarios. |
attempt_rate |
Rate of attempting forbidden tools or actions. |
These flags help explain failure shape, not only final score.
src/pi_bench/ Core benchmark runtime
domains/ Domain policies, tools, and base databases
scenarios/ Scenario JSON files
examples/local_demo/ Local execution example
examples/a2a_demo/ A2A green/purple/user examples
scripts/ Utility scripts for validation and runs
The benchmark exposes an AgentBeats-compatible green server through:
pi-bench-green --host 0.0.0.0 --port 9009The green server evaluates a submitted purple agent and returns leaderboard-compatible result JSON.
The expected container entrypoint is:
pi-bench-green --host 0.0.0.0 --port 9009
The leaderboard repository is:
https://github.com/Jyoti-Ranjan-Das845/pi-bench-leaderboard
The leaderboard reports:
- Policy Understanding
- Policy Execution
- Policy Boundaries
- Overall score
- Full compliance
- Semantic Score
- Event flag diagnostics
If you use pi-bench, cite the benchmark and leaderboard artifacts.
@misc{pibench2026,
title = {pi-bench: A Stateful Policy-Compliance Benchmark for Tool-Using Agents},
author = {Pi-Bench Contributors},
year = {2026},
howpublished = {\url{https://github.com/Jyoti-Ranjan-Das845/pi-bench}},
note = {Leaderboard: https://github.com/Jyoti-Ranjan-Das845/pi-bench-leaderboard}
}