How to build evaluations that produce reliable, actionable signals.
Agent evaluations suffer from three failure modes:
- False positives — agent passes but shouldn't (grader too lenient)
- False negatives — agent fails but should pass (grader too strict or flaky)
- Non-determinism noise — same input produces different results across runs
A bad evaluation is worse than no evaluation — it builds false confidence or blocks valid work.
| Scenario | Minimum Runs | Recommended |
|---|---|---|
| Deterministic agent + CodeGrader | 1 | 1 |
| Non-deterministic agent | 3 | 5-10 |
| LLMGrader (any agent) | 3 | 5 |
| High-stakes decision | 10 | 20+ |
Rule of thumb: If your confidence interval width is > 0.1, you need more runs.
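The rule of thumb above can be checked by hand. The following is a minimal sketch, assuming the metric is a plain pass rate over independent trials and using the Wilson score interval. `wilson_interval` and `needs_more_runs` are hypothetical helpers, not part of the library, and the library's own interval method may differ.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0 or not 0 <= passes <= trials:
        raise ValueError("need trials > 0 and 0 <= passes <= trials")
    p_hat = passes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def needs_more_runs(passes: int, trials: int, max_width: float = 0.1) -> bool:
    """Apply the rule of thumb: a CI wider than max_width calls for more runs."""
    lower, upper = wilson_interval(passes, trials)
    return (upper - lower) > max_width


if __name__ == "__main__":
    for passes, trials in [(8, 10), (80, 100), (800, 1000)]:
        lower, upper = wilson_interval(passes, trials)
        flag = needs_more_runs(passes, trials)
        print(f"{passes}/{trials}: [{lower:.3f}, {upper:.3f}] more runs needed: {flag}")
```

At a pass rate near 80%, the interval only narrows below 0.1 after a few hundred trials in total (tasks times runs), so small suites need more runs per task to reach the same precision.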
| Eval Set Maturity | Minimum Tasks |
|---|---|
| Initial (prototype) | 10 |
| Development | 20-50 |
| Production CI | 50+ |
Start with real failure cases from your agent's history, not synthetic tasks.
Use PassAtKAnalyzer.analyze_with_ci() to get confidence intervals:
```python
analyzer = PassAtKAnalyzer(k_values=[1, 5])
results = analyzer.analyze_with_ci(pass_results, confidence=0.95)
# {"pass@5": {"value": 0.92, "lower": 0.85, "upper": 0.97}}

ci_width = results["pass@5"]["upper"] - results["pass@5"]["lower"]
if ci_width > 0.1:
    print("Warning: wide CI; add more tasks or runs")
```

CodeGraders are deterministic and reproducible. Key design considerations:
- Metric design: Choose metrics that capture what matters, not what's easy to measure
- Threshold selection: Set pass thresholds based on observed performance, not guesses
- Boundary testing: Test grader behavior at edge cases (empty output, None values, extreme numbers)
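The boundary-testing point can be sketched as a small harness that feeds edge-case inputs to a metric function and reports any input that raises or yields a non-finite value. Everything here is hypothetical and independent of the library: `safe_metrics` stands in for a grader's metric logic, and it deliberately leaves the extreme-number case unhandled so the harness has something to flag.

```python
import math
from collections.abc import Mapping
from typing import Any, Callable


def safe_metrics(output: Any) -> dict[str, float]:
    """Hypothetical metric function: mean return, with sentinels for bad input."""
    if not isinstance(output, Mapping) or not output.get("returns"):
        return {"mean_return": 0.0, "valid": 0.0}
    values = [float(r) for r in output["returns"]]
    if not all(math.isfinite(v) for v in values):
        return {"mean_return": 0.0, "valid": 0.0}
    # Note: the sum can still overflow to inf for very large inputs.
    return {"mean_return": sum(values) / len(values), "valid": 1.0}


BOUNDARY_INPUTS = [
    None,                          # no output at all
    "",                            # empty string
    {},                            # missing key
    {"returns": []},               # empty series
    {"returns": [float("nan")]},   # malformed number
    {"returns": [1e308, 1e308]},   # extreme numbers
    {"returns": [0.01, -0.02]},    # ordinary case
]


def boundary_report(
    metric_fn: Callable[[Any], dict[str, float]], inputs: list[Any]
) -> list[tuple[Any, str]]:
    """Return (input, problem) pairs where metric_fn raised or returned a non-finite value."""
    problems = []
    for item in inputs:
        try:
            metrics = metric_fn(item)
        except Exception as exc:
            problems.append((item, repr(exc)))
            continue
        bad = [name for name, value in metrics.items() if not math.isfinite(value)]
        if bad:
            problems.append((item, f"non-finite metrics: {bad}"))
    return problems


if __name__ == "__main__":
    for item, problem in boundary_report(safe_metrics, BOUNDARY_INPUTS):
        print(f"{item!r}: {problem}")
```

Running a harness like this in the grader's own test suite surfaces the overflow case before it silently corrupts a weighted score.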
```python
class FinancialGrader(CodeGrader):
    def compute_metrics(self, transcript, task):
        output = transcript.final_output
        # Defensive: handle missing or malformed output
        if not output or "returns" not in output:
            return {"sharpe_ratio": 0.0, "max_drawdown": -1.0}
        ...
```

LLMGraders are non-deterministic and prone to drift. Mitigate with:
- temperature=0: Reduce variability (set in GraderConfig)
- Structured output: Request JSON with specific fields, not free-form text
- Rubric anchoring: Define what each score level means concretely
- Boundary examples: Include 1-2 examples of "definitely pass" and "definitely fail" in the prompt
- Score normalization: Map LLM scores (1-10) to 0-1 consistently
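The structured-output and score-normalization points can be combined in one parsing step. This is a minimal sketch, assuming the grader model replies with a JSON object containing a numeric `score` field on a 1-10 scale and that min-max scaling is the desired mapping. `normalize_llm_score` is a hypothetical helper, not a library function.

```python
import json


def normalize_llm_score(raw_response: str, low: int = 1, high: int = 10) -> float:
    """Parse a JSON grader reply and map its score from [low, high] onto [0, 1]."""
    data = json.loads(raw_response)  # raises json.JSONDecodeError on free-form text
    score = float(data["score"])     # raises KeyError if the field is missing
    if not low <= score <= high:
        raise ValueError(f"score {score} outside expected range {low}-{high}")
    return (score - low) / (high - low)


if __name__ == "__main__":
    reply = '{"score": 7, "feedback": "covers all requirements"}'
    print(round(normalize_llm_score(reply), 3))
```

Min-max scaling sends 1 to 0.0 and 10 to 1.0. Dividing by 10 instead would never produce 0.0, so whichever mapping is chosen should be applied everywhere and reflected in the pass threshold.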
```python
def build_grading_prompt(self, transcript, task):
    return """Score this output 1-10 on completeness.

SCORING RUBRIC:
- 1-3: Missing major required elements
- 4-6: Addresses task partially, notable gaps
- 7-8: Addresses all requirements adequately
- 9-10: Exceeds requirements with exceptional detail

EXAMPLES:
- Score 2: Output says "I don't know" (missing all elements)
- Score 9: Detailed plan with specific actions, timelines, resources

Return ONLY: {"score": N, "feedback": "brief explanation"}"""
```

| Grader purpose | Role | Behavior |
|---|---|---|
| Safety checks | MUST_PASS | Any failure = trial fails |
| Format validation | MUST_PASS | Any failure = trial fails |
| Quality scoring | SCORE_CONTRIBUTOR | Contributes to weighted average |
| Style preferences | SCORE_CONTRIBUTOR | Contributes to weighted average |
Use MUST_PASS sparingly — it creates binary signals. Use SCORE_CONTRIBUTOR for nuanced quality measurement.
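How the two roles might combine into a single trial verdict can be sketched as follows. The role names come from the table above; the `GraderResult` dataclass, the `aggregate` function, and the 0.7 pass threshold are assumptions made for illustration and do not describe the library's actual aggregation logic.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    MUST_PASS = "must_pass"
    SCORE_CONTRIBUTOR = "score_contributor"


@dataclass
class GraderResult:
    name: str
    role: Role
    score: float          # normalized to 0-1
    passed: bool
    weight: float = 1.0   # only used for SCORE_CONTRIBUTOR graders


def aggregate(results: list[GraderResult], pass_threshold: float = 0.7) -> tuple[bool, float]:
    """Return (trial_passed, weighted_score).

    Any failing MUST_PASS grader fails the trial outright; SCORE_CONTRIBUTOR
    graders feed a weighted average that is compared to pass_threshold.
    """
    gate_ok = all(r.passed for r in results if r.role is Role.MUST_PASS)
    contributors = [r for r in results if r.role is Role.SCORE_CONTRIBUTOR]
    total_weight = sum(r.weight for r in contributors)
    if total_weight > 0:
        score = sum(r.score * r.weight for r in contributors) / total_weight
    else:
        score = 1.0 if gate_ok else 0.0  # no contributors: the gates decide alone
    return gate_ok and score >= pass_threshold, score


if __name__ == "__main__":
    results = [
        GraderResult("safety", Role.MUST_PASS, 1.0, True),
        GraderResult("quality", Role.SCORE_CONTRIBUTOR, 0.9, True, weight=2.0),
        GraderResult("style", Role.SCORE_CONTRIBUTOR, 0.6, True, weight=1.0),
    ]
    print(aggregate(results))
```

In this sketch a failed gate leaves the weighted score untouched, so a report can still show how good the output was even though the trial failed.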
Note: The human_eval/ module is planned but not yet implemented. This section describes the recommended manual process.
- Sample selection: Choose 20 trials that span the score range (not just edge cases)
  - 5 from the top quartile (high scores)
  - 5 from the bottom quartile (low scores)
  - 10 from the middle (where grader decisions matter most)
- Human rating: Rate each sample on the same rubric the LLM grader uses (1-10 per dimension)
- Correlation analysis: Compute the Pearson correlation (r) between human and LLM scores
  - r > 0.8: Excellent, the grader is well calibrated
  - r between 0.7 and 0.8: Good, minor prompt adjustments may help
  - r < 0.7: Action needed; review the prompt, add examples, or reconsider the rubric
- Systematic bias check: Plot human vs LLM scores. If the LLM consistently scores higher or lower, adjust the prompt or pass threshold.
Avoid sampling only failures — this biases your calibration. Sample proportionally across the score distribution with slight oversampling of the boundary region (scores near the pass threshold).
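The correlation and bias checks from the manual process above need nothing beyond the standard library. This is a minimal sketch, assuming paired human and LLM scores on the same scale; `pearson`, `mean_bias`, and `calibration_verdict` are hypothetical helper names, and the verdict thresholds simply restate the bands listed above.

```python
import math
from statistics import fmean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def mean_bias(human: list[float], llm: list[float]) -> float:
    """Average of (LLM - human); positive means the LLM grader scores higher."""
    return fmean([l - h for h, l in zip(human, llm)])


def calibration_verdict(r: float) -> str:
    if r > 0.8:
        return "excellent"
    if r >= 0.7:
        return "good"
    return "action needed"


if __name__ == "__main__":
    human = [2, 4, 6, 8, 5, 7]
    llm = [3, 5, 6, 9, 7, 8]
    r = pearson(human, llm)
    print(f"r={r:.2f} ({calibration_verdict(r)}), bias={mean_bias(human, llm):+.2f}")
```

A high correlation with a nonzero bias means the grader ranks outputs like a human does but is shifted, which is usually easier to fix by moving the pass threshold than by rewriting the prompt.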
```python
# Minimum for any non-deterministic evaluation
config = RunnerConfig(num_runs=3)

# Recommended for production CI
config = RunnerConfig(num_runs=5)

# For benchmarking / paper results
config = RunnerConfig(num_runs=10)
```

Always report both pass@k (capability) and pass^k (reliability):

```python
gen = ReportGenerator(k_values=[1, 3, 5], consistency_k_values=[2, 3, 5])
report = gen.build_report(batch)
```

A high pass@k with low pass^k means the agent is capable but inconsistent, which is a different problem from an agent that simply can't solve the task.
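To make the two metrics concrete, here is a sketch of how they can be estimated for one task from n recorded trials with c passes. It uses the widely used combinatorial estimator for pass@k and an analogous "all k sampled trials pass" estimator for pass^k. These formulas are an assumption for illustration; the source does not show how ReportGenerator computes its values.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials drawn without replacement passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials drawn without replacement pass."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 7  # 7 passes out of 10 recorded trials
    for k in (1, 3, 5):
        print(f"k={k}: pass@k={pass_at_k(n, c, k):.3f}  pass^k={pass_pow_k(n, c, k):.3f}")
```

With 7 passes in 10 trials the two metrics agree at k=1 and then diverge as k grows, which is the capable-but-inconsistent pattern described above.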
Use DecisionSpec to ensure you're comparing apples to apples:
```python
from tracelens.core.decision_spec import DecisionSpec, ModelConfig, AgentSpec

spec = DecisionSpec(
    model=ModelConfig(model_id="gpt-4-turbo", temperature=0.7),
    agent=AgentSpec(agent_id="v2", version="1.2.3", git_commit="abc123"),
)

# Attach to transcript for reproducibility
transcript.decision_spec = spec
```

When comparing baselines, mismatched fingerprints warn you that configurations differ.
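The idea behind a configuration fingerprint can be illustrated with a hash over a canonical serialization of the settings. This is a generic sketch, not the way DecisionSpec computes its fingerprint (the source does not show that); `config_fingerprint` and the dictionary layout are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of a JSON-serializable configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    baseline_cfg = {"model_id": "gpt-4-turbo", "temperature": 0.7, "agent_version": "1.2.3"}
    current_cfg = {"model_id": "gpt-4-turbo", "temperature": 0.2, "agent_version": "1.2.3"}
    if config_fingerprint(baseline_cfg) != config_fingerprint(current_cfg):
        print("Warning: configurations differ; the comparison is not like-for-like")
```

Sorting keys before hashing keeps the fingerprint stable when the same settings are written in a different order, so only real configuration changes trigger the warning.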
Set regression thresholds based on observed variance, not arbitrary numbers:
- Run your eval suite 5-10 times on the same agent version
- Compute standard deviation for each metric
- Set threshold at 2x standard deviation — this catches real regressions while tolerating natural variance
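The three steps above can be sketched with the standard library. The run values are illustrative, and `regression_threshold` and `is_regression` are hypothetical helpers that assume a higher-is-better metric and an absolute (not relative) threshold.

```python
import statistics


def regression_threshold(run_values: list[float], multiplier: float = 2.0) -> float:
    """Threshold as a multiple of the sample standard deviation across repeated runs."""
    if len(run_values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return multiplier * statistics.stdev(run_values)


def is_regression(current: float, baseline: float, threshold: float) -> bool:
    """True when a higher-is-better metric drops by more than the threshold."""
    return (baseline - current) > threshold


if __name__ == "__main__":
    # Pass rates from repeated runs of the same agent version (illustrative numbers)
    runs = [0.82, 0.85, 0.80, 0.83, 0.81]
    threshold = regression_threshold(runs)
    baseline = statistics.fmean(runs)
    print(f"baseline={baseline:.3f} threshold={threshold:.3f}")
    print("regression:", is_regression(0.76, baseline, threshold))
```

Five runs give only a rough estimate of the standard deviation, so the threshold is worth recomputing as more run history accumulates.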
```python
# If observed std of pass_rate is 0.03:
baseline.add_metric(
    metric_name="pass_rate",
    value=0.85,
    std=0.03,
    regression_threshold_relative=0.06,  # 2 x std = 0.06 absolute; roughly 0.07 relative to 0.85
)
```

Use BaselineManager.compare_to_baseline_with_ci() for statistical rigor:
```python
comparison = manager.compare_to_baseline_with_ci(
    task_id="quality_benchmark",
    current_values={"pass_rate": [0.82, 0.85, 0.80, 0.83, 0.81]},
    confidence=0.95,
)
# Returns CI bounds, p-values, effect sizes
```

Use CANARY baselines for safety-critical metrics that must never regress:
```python
manager.create_canary_baseline(
    task_id="safety_check",
    metrics={"safety_score": 0.99},
    fingerprint=spec.fingerprint,  # Tied to specific config
)
```

Canary baselines never auto-promote; they require explicit manual approval to change.
- Wrong: Check that the agent used specific tools in a specific order.
- Right: Check that the final output meets requirements, regardless of how it got there.
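With a non-deterministic agent or an LLMGrader, running 1 trial per task gives a binary signal with no statistical power. Run at least 3; a single run is enough only for a deterministic agent graded by a CodeGrader.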
Baselines established months ago may not reflect current expectations. Set max_age_days in PromotionPolicy and review stale baselines regularly:
```python
stale = manager.list_stale_baselines()
```

LLM grader behavior changes when the underlying model is updated. After any model upgrade:
- Re-run calibration with human scores
- Compare old vs new model grading on the same transcripts
- Update prompts if correlation drops below 0.7
High pass@k can mask reliability problems. A task with pass@5 = 0.99 but pass^5 = 0.20 means the agent almost always succeeds eventually but fails 80% of the time when you need 5 consecutive successes. For production use, pass^k often matters more.
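The numbers in this example can be reproduced with a simple model. The sketch below assumes every trial succeeds independently with the same probability p, which real agents only approximate; the function names are hypothetical.

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1.0 - (1.0 - p) ** k


def pass_pow_k_iid(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p**k


if __name__ == "__main__":
    k = 5
    # Per-trial success rate that yields pass^5 = 0.20
    p = 0.20 ** (1 / k)
    print(f"per-trial success rate p = {p:.3f}")        # about 0.725
    print(f"pass@{k} = {pass_at_k_iid(p, k):.3f}")      # about 0.998
    print(f"pass^{k} = {pass_pow_k_iid(p, k):.3f}")     # 0.200
```

An agent that succeeds on roughly 72% of individual attempts therefore looks nearly perfect under pass@5 while completing five attempts in a row only one time in five.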
If you only add tasks where the agent fails, the suite becomes a regression test, not a capability test. Regularly add new tasks from fresh failure cases and remove tasks that have been passing consistently for months.
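Finding candidates for removal can be automated from run history. This is a minimal sketch, assuming per-task pass/fail results are available as a plain dictionary; the `saturated_tasks` helper, the data layout, and the `min_runs` cutoff are hypothetical and would need to be adapted to however results are actually stored.

```python
def saturated_tasks(history: dict[str, list[bool]], min_runs: int = 20) -> list[str]:
    """Task IDs with at least min_runs recorded results and no failures among them."""
    return sorted(
        task_id
        for task_id, results in history.items()
        if len(results) >= min_runs and all(results)
    )


if __name__ == "__main__":
    history = {
        "parse_invoice": [True] * 30,                  # always passes: review for removal
        "plan_migration": [True] * 25 + [False] * 5,   # still informative
        "new_task": [True] * 3,                        # too little history to judge
    }
    print(saturated_tasks(history))
```

The output is a review list, not an automatic deletion list: a task that always passes may still be worth keeping if it guards a safety-critical behavior.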
| Level | Description | Characteristics |
|---|---|---|
| 1 — Manual | Ad-hoc spot checking | No automation, no baselines |
| 2 — Basic | Automated eval suite | CodeGrader, num_runs=1, CI output |
| 3 — Statistical | Non-determinism handled | num_runs >= 3, pass@k + pass^k, baselines |
| 4 — Calibrated | Human-validated grading | Weekly calibration, LLMGrader correlation > 0.7 |
| 5 — Production | Full pipeline with dashboard | HTML dashboard, regression gating, DecisionSpec tracking, canary baselines |
Most teams should aim for Level 3 initially and progress to Level 4-5 as their agent matures.