GitHub - satyaborg/kensa: The open source agent evals harness.

Tell your coding agent to evaluate an agent. Get a working eval suite in minutes.

kensa is an open source eval harness for agent codebases. It gives coding agents an opinionated CLI and bundled skills to generate scenarios, run them in subprocesses, judge results, and report failures.

Installation

Skills + CLI (recommended)

npx skills add satyaborg/kensa
uv add kensa

Works for Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and similar coding agents.

Claude Code plugin

If you primarily use Claude Code, you can install it as a plugin:

/plugin marketplace add satyaborg/kensa
/plugin install kensa

Quickstart

Tell your coding agent:

evaluate this agent

That gives you the basic loop:

1. Your coding agent inspects the repo, sets up instrumentation, and writes evals.
2. It runs kensa to execute scenarios and capture traces.
3. Deterministic checks run first.
4. The LLM judge runs only when those checks pass.
5. Reports show what failed and why.
6. You review changes, approve fixes, and iterate.
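
The ordering of deterministic checks and the LLM judge in the loop above can be sketched as follows. This is a minimal illustration of the gating idea, not kensa's implementation: `Verdict`, `evaluate`, `non_empty`, and `fake_judge` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    stage: str   # "checks" or "judge": the stage that decided the result
    reason: str


def evaluate(
    output: str,
    checks: List[Callable[[str], Optional[str]]],
    judge: Callable[[str], bool],
) -> Verdict:
    """Run cheap deterministic checks first; call the judge only if all pass."""
    for check in checks:
        failure = check(output)  # each check returns None on success, else a reason
        if failure is not None:
            return Verdict(False, "checks", failure)
    ok = judge(output)
    return Verdict(ok, "judge", "judge approved" if ok else "judge rejected")


# Hypothetical check and judge, used only to show the short-circuit.
judge_calls: List[str] = []


def non_empty(output: str) -> Optional[str]:
    return None if output.strip() else "empty output"


def fake_judge(output: str) -> bool:
    judge_calls.append(output)
    return True


failed = evaluate("", [non_empty], fake_judge)     # stops at the check stage
passed = evaluate("P1", [non_empty], fake_judge)   # reaches the judge
```

In this sketch the judge is never called for the failing output, which is the property that keeps judge cost and latency off runs that already failed a deterministic check.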

If instrumentation is missing

Add instrument() before importing your LLM SDK:

from kensa import instrument

instrument()

If you use the bundled skills, your coding agent will usually add this for you.
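
How kensa hooks into an SDK is not described here, so the following is only a general illustration of why "patch first, import second" ordering can matter for instrumentation that wraps module attributes. Everything in it is a hypothetical stand-in: `fake_sdk`, `complete`, and `fake_instrument` are not part of kensa or any real SDK.

```python
import types
from typing import List

# Hypothetical stand-in for an LLM SDK module.
sdk = types.ModuleType("fake_sdk")


def _complete(prompt: str) -> str:
    return f"response to {prompt}"


sdk.complete = _complete

traced_prompts: List[str] = []


def fake_instrument() -> None:
    """Hypothetical instrumentation: wrap sdk.complete to record each call."""
    original = sdk.complete

    def traced(prompt: str) -> str:
        traced_prompts.append(prompt)
        return original(prompt)

    sdk.complete = traced


# A reference taken before instrumenting, like `from fake_sdk import complete`
# executed too early, still points at the unwrapped function.
early_ref = sdk.complete

fake_instrument()

# A reference taken afterwards sees the wrapped function.
late_ref = sdk.complete

early_ref("a")  # not recorded
late_ref("b")   # recorded
```

Under this assumption, only the call made through the late reference is traced, which is the kind of gap that calling `instrument()` first avoids.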

Provider extras

uv add "kensa[anthropic]"
uv add "kensa[openai]"
uv add "kensa[langchain]"
uv add "kensa[all]"

Core commands

| Command | What it does |
| --- | --- |
| `kensa init --blank` | Scaffold `.kensa/` without example content |
| `kensa doctor` | Check instrumentation, config, and environment readiness |
| `kensa eval` | Run + judge + report in one command |
| `kensa report` | Show the latest results in terminal, Markdown, JSON, or HTML |
| `kensa analyze` | Flag slow, expensive, flaky, or error-prone traces |

Manual workflow

If you want to author evals yourself:

kensa init --blank
kensa doctor

Scenarios live in .kensa/scenarios/*.yaml and point at your agent entrypoint with run_command.

id: classify_ticket
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}

checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }

criteria: |
  P1 is for outages or data loss affecting multiple users.

For complete examples, see examples/.
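
To make the scenario above concrete, here is a rough sketch of what the `{{input}}` substitution and the `output_matches` check amount to. The helper names `render_command` and `output_matches` are hypothetical, and shell-quoting the input and stripping whitespace before matching are assumptions, not documented kensa behavior.

```python
import re
import shlex


def render_command(template: str, user_input: str) -> str:
    """Substitute {{input}} with a shell-quoted value (assumed behavior)."""
    return template.replace("{{input}}", shlex.quote(user_input))


def output_matches(output: str, pattern: str) -> bool:
    """Return True if the regex matches the stripped agent output (assumed behavior)."""
    return re.search(pattern, output.strip()) is not None


command = render_command(
    "python agent.py {{input}}",
    "Our entire team can't log in. SSO has returned 502 since 7am.",
)

pattern = "^P[123]$"
ok = output_matches("P1\n", pattern)                   # bare priority label
too_wordy = output_matches("Priority: P1", pattern)    # extra text before the label
out_of_range = output_matches("P4", pattern)           # label outside P1..P3
```

Because the pattern is anchored with `^` and `$`, the agent must print only the priority label; any surrounding explanation fails the check before the judge is consulted.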

CI

- name: Run evals
  run: uv run kensa eval --format markdown

If your scenarios use only deterministic checks, CI does not need API keys. If any scenario uses criteria or the LLM judge, add the judge provider's API key to your CI secrets.
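
A small pre-flight script can catch a missing judge key before the eval step runs. The sketch below is an assumption-heavy illustration: it detects judged scenarios by looking for a top-level `criteria:` key with a regex rather than a YAML parser, and `JUDGE_API_KEY` is a placeholder name, not a variable kensa is known to read. Substitute whatever key your judge provider expects.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Mapping

# Matches a top-level `criteria:` key at the start of a line.
CRITERIA_KEY = re.compile(r"^criteria\s*:", re.MULTILINE)


def scenarios_needing_judge(scenario_dir: Path) -> List[str]:
    """Return the names of scenario files that declare top-level criteria."""
    return sorted(
        path.name
        for path in scenario_dir.glob("*.yaml")
        if CRITERIA_KEY.search(path.read_text(encoding="utf-8"))
    )


def missing_judge_key(scenario_dir: Path, env: Mapping[str, str], key_name: str) -> bool:
    """True when some scenario needs the judge but the key is unset or empty."""
    return bool(scenarios_needing_judge(scenario_dir)) and not env.get(key_name)


# Demo against a throwaway directory; in CI you would pass
# Path(".kensa/scenarios") and os.environ instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "a.yaml").write_text("id: a\nchecks: []\n", encoding="utf-8")
    (demo_dir / "b.yaml").write_text(
        "id: b\ncriteria: |\n  P1 is for outages.\n", encoding="utf-8"
    )
    needing = scenarios_needing_judge(demo_dir)
    blocked = missing_judge_key(demo_dir, {}, "JUDGE_API_KEY")
    ready = missing_judge_key(demo_dir, {"JUDGE_API_KEY": "set"}, "JUDGE_API_KEY")
```

Failing fast here gives a clearer CI error than a judge call that fails partway through a run.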

Need more?

Docs
examples/ has sample agents and scenarios
CONTRIBUTING.md covers local development
Homepage
Issues
MIT License

