Agent Eval Harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

⭐ Star this repo to bookmark — fresh data every 15 minutes

English · 中文 · 日本語 · 한국어 · Español · Português

💡 What is this?

A standardized benchmark suite that runs coding agents against live, real-world GitHub issues with reproduction steps. Unlike static academic benchmarks, it outputs a weekly-updated public leaderboard, enabling developers to compare agents like OpenCode, Codex, and Claude Code in realistic scenarios.

This list is auto-updated every 15 minutes by a GitHub Actions cron. Each commit reflects a real change in the upstream data source — new items added, expired items removed — so you can rely on what you see being current.

📋 Current Items

⏰ Last updated: 2026-06-16 09:45 UTC

Data source: GitHub Search API

The table below is rewritten on every cron tick.

#	Name	⭐	Lang	Updated	Description
1	truera/trulens	3384	Python	2026-06-16	Evaluation and Tracking for LLM Experiments and AI Agents
2	jedobe/skill-evaluator	0	Python	2026-06-16	Score any Claude Code skill against a research-backed rubric derived from the top 9 most-starred skill repos on GitHub
3	saddled-panicattack529/idea-evaluation-pipeline	0	—	2026-06-16	Streamline research idea evaluation for finance and economics to reach top journal quality using an iterative, AI-assist
4	Kondwani10/Origin-Continuum	0	—	2026-06-16	🌐 Define and explore the Origin ↔ Continuum framework, ensuring proper attribution and continuity in dependency relation
5	Kamixon131/claude-config	1	—	2026-06-16	⚙️ Enhance Claude Code with a powerful configuration framework that features specialized agents and workflows for effici
6	Sans-cell-art/-Project-Phoenix-The-E-Waste-Supercomputer-	0	—	2026-06-16	♻️ Transform e-waste into a powerful, low-cost cloud operating system, unlocking computing potential and promoting resou
7	bhavya7995/AI_governance	1	PowerShell	2026-06-16	🤖 Streamline AI-assisted development with a governance kit for rules, enforcement, and decision-making, ensuring speed a
8	Phinchanbora/llm-evaluation	0	Python	2026-06-16	🎯 Benchmark LLMs effectively with over 10 tests and 108,000 real questions to assess model performance and enhance AI ev
9	penpoen/llm-SugarScape	1	Python	2026-06-16	🌐 Explore AI behaviors in a Sugarscape simulation, revealing insights into cooperation and survival instincts using Grok
10	Arize-ai/phoenix	10162	Python	2026-06-16	AI Observability & Evaluation
11	promptfoo/promptfoo	22262	TypeScript	2026-06-16	Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C
12	multivon-ai/multivon-eval	7	Python	2026-06-16	Practical LLM evaluation for teams that ship to production. Deterministic + LLM-as-judge evaluators, dataset support, CI
13	verifywise-ai/verifywise	306	TypeScript	2026-06-16	Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI framewo
14	IonDen/mlx-quant-fidelity	0	Python	2026-06-15	Measure MLX quantization quality loss — KL divergence, perplexity, top-token agreement for KV cache and weights
15	jeremylongshore/j-rig-skill-binary-eval	0	TypeScript	2026-06-15	Binary-criteria evaluation harness for Claude skills with planned extension to plugins, agents, and MCP servers. Score e
16	anejakartik/evalstack	0	Python	2026-06-15	Open-source LLM evaluation framework — drop-in SDK + CI plugin. LLM-as-judge, regression detection, free + self-hostable
17	Giskard-AI/giskard-oss	5433	Python	2026-06-15	🐢 Open-Source Evaluation & Testing library for LLM Agents
18	thewonderofyou777z-dot/tjoe-reviewkit	0	Python	2026-06-15	TjoeReviewKit: a local, offline workflow-review checker by tjoe; it does not run tasks, go online, take over tool calls, or collect production logs
19	tushariitr-19/assay	2	Go	2026-06-15	Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.
20	ALEX-nlp/OpenSkillEval	11	Python	2026-06-15	OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
21	ahwurm/localshift	3	Python	2026-06-15	Migrate headless Claude/AI workloads to local LLMs with a derived, per-workload quality eval — cron job in, zero-margina
22	mpuodziukas-labs/eval-harness-template	0	Python	2026-06-14	Eval harness template for LLM systems: golden regression, LLM-as-judge, invariants
23	homemade-software-inc/completion-kit	1	Ruby	2026-06-15	Your prompts need tests too. Run prompts against real datasets, score outputs with LLM judges, version everything, and c
24	mizcausevic-dev/agent-eval-arena	0	TypeScript	2026-06-11	Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions,
25	ejentum/eval	3	Python	2026-06-11	A/B evaluate any LLM task with and without Ejentum cognitive injection. n8n workflow + TypeScript module.
26	NoesisVision/nasde-toolkit	10	Python	2026-06-10	CLI for benchmarks & evals of AI coding agents — on tasks you already understand, using your Claude / Codex / Gemini ind
27	akanjilal-work/agent-eval-harness	0	Python	2026-06-10	A lightweight harness to test agent behaviour (tool-call correctness, injection refusal, cost ceilings) before deploymen
28	karlmehta/trustmodel-mcp	0	TypeScript	2026-06-10	TrustModel MCP Server — trust evaluation, red-team & governance for AI agents via the Model Context Protocol. Public can
29	reaatech/agent-eval-harness	0	TypeScript	2026-06-15	End-to-end agent evaluation — trajectory eval, tool-use correctness, cost-per-task, latency budgets, regression suites w
30	alyssadata/continuity-keys	1	—	2026-06-08	Continuity Keys: tests for “same someone” returns. Behavioral identity consistency under pressure. Origin (Alyssa Solen)
31	melody-ling-L/eval-resume	0	HTML	2026-06-04	The first Chinese-language LLM benchmark focused on "resume-rewriting honesty": 20 real anonymized resumes × 3 models × 4 scoring dimensions
32	reaatech/classifier-evals	0	TypeScript	2026-06-10	Offline classifier evaluation harness — dataset loader, confusion matrices, LLM-as-judge with cost accounting, regressio
33	reaatech/rag-eval-pack	0	TypeScript	2026-06-15	RAG evaluation toolkit — faithfulness, answer relevance, context precision/recall, cost accounting, CI gates. Pairs with
34	Juanllenato/llm-eval-harness	0	Python	2026-06-03	A small, production-minded evaluation and observability harness for LLM/RAG features. Runs offline or live, gates CI on
35	Victor-David-Medina/llm-eval-harness	0	Python	2026-06-03	LLM evaluation harness that gates quality in CI: golden datasets, regression detection, grounding and faithfulness check
36	harnexa/nexa-gauge	38	Python	2026-06-15	An graph-eval framework for LLM's
37	thestio/thest-eval	0	Python	2026-06-02	The CI regression gate and governance-evidence layer for LLM systems — zero-dependency, vendor-neutral, offline.
38	pdxlab/trustmodel-mcp-server	0	TypeScript	2026-06-16	TrustModel MCP Server — trust evaluation, red-team, and governance for AI agents via the Model Context Protocol. npm: @t
39	monkeyin92/voice-agent-testops	0	TypeScript	2026-06-01	Regression testing for voice agents: scripted conversations, safety assertions, CI-ready reports.
40	fastxyz/skill-optimizer	65	TypeScript	2026-05-28	Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs
41	ajmeese7/local-llms	1	Python	2026-05-27	Use local Large Language Models for production use cases, and perform benchmarking for task-specific performance evaluat
42	rogue-socket/focusgroup	0	Python	2026-05-27	Persona-driven dynamic testing for conversational AI products. Focus groups for your agents.
43	chquandogong/mission-spec	0	TypeScript	2026-06-13	Mission Spec: a task contract layer for AI agent workflows
44	sanya2025/edututor-eval	0	Python	2026-05-21	A lightweight evaluation framework for AI tutoring responses, built for education-focused LLM systems
45	Alexanderk30/context-override-resistance	0	Python	2026-05-19	RL-style eval measuring intent/action divergence in frontier agents: model acknowledges a correction, then acts on the s
46	melody-ling-L/judgebuddy	0	HTML	2026-05-19	Single-file labeling tool for LLM-as-judge calibration. Three-pane comparison + multi-dim scoring. Zero deployment.
47	GiuseppeSp/n8n-customer-interview-synthesizer	0	—	2026-05-19	Multi-agent customer-interview synthesis pipeline in n8n with LLM-as-judge eval, Slack human-in-the-loop approval, and d
48	gmitt98/fieldtest	0	Python	2026-05-16	LLM evaluation framework — define what correct, well-formed, and safe means before you measure
49	verifywise-ai/plugin-marketplace	3	TypeScript	2026-05-15	VerifyWise AI Governance Plugin Marketplace
50	AI-QL/tuui	1148	TypeScript	2026-05-14	A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context
51	prompt-foundry/typescript-sdk	6	TypeScript	2026-05-13	The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
52	prompt-foundry/python-sdk	8	Python	2026-05-13	The prompt engineering, prompt management, and prompt evaluation tool for Python
53	Ruthwik-Data/mechanictrust	0	—	2026-05-11	AI product case study for trust, pricing transparency, and explainable diagnosis in auto repair.
54	SAY-5/eval-observability	0	Python	2026-05-10	Python LLM eval framework with full OTel tracing, structured logs, and daily Welch's-t-test regression detection persist
55	Ruthwik-Data/finrag-eval	0	Python	2026-05-10	RAG eval pipeline on Apple's FY 2024 10-K — found confident hallucinations, filed a metric-level bug in DeepEval, and bu
56	Ruthwik-Data/self-improving-prompt-agent	0	Python	2026-05-10	Prompt optimization loop that improves prompts through iterative mutation and LLM-as-judge evaluation. Score went 0.10 →
57	SAY-5/genai-eval	0	Python	2026-05-07	Multilingual GenAI evaluation service across 5 task types and 3 languages, with regression-trend dashboard
58	HumphreySun98/repoagentbench	32	Python	2026-04-30	SWE-bench for your codebase — mine your merged PRs into local, contamination-free coding-agent benchmarks. Adapters: cla
59	YagneshKhamar/phasio	0	TypeScript	2026-04-29	Jest-style testing for LLM prompts. Version prompts, run evals across OpenAI and Anthropic, catch regressions in CI.
60	lehigh-university-libraries/htr	2	Go	2026-06-03	Handwritten Text Recognition llm eval tool
61	JSLEEKR/evaltrack	0	TypeScript	2026-04-24	Local-first regression and trend CLI for promptfoo eval histories — the git log + git diff for LLM eval outputs.
62	izam-mohammed/ragrank	47	Python	2026-04-21	🎯 Your free LLM evaluation toolkit helps you assess the accuracy of facts, how well it understands context, its tone, an
63	arthursoares/openclaw-llm-bench	2	Python	2026-04-11	A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-j
64	YuanyangLiNEU/mini-claude	0	TypeScript	2026-04-11	A minimal Claude Code built from scratch — agent loop, tool calling, web search, permissions, and a black-box LLM eval h
65	webrenew/models-dilemma	4	TypeScript	2026-04-08	The Prisoner's Dilemma played by LLMs
66	AdirAmsalem/openclaw-eval	0	Python	2026-03-31	Compare OpenClaw setups against the same scenario suite. Run prompts across multiple configurations, capture answers, la
67	Data-ScienceTech/forcefield	1	Python	2026-03-30	ForceField Python SDK -- AI security in 3 lines of code. Prompt injection detection, PII redaction, security evals, tool
68	klausners/prompt-optimizer	0	TypeScript	2026-03-26	Config-driven CLI that runs promptfoo evals, identifies low-scoring prompts, rewrites them via Claude API, and re-evalua
69	Aysnc-Labs/llm-eval	1	PHP	2026-03-20	A PHP package for evaluating LLM outputs. Test your prompts, validate responses, and ensure your AI features work correc
70	asarnaout/veritail	6	Python	2026-03-15	LLM-as-a-Judge evaluation platform for ecommerce search. Scores relevance, computes IR metrics, and flags quality issues
71	vola-trebla/llm-infrastructure	0	—	2026-03-14	Full-stack AI infrastructure - 5 projects from data ingestion to autonomous agents
72	whitecircle/circle-guard-bench	70	Python	2026-03-07	First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (g
73	tpertner/squeeze	5	Python	2026-03-01	Squeeze your model with pressure prompts to see if its behavior leaks.
74	grigio/llm-eval-simple	69	Python	2026-02-28	llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection
75	QuesmaOrg/BinaryAudit	92	Shell	2026-02-27	An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
76	paradime-io/dbt-llm-evals	28	Python	2026-02-10	The warehouse-native LLM evaluation package for dbt™ - monitor AI quality without data egress
77	Striveworks/valor	41	Python	2026-02-09	Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models.
78	TADSTech/llm-output-grader	0	Python	2026-01-24	systematic llm grading
79	3ahmood/Agentic-Author-CrewAI	1	Jupyter Notebook	2026-01-15	On device autonomous research and content writing using open-sourced LLMs and Crew AI.
80	Supahands/llm-comparison-backend	22	Python	2026-01-13	This is an opensource project allowing you to compare two LLM's head to head with a given prompt, this section will be r
81	thedataquarry/structured-outputs	28	Python	2025-12-23	Structured output benchmarks comparing DSPy and BAML with different LLMs
82	higuseonhye/worldsim-eval	0	—	2025-12-20	Evaluate AI agents by simulating world-level consequences.
83	yukincom/llm-SugarScape	6	Python	2025-11-28	Multi-agent simulation using LLMs. Agents autonomously decide actions for survival, reproduction, and social behavior in
84	IAAR-Shanghai/GuessArena	10	Python	2025-11-15	[ACL 2025] GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Re
85	artefactop/promptdev	2	Python	2025-09-22	A prompt evaluation framework that provides comprehensive testing for AI agents across multiple providers.
86	multinear/multinear	45	Python	2025-09-02	Develop reliable AI apps
87	attogram/ollama-multirun	16	Shell	2025-08-30	Run a prompt against all, or some, of your models running on Ollama. Creates web pages with the output, performance stat
88	khoj-ai/llm-coup	13	TypeScript	2025-08-18	Let LLMs play coup with each other and see who's the best at deception & strategy
89	jaaack-wang/multi-problem-eval-llm	3	Jupyter Notebook	2025-08-08	Evaluating LLMs with Multiple Problems at once: A New Paradigm for Probing LLM Capabilities
90	alan-turing-institute/prompto	38	Python	2025-07-18	An open source library for asynchronous querying of LLM endpoints
91	athina-ai/athina-evals	300	Python	2025-06-06	Python SDK for running evaluations on LLM generated responses
92	amplifying-ai/ai-product-bench	23	HTML	2025-05-27
93	regankight/mirror-model-eval-tests	0	—	2025-05-17	LLM behavior QA: tone collapse, false consent, and reroute logic scoring.
94	pyladiesams/eval-llm-based-apps-jan2025	8	Jupyter Notebook	2025-05-06	Create an evaluation framework for your LLM based app. Incorporate it into your test suite. Lay the monitoring foundatio
95	daqh/llm-eval	0	Python	2025-03-24	This project applies the LLM-Eval framework to the PersonaChat dataset to assess response quality in a conversational co
96	parea-ai/parea-sdk-py	82	Python	2025-02-13	Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
97	parea-ai/parea-sdk-ts	4	TypeScript	2025-01-17	TypeScript SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)
98	yukinagae/genkitx-promptfoo	7	TypeScript	2025-01-03	Community Plugin for Genkit to use Promptfoo
99	honeyhiveai/realign	19	Python	2024-12-04	Realign is a testing and simulation framework for AI applications.
100	harlev/eva-l	5	Python	2024-11-27	LLM Evaluation Framework

🔍 How it works

Every 15 minutes, a GitHub Action runs tracker.py. That script:

1. Fetches the latest state from the GitHub Search API.
2. Diffs it against data/items.json (the previous snapshot).
3. Rewrites the table above between its start and end markers.
4. Commits with the message "feat: +N added, -M removed (timestamp)" if anything changed.
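
The diff, rewrite, and commit-message steps can be sketched roughly as follows. This is an illustration, not the actual tracker.py: the marker strings, the item keys (`name`, `stars`, `lang`, `updated`, `description`), and all function names are assumptions made for the example.

```python
import re

# Hypothetical marker strings; the real markers used in README.md are not shown above.
START = "<!-- ITEMS:START -->"
END = "<!-- ITEMS:END -->"


def diff_snapshots(previous, current):
    """Return (added, removed) repo names between two snapshots, keyed on 'name'."""
    old = {item["name"] for item in previous}
    new = {item["name"] for item in current}
    return sorted(new - old), sorted(old - new)


def render_table(items):
    """Render items as a Markdown table with the same columns as the table above."""
    lines = [
        "| # | Name | ⭐ | Lang | Updated | Description |",
        "|---|------|---|------|---------|-------------|",
    ]
    for i, it in enumerate(items, start=1):
        lines.append(
            f"| {i} | {it['name']} | {it['stars']} | {it['lang']} "
            f"| {it['updated']} | {it['description']} |"
        )
    return "\n".join(lines)


def replace_between_markers(readme, table):
    """Swap whatever sits between START and END for the new table."""
    pattern = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)
    if not pattern.search(readme):
        raise ValueError("markers not found in README")
    # A function replacement avoids backslash-escape surprises in the table text.
    return pattern.sub(lambda _m: f"{START}\n{table}\n{END}", readme, count=1)


def commit_message(added, removed, timestamp):
    """Format the commit subject in the 'feat: +N added, -M removed (timestamp)' shape."""
    return f"feat: +{len(added)} added, -{len(removed)} removed ({timestamp})"
```

Keying the diff on the repository name keeps a star-count change from registering as an add plus a remove. Whether the real script also commits on such field-level changes is not stated here.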

No external services. No paid APIs. Just a public data source and a free GitHub Action.
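
In that spirit, the fetch step could be written with only the Python standard library. The sketch below is an assumption-laden illustration rather than the real script. The search query is not documented here, so it is left as a parameter. Taking the date from `updated_at` is a guess. The 120-character cut-off is inferred from the truncated descriptions in the table. The endpoint, query parameters, and response fields are those of GitHub's repository search API.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(query, per_page=100):
    """Build a repository-search URL, most recently updated repositories first."""
    params = {"q": query, "sort": "updated", "order": "desc", "per_page": per_page}
    return f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"


def to_row(repo, max_desc=120):
    """Reduce one search result to the columns shown in the table.

    The field choice and the truncation length are assumptions for this sketch.
    """
    return {
        "name": repo["full_name"],
        "stars": repo["stargazers_count"],
        "lang": repo.get("language"),
        "updated": repo["updated_at"][:10],  # keep the YYYY-MM-DD part
        "description": (repo.get("description") or "")[:max_desc],
    }


def fetch_items(query):
    """Call the search endpoint unauthenticated and return table rows."""
    request = urllib.request.Request(
        build_search_url(query),
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "tracker-sketch",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return [to_row(repo) for repo in payload["items"]]
```

Unauthenticated search requests are rate-limited, so a scheduled job would more plausibly send a token. The sketch omits that to stay minimal.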

🤝 Contributing

See CONTRIBUTING.md — usually you don't need to: the tracker keeps itself current. If you spot a data-source bug or want to suggest a new column for the table, open an issue.

🔗 Related live trackers

If you find this useful, you might also like these other auto-updated trackers from the same maintainer — same mechanism, different upstream:

trending-claude-skills — What's shipping in Claude Skills this week (topic:claude-skills)
mcp-servers-live — Live index of newest MCP servers (topic:mcp-server)
cursor-rules-live — Newest Cursor rules and .cursorrules patterns (topic:cursor-rules)
claude-code-plugin-tracker — Claude Code plugins and hook configs (topic:claude-code)
llm-agents-radar — Newest LLM agent frameworks (topic:llm-agent)
rag-radar — Newest RAG implementations and tools (topic:rag)
llm-eval-tracker — Newest LLM evaluation tools and benchmarks (topic:llm-eval)
agent-framework-radar — Newest agent frameworks shipping on GitHub (topic:agent-framework)
vector-db-live — Newest vector DB projects and integrations (topic:vector-database)
llmops-radar — Newest LLMOps tooling (observability, deployment) (topic:llmops)
prompt-tools-live — Newest prompt-engineering tools and prompt repos (topic:prompt-engineering)
skills-tracker — Tracking new GitHub 'skills' repos (topic:agent-skills)
awesome-agent-skills — Curated auto-updated awesome-list of AI agent skills (topic:agent-skills)

📜 License

MIT — see LICENSE.

More from linny006

Awesome Agent Skills — Curated, auto-updated awesome-list of vetted AI agent skills with quality ratings for Claude, GPT, and open-source agents (⭐ 0)
Agent Skills Daily Tracker — Real-time tracking of every new GitHub 'skills' repo to capture the AI agent skill ecosystem trend (⭐ 0)
Agent Eval Harness — Live, open-source benchmark for comparing AI coding agents on real GitHub issues (⭐ 0)
Prompt Tools Live — Live-updating tracker of prompt engineering tools, libraries, and techniques — refreshed every 15 minutes (⭐ 0)
LLMOps Radar — Live index of the newest LLMOps tooling — track what's shipping in LLM observability and deployment (⭐ 0)

