A powerful, local-first framework designed to compare, evaluate, and train AI models (OpenAI, Anthropic, Google, Open Router, LM Studio, Ollama). Measure accuracy, latency, and token usage while refining model behavior through advanced judge personas and iterative training loops.
- Multi-Model Evaluation: Run instructions against GPT-4, Claude 3, Gemini 2.0, Open Router, LM Studio, and Ollama simultaneously.
- LLM-as-a-Judge: Specialized iterative training system for judge personas.
- Iterative Training: Refine judge prompts based on human feedback until convergence (F1 ≥ 0.80).
- Metrics: Automated calculation of F1 Score, Precision, Recall, and Cohen's Kappa.
- Human-in-the-Loop: Mandatory human review for early iterations to ground AI judgment.
- Advanced Configuration: Full control over System Prompts and Temperature (0.0 - 2.0).
- Accuracy Rubrics:
- Exact Match: String-level identity check.
- Partial Credit: Keyword/concept detection.
- Semantic Similarity: LLM-based meaning alignment.
- Data Management:
- Templates: Save and rerun benchmarks easily.
- Bulk Actions: Batch delete and advanced filtering (Date, Rubric, Score).
- CSV Support: Upload training pairs for judge training.
- Modern Developer Experience:
- Astro 5 SSR: High-performance server-side rendering.
- Tailwind CSS 4 & DaisyUI 5: Beautiful, themeable interface (Silk, Luxury, Cupcake, Nord).
- SQLite (better-sqlite3): Fast, ACID-compliant local persistence with encrypted API keys.
├── .storybook/ # Component documentation and isolated testing
├── db/ # Database management
│ ├── migrations/ # SQL schema versioning
│ ├── init.js # DB initialization logic
│ └── schema.sql # Core schema (9+ tables for evaluations & training)
├── public/ # Static assets
├── specs/ # Detailed feature specifications and design docs
├── src/ # Application Source
│ ├── components/ # UI Components
│ │ ├── layout/ # Navbar, ThemeController, Breadcrumbs
│ │ ├── ui/ # Atom components (Button, Input, Badge, Card)
│ │ └── [Feature].astro # Specialized components (MetricCard, ConfusionMatrix)
│ ├── lib/ # Business Logic
│ │ ├── db/ # Database access layer (persona-db.ts, etc.)
│ │ ├── evaluation/ # Evaluator orchestration and API clients
│ │ ├── training/ # LLM-as-Judge training loop and prompt engineering
│ │ ├── validation/ # Zod/Manual validation schemas
│ │ └── utils/ # Encryption, formatting, and metrics helpers
│ ├── pages/ # Astro routes & API endpoints
│ │ ├── api/ # REST API implementation
│ │ ├── evaluations/ # Result details
│ │ └── personas/ # Judge training workflows
│ └── styles/ # Tailwind CSS 4 configuration and global styles
├── tests/ # Comprehensive Test Suite
│ ├── unit/ # Logic & Metrics testing (Vitest)
│ ├── integration/ # API & DB flow testing (Vitest)
│ └── e2e/ # Workflow testing (Playwright)
├── openapi.yml # Full REST API Specification
└── astro.config.mjs # Astro & Vite configuration
- Node.js: v22.0.0 or higher
- npm: v10.0.0 or higher
- API Keys: OpenAI, Anthropic, Google Gemini, or Open Router (for cloud providers)
- Local LLM: LM Studio or Ollama (for local evaluation, optional)
- Clone and Install
  git clone <repository-url>
  cd eval-ai-models
  npm install
- Environment Configuration
  cp .env.example .env
  # Generate a 32-byte hex key for API key encryption
  openssl rand -hex 32
  # Add this to ENCRYPTION_KEY in .env
- Initialize Database
  npm run db:init
- Run Development Server
  npm run dev
The application will be available at http://localhost:3000.
The framework supports multiple AI providers through a unified evaluation interface:
| Provider | Models | API Key Format | Notes |
|---|---|---|---|
| OpenAI | GPT-4, GPT-4o, o1, o3 | sk-... | Requires API key |
| Anthropic | Claude 3 Opus/Sonnet/Haiku | sk-ant-... | Requires API key |
| Google | Gemini 1.5/2.0 Pro | AIza... | Requires API key (39+ chars) |
| Open Router | Multi-provider access | sk-or-... | Requires API key |
| Provider | Models | Auth | Default Endpoint | Notes |
|---|---|---|---|---|
| LM Studio | Local LLMs (Llama, Mistral, etc.) | None | http://localhost:1234/v1 | No API key required |
| Ollama | Local LLMs (Llama, Mistral, etc.) | None | http://localhost:11434 | No API key required |
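Both local providers expose an OpenAI-compatible chat endpoint (LM Studio under /v1, and Ollama also offers /v1/chat/completions alongside its native API), so one request builder can target either local or cloud backends. A hedged sketch, not the framework's actual client; the model name is an example:

```typescript
// Build a chat-completion request for any OpenAI-compatible endpoint.
interface ChatTarget {
  baseUrl: string; // e.g. "http://localhost:1234/v1" for LM Studio
  apiKey?: string; // omitted for local providers
  model: string;
}

function buildChatRequest(target: ChatTarget, prompt: string) {
  const headers: Record<string, string> = {
    "Content-Type": "application/json",
  };
  // Cloud providers need a bearer token; local ones do not.
  if (target.apiKey) headers["Authorization"] = `Bearer ${target.apiKey}`;
  return {
    url: `${target.baseUrl}/chat/completions`,
    options: {
      method: "POST",
      headers,
      body: JSON.stringify({
        model: target.model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

// Local LM Studio target: no API key required.
const req = buildChatRequest(
  { baseUrl: "http://localhost:1234/v1", model: "llama-3.1-8b" },
  "Say hello."
);
// Sending it is then just: await fetch(req.url, req.options)
```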
- Navigate to Models page
- Click Add Model
- Select Provider from dropdown
- Enter Model Name (must match provider's model identifier)
- For cloud providers: Enter API Key
- For local providers: Optionally configure Base URL (uses default if omitted)
- Click Test to verify connection
- Click Add Model to save
Open Router: Access multiple AI models through a single API. Visit openrouter.ai to get your API key and browse available models.
LM Studio: Download from lmstudio.ai, start the server, and load your preferred model.
Ollama: Install from ollama.com, start the server with ollama serve, and pull models with ollama pull <model-name>.
The project uses a contract-first approach. The complete REST API documentation is maintained in: 👉 openapi.yml
Key API Modules:
- /api/models: Model configuration and encryption.
- /api/evaluate: Core evaluation execution.
- /api/personas: Judge persona management and training iterations.
- /api/templates: Reusable benchmark configurations.
Strict adherence to Constitution Principle II: Tests are written first for all critical paths.
npm test # Run unit and integration tests
npm run test:coverage # Verify >80% coverage on critical paths
npm run test:e2e # Run Playwright end-to-end tests
npm run typecheck # Verify TypeScript strict mode

This project uses automated quality enforcement to maintain code standards:
Local Development (Pre-commit Hooks):
- Pre-commit: Runs ESLint and Prettier on staged files (*.ts, *.tsx, *.astro, *.js, *.jsx)
- Typecheck is intentionally excluded from pre-commit for faster incremental development
- Run typecheck manually before creating PRs
- Pre-push: Runs full test suite (optional, can be skipped with --no-verify)
- Hooks are installed automatically via npm install
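With simple-git-hooks, this hook wiring typically lives in package.json; a fragment like the one below would match the behavior described above. This is an illustrative sketch (the staged-file runner, here lint-staged, and the exact commands are assumptions, not the project's verified configuration):

```json
{
  "simple-git-hooks": {
    "pre-commit": "npx lint-staged",
    "pre-push": "npm test"
  },
  "lint-staged": {
    "*.{ts,tsx,astro,js,jsx}": ["eslint --fix", "prettier --write"]
  }
}
```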
Continuous Integration (GitHub Actions):
- Runs on every pull request and push to main
- Parallel jobs: Lint, Type Check, Test, Format Check
- All checks must pass before merging (configure branch protection rules)
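The four parallel jobs could be expressed as a small matrix workflow. A sketch under assumptions: the workflow path, job layout, and per-check script names are illustrative and may differ from the repository's actual workflow:

```yaml
# .github/workflows/ci.yml (illustrative only)
name: CI
on:
  push:
    branches: [main]
  pull_request:

jobs:
  checks:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        task: [lint, typecheck, test, format]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npm run ${{ matrix.task }}
```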
Manual Verification:
npm run lint # ESLint check
npm run typecheck # TypeScript strict mode + Astro component check
npm run format # Prettier auto-format
npm test # Run full test suite

To reinstall git hooks manually:
npm run prepare # Sets up git hooks via simple-git-hooks

Or use the direct command:
npx simple-git-hooks # Install git hooks from package.json config

We use Storybook for component-driven development.
npm run storybook

To wipe the local database and start fresh:
npm run db:reset

MIT