
# Quick Start Guide

## Installation

```bash
# Install dependencies
pip install -r requirements.txt
```

## Run Without API Keys (Demo Mode)

```bash
# See example test cases
python demo.py

# Dry run to see what would be tested
python benchmark.py --no-api --num-samples 10
```

## Run Tests

```bash
# Run unit tests
python test_benchmark.py
```
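At its core, what the tests exercise is counting how often a letter occurs in a word. A minimal sketch of that check (the function name is hypothetical, and case-insensitivity is an assumption; the real implementation lives in `benchmark.py`):

```python
def expected_count(word: str, letter: str) -> int:
    """Count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())

print(expected_count("strawberry", "R"))  # prints 3
```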

## Run Benchmark (Requires API Keys)

1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` and add your API key(s).

3. Run the benchmark:

   ```bash
   # With OpenAI
   python benchmark.py --provider openai --num-samples 20

   # With Anthropic
   python benchmark.py --provider anthropic --num-samples 20

   # With a specific model
   python benchmark.py --provider openai --model gpt-4 --num-samples 10
   ```
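The `.env` file from step 2 holds the key(s). The exact variable names this project reads are listed in `.env.example`; they typically look something like this (names below are common conventions, not confirmed by this repo):

```bash
# .env — example only; check .env.example for the actual variable names.
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
```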

## Example Commands

```bash
# Quick test with 5 samples
python benchmark.py --num-samples 5

# Test only words with 4+ letter occurrences
python benchmark.py --min-letter-count 4 --num-samples 15

# Test longer words (10-20 characters)
python benchmark.py --min-word-length 10 --max-word-length 20 --num-samples 10

# Dry run to preview test cases
python benchmark.py --no-api --min-letter-count 4
```

## Understanding Results

Results are saved to `results/benchmark_results_TIMESTAMP.json` and include:

- Each test case and its result
- The LLM's raw response
- Whether the answer was correct
- Summary statistics

Example result:

```json
{
  "word": "strawberry",
  "letter": "R",
  "expected_count": 3,
  "llm_count": 2,
  "correct": false,
  "response": "2",
  "model": "gpt-3.5-turbo"
}
```
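The per-case records above can be aggregated into an accuracy figure with a short script. This is a sketch, not part of the repo: the field name `correct` matches the example above, and the top-level JSON shape is assumed to be a list of such records (adjust if your file nests them under a key):

```python
import json
from pathlib import Path

def summarize(cases):
    """Return (number correct, total) for a list of result dicts."""
    correct = sum(1 for case in cases if case.get("correct"))
    return correct, len(cases)

# Pick the newest results file, if any exist yet.
latest = max(Path("results").glob("benchmark_results_*.json"), default=None)
if latest:
    cases = json.loads(latest.read_text())
    correct, total = summarize(cases)
    print(f"{latest.name}: {correct}/{total} correct ({correct / total:.0%})")
```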