```bash
# Install dependencies
pip install -r requirements.txt

# See example test cases
python demo.py

# Dry run to see what would be tested
python benchmark.py --no-api --num-samples 10

# Run unit tests
python test_benchmark.py
```

- Copy the example environment file:
  ```bash
  cp .env.example .env
  ```

- Edit `.env` and add your API key(s)
- Run the benchmark:
```bash
# With OpenAI
python benchmark.py --provider openai --num-samples 20

# With Anthropic
python benchmark.py --provider anthropic --num-samples 20

# With specific model
python benchmark.py --provider openai --model gpt-4 --num-samples 10
```

```bash
# Quick test with 5 samples
python benchmark.py --num-samples 5

# Test only words with 4+ letter occurrences
python benchmark.py --min-letter-count 4 --num-samples 15

# Test longer words (10-20 characters)
python benchmark.py --min-word-length 10 --max-word-length 20 --num-samples 10

# Dry run to preview test cases
python benchmark.py --no-api --min-letter-count 4
```

Results are saved to `results/benchmark_results_TIMESTAMP.json` with:
- Each test case and its result
- The LLM's response
- Whether it was correct
- Summary statistics
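The saved records can be aggregated after a run without calling any API again. A minimal sketch, assuming the file holds a JSON list of records shaped like the example result shown in this README (`summarize` is a hypothetical helper, not part of the repo):

```python
import glob
import json

def summarize(results):
    """Count correct answers in a list of per-case result records."""
    correct = sum(1 for r in results if r["correct"])
    return correct, len(results)

# Load the most recent results file, if one exists
# (path layout assumed from the description above).
paths = sorted(glob.glob("results/benchmark_results_*.json"))
if paths:
    with open(paths[-1]) as f:
        records = json.load(f)
    correct, total = summarize(records)
    print(f"Accuracy: {correct}/{total}")
```

The same helper works on any slice of the records, e.g. to compare accuracy per model or per letter.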
Example result:

```json
{
  "word": "strawberry",
  "letter": "R",
  "expected_count": 3,
  "llm_count": 2,
  "correct": false,
  "response": "2",
  "model": "gpt-3.5-turbo"
}
```
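The `expected_count` field is the ground-truth number of occurrences, which a case-insensitive count reproduces. A minimal sketch of that check (the benchmark's own counting code may differ):

```python
def count_letter(word: str, letter: str) -> int:
    # Case-insensitive occurrence count, matching the example's
    # "strawberry" / "R" ground truth of 3.
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "R"))  # -> 3
```

A record is marked `correct` when the parsed `llm_count` equals this value; in the example above the model answered 2, so `correct` is `false`.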