This repository contains custom scorers for Inspect AI, a Python framework for evaluating the performance of language models on various tasks.
The scorers in this repository provide two different evaluation mechanisms:
- Fact Comparator Scorer: This scorer evaluates the overlap between facts in a predefined target text and the actual output generated by a language model in response to a given input or question. It calculates two metrics: groundedness and thoroughness. The FactComparator class in inspect_ai_scorers.fact_comparator is responsible for this evaluation. It uses a series of prompts to parse the input and target texts into lists of individual facts, and then compares these fact lists to determine the shared and unique facts. Groundedness is the fraction of facts in the model's output that are present in the target text, while thoroughness is the fraction of facts in the target text that are present in the model's output. Each metric is a value between 0 and 1.
- Prompt Evaluator: This scorer allows you to define a rubric in the target field, which is then used to evaluate the model's output. The rubric specifies whether the output should be considered a PASS or a FAIL based on certain criteria. The PromptEvaluator class in inspect_ai_scorers.prompt_evaluator implements this functionality. It takes the target text, which should contain instructions like "Return PASS if the answer contains that the sun is 4.6 billion years old, return FAIL otherwise," and the model's output. It then evaluates the output based on the provided criteria and returns a score of 1 (PASS) or 0 (FAIL).
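The arithmetic behind both scores can be sketched in a few lines. This is a minimal illustration, not the package's implementation: the helper names overlap_scores and verdict_to_score are hypothetical, facts are assumed to be already extracted and are matched by exact string equality (the real scorer uses model prompts for extraction and comparison), and returning 0.0 for an empty fact list is an assumption.

```python
from typing import Dict, List


def overlap_scores(output_facts: List[str], target_facts: List[str]) -> Dict[str, float]:
    """Hypothetical helper: groundedness and thoroughness from two fact lists."""
    # Assumption: two facts are "the same" if they match after trimming and lowercasing.
    output_set = {fact.strip().lower() for fact in output_facts}
    target_set = {fact.strip().lower() for fact in target_facts}
    shared = output_set & target_set

    # Groundedness: share of the output's facts that also appear in the target.
    groundedness = len(shared) / len(output_set) if output_set else 0.0
    # Thoroughness: share of the target's facts that also appear in the output.
    thoroughness = len(shared) / len(target_set) if target_set else 0.0
    return {"groundedness": groundedness, "thoroughness": thoroughness}


def verdict_to_score(verdict: str) -> int:
    """Hypothetical helper: map a PASS/FAIL verdict to 1 or 0."""
    # Assumption: anything other than PASS (ignoring case and whitespace) scores 0.
    return 1 if verdict.strip().upper() == "PASS" else 0


target = ["The sun is a star", "The sun is 4.6 billion years old"]
output = [
    "The sun is a star",
    "The sun is 4.6 billion years old",
    "The sun is made of cheese",
]
scores = overlap_scores(output, target)
```

In this example two of the three output facts appear in the target, so groundedness is 2/3, and both target facts appear in the output, so thoroughness is 1.0.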
Preliminary unit tests for the FactComparator and PromptEvaluator classes are provided in the tests directory. They do not check the quality of the scorers' judgments; they only check that each scorer can run as part of a Task that queries a model, receives a response, and then evaluates that response with the scorer. In these tests, the input field holds the question that is passed to the model to get the response.
The examples directory contains demonstrations of how to use the custom scorers in different scenarios. There are two main types of examples:
The first type applies the scorers directly to pre-defined inputs and targets, without involving a language model for generating responses. These examples are useful for testing the scorers themselves and understanding how they evaluate different types of inputs.
- Fact Comparator Examples (fact_comparator_examples.py): This script demonstrates the use of the FactComparator scorer on a variety of pre-defined input-target pairs. It shows how the scorer evaluates groundedness and thoroughness for different scenarios, such as partial fact overlap, restructured information, and incorrect information.
- Prompt Evaluator Examples (prompt_evaluator_examples.py): This script showcases the PromptEvaluator scorer, demonstrating how it evaluates inputs based on specific criteria defined in the target text. It includes examples of both PASS and FAIL scenarios for various types of questions and conditions.
To run these examples, use:
python examples/fact_comparator_examples.py --model <model_name>
python examples/prompt_evaluator_examples.py --model <model_name>
Replace <model_name> with the desired model for evaluation (e.g., 'openai/gpt-4').
The full task examples demonstrate how to use the scorers as part of a complete evaluation pipeline, where a language model generates responses to questions, and these responses are then evaluated using our custom scorers.
Full Task Example (full_task_examples.py):
This script shows how to set up and run a complete evaluation task using both the FactComparator and PromptEvaluator scorers. It demonstrates:
- How to define evaluation tasks using the @task decorator
- How to specify different models for querying (generating responses) and evaluation (scoring the responses)
- How to run the evaluation
To run the full task example, use:
python examples/full_task_examples.py --eval_model <eval_model_name> --query_model <query_model_name>
Replace <eval_model_name> with the model you want to use for evaluation (e.g., 'openai/gpt-3.5-turbo') and <query_model_name> with the model you want to use for generating responses (e.g., 'openai/gpt-4').
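For readers adapting the script, the two flags could be parsed with the standard library's argparse along these lines. This is only a sketch: the flag names come from the command above, while the parse_args function, the help strings, and the choice to make both flags required are assumptions and may not match full_task_examples.py.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical parser for the two model flags used by the full task example."""
    parser = argparse.ArgumentParser(
        description="Run a full evaluation task with separate query and eval models."
    )
    # Assumption: both flags are required; the real script may define defaults instead.
    parser.add_argument(
        "--eval_model", required=True, help="Model used to score the responses."
    )
    parser.add_argument(
        "--query_model", required=True, help="Model used to generate the responses."
    )
    # Passing argv=None makes argparse read from sys.argv, as on the command line.
    return parser.parse_args(argv)


args = parse_args(
    ["--eval_model", "openai/gpt-3.5-turbo", "--query_model", "openai/gpt-4"]
)
print(f"query model: {args.query_model}, eval model: {args.eval_model}")
```

Keeping the two model names in separate arguments makes it possible to generate responses with one model and score them with another.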
This project is licensed under the MIT License.
To use these scorers, install the inspect-ai-scorers package from Test PyPI:
pip install --index-url https://test.pypi.org/simple/ inspect-ai-scorers
To use the tests or examples, clone the repo:
git clone https://github.com/abigailhaddad/inspect_ai_eval