Inspect AI Custom Scorers

This repository contains custom scorers for Inspect AI, a Python framework for evaluating the performance of language models on various tasks.

Overview

The scorers in this repository provide two different evaluation mechanisms:

Fact Comparator Scorer: This scorer measures the overlap between the facts in a predefined target text and the facts in the output a language model generates in response to a given input or question. It is implemented by the FactComparator class in inspect_ai_scorers.fact_comparator, which uses a series of prompts to parse the text being scored and the target text into lists of individual facts, then compares the two lists to identify shared and unique facts. It reports two metrics, each between 0 and 1: groundedness, the fraction of facts in the model's output that also appear in the target text, and thoroughness, the fraction of facts in the target text that also appear in the model's output.
Prompt Evaluator: This scorer lets you define a rubric in the target field and uses it to evaluate the model's output. It is implemented by the PromptEvaluator class in inspect_ai_scorers.prompt_evaluator. The target text should contain instructions such as "Return PASS if the answer contains that the sun is 4.6 billion years old, return FAIL otherwise." The scorer evaluates the model's output against these criteria and returns a score of 1 (PASS) or 0 (FAIL).
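To make the two Fact Comparator metrics concrete, here is a minimal sketch of the arithmetic once both texts have been reduced to fact lists. It is not the FactComparator implementation: the function name compare_facts is hypothetical, exact string matching stands in for the prompt-based comparison, the example facts are illustrative, and returning 0.0 for an empty list is an assumption.

```python
def compare_facts(output_facts, target_facts):
    """Compute groundedness and thoroughness from two lists of facts.

    Assumption: facts match only when their strings are identical.
    The real scorer relies on model prompts to decide which facts match.
    """
    output_set = set(output_facts)
    target_set = set(target_facts)
    shared = output_set & target_set

    # Groundedness: fraction of output facts that also appear in the target.
    groundedness = len(shared) / len(output_set) if output_set else 0.0
    # Thoroughness: fraction of target facts that also appear in the output.
    thoroughness = len(shared) / len(target_set) if target_set else 0.0

    return {
        "groundedness": groundedness,
        "thoroughness": thoroughness,
        "shared": sorted(shared),
        "only_in_output": sorted(output_set - target_set),
        "only_in_target": sorted(target_set - output_set),
    }


# Illustrative fact lists (hypothetical, not taken from the repository).
output_facts = [
    "the sun is a star",
    "the sun is 4.6 billion years old",
    "the sun is cold",
]
target_facts = [
    "the sun is a star",
    "the sun is 4.6 billion years old",
    "the sun is mostly hydrogen",
    "the sun is at the center of the solar system",
]

result = compare_facts(output_facts, target_facts)
print(result["groundedness"], result["thoroughness"])
```

Here two of the three output facts are in the target (groundedness of 2/3), and two of the four target facts are in the output (thoroughness of 0.5).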

Testing

Preliminary unit tests for the FactComparator and PromptEvaluator classes are provided in the tests directory. They do not assess the quality of the scorers' judgments; they only check that each scorer runs as part of a Task that queries a model, receives a response, and evaluates that response with the scorer. In these tests, the input field holds the question that is passed to the model to produce the response.

Examples

The examples directory contains demonstrations of how to use the custom scorers in different scenarios. There are two main types of examples:

Scorer-Only Examples

These examples demonstrate the use of the scorers directly on pre-defined inputs and targets, without involving a language model for generating responses. They are useful for testing the scorers themselves and understanding how they evaluate different types of inputs.

Fact Comparator Examples (fact_comparator_examples.py): This script demonstrates the use of the FactComparator scorer on a variety of pre-defined input-target pairs. It shows how the scorer evaluates groundedness and thoroughness for different scenarios, such as partial fact overlap, restructured information, and incorrect information.
Prompt Evaluator Examples (prompt_evaluator_examples.py): This script showcases the PromptEvaluator scorer, demonstrating how it evaluates inputs based on specific criteria defined in the target text. It includes examples of both PASS and FAIL scenarios for various types of questions and conditions.

To run these examples, use:

python examples/fact_comparator_examples.py --model <model_name>
python examples/prompt_evaluator_examples.py --model <model_name>

Replace <model_name> with the desired model for evaluation (e.g., 'openai/gpt-4').
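The PASS/FAIL convention used by the Prompt Evaluator can be illustrated with a small sketch that turns an evaluator's verdict string into the 1 or 0 score described above. The helper name verdict_to_score is hypothetical, and treating anything other than PASS or FAIL as an error is an assumption rather than the package's documented behavior.

```python
def verdict_to_score(verdict: str) -> int:
    """Map a rubric verdict to a numeric score: PASS -> 1, FAIL -> 0.

    Assumption: surrounding whitespace and letter case are ignored, and
    any other verdict raises an error instead of being guessed at.
    """
    cleaned = verdict.strip().upper()
    if cleaned == "PASS":
        return 1
    if cleaned == "FAIL":
        return 0
    raise ValueError(f"Unexpected verdict: {verdict!r}")


print(verdict_to_score("PASS"))
print(verdict_to_score(" fail\n"))
```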

Full Task Examples

The full task examples demonstrate how to use the scorers as part of a complete evaluation pipeline, where a language model generates responses to questions, and these responses are then evaluated using our custom scorers.

Full Task Example (full_task_examples.py): This script shows how to set up and run a complete evaluation task using both the FactComparator and PromptEvaluator scorers. It demonstrates:

How to define evaluation tasks using the @task decorator
How to specify different models for querying (generating responses) and evaluation (scoring the responses)
How to run the evaluation
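The separation between a query model and an evaluation model listed above can be sketched in plain Python. This sketch deliberately does not use the Inspect AI API or the @task decorator: run_pipeline, the stub models, and the "MODEL ANSWER:" marker are all hypothetical stand-ins meant only to show the flow of a question, a response, and a rubric-based score.

```python
from typing import Callable, Dict, List

# A "model" here is just a function from a prompt string to a reply string.
Model = Callable[[str], str]


def run_pipeline(samples: List[Dict[str, str]],
                 query_model: Model,
                 eval_model: Model) -> List[int]:
    """Query one model for answers, then score them with another model."""
    scores = []
    for sample in samples:
        # Step 1: the query model answers the question in the input field.
        response = query_model(sample["input"])
        # Step 2: the evaluation model sees the rubric (target) and the answer.
        grading_prompt = f"{sample['target']}\n\nMODEL ANSWER: {response}"
        verdict = eval_model(grading_prompt)
        # Step 3: PASS maps to 1, anything else to 0.
        scores.append(1 if verdict.strip().upper() == "PASS" else 0)
    return scores


def stub_query_model(prompt: str) -> str:
    # Stand-in for a real language model call.
    return "The sun is about 4.6 billion years old."


def stub_eval_model(prompt: str) -> str:
    # Stand-in grader: inspects only the answer part, not the rubric text.
    answer = prompt.split("MODEL ANSWER:", 1)[1]
    return "PASS" if "4.6 billion" in answer else "FAIL"


samples = [{
    "input": "How old is the sun?",
    "target": ("Return PASS if the answer contains that the sun is "
               "4.6 billion years old, return FAIL otherwise"),
}]

scores = run_pipeline(samples, stub_query_model, stub_eval_model)
print(scores)
```

Keeping the two models as separate arguments mirrors the idea of generating responses with one model while scoring them with another.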

To run the full task example, use:

python examples/full_task_examples.py --eval_model <eval_model_name> --query_model <query_model_name>

Replace <eval_model_name> with the model you want to use for evaluation (e.g., 'openai/gpt-3.5-turbo') and <query_model_name> with the model you want to use for generating responses (e.g., 'openai/gpt-4').
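For readers adapting the script, the following sketch shows one way two such flags could be parsed with argparse. Only the flag names --eval_model and --query_model come from the command above; making both required, the help texts, and the parse_args helper are assumptions, and the actual script may handle its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the two model flags (a sketch, not the script's actual parser)."""
    parser = argparse.ArgumentParser(
        description="Run a full task with separate query and evaluation models."
    )
    parser.add_argument("--eval_model", required=True,
                        help="Model used to score responses")
    parser.add_argument("--query_model", required=True,
                        help="Model used to generate responses")
    return parser.parse_args(argv)


# Passing a list makes the example runnable without a real command line.
args = parse_args([
    "--eval_model", "openai/gpt-3.5-turbo",
    "--query_model", "openai/gpt-4",
])
print(args.eval_model, args.query_model)
```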

License

This project is licensed under the MIT License.

Installation

To use these scorers, install the inspect-ai-scorers package from Test PyPI:

pip install --index-url https://test.pypi.org/simple/ inspect-ai-scorers

To use the tests or examples, clone the repo:

git clone https://github.com/abigailhaddad/inspect_ai_eval
