AI evaluation is a critical component in building reliable, impactful models. We’re excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process. Complementing judges is autojudge, an extension that automatically creates evaluators aligned with human feedback.
## judges: Research-Backed LLM-as-a-Judge
judges provides a curated collection of evaluators, backed by published research, to help jumpstart your LLM evaluation process. These LLM-as-a-judge evaluators can be used either out of the box or as a foundation for your specific needs.
Key Features of judges:
- Curated, Research-Backed LLM-as-a-judge Prompts: Every judge prompt is thoughtfully designed based on cutting-edge research and curated to ensure high-quality evaluations.
- Juries: A jury of LLMs enables more diverse results by combining judgments from multiple LLMs (a jury sketch appears after the correctness example below).
- Flexible Model Integration: Compatible with both open-source and closed-source models through OpenAI and LiteLLM integrations.
- Human-Aligned Evaluators: autojudge automatically builds human-aligned LLM-as-a-judge prompts from small labeled datasets.
Getting started with judges
Install the library:
pip install judges
Pick a model:
- OpenAI: By default, judges uses the OpenAI client and models. To get started, you’ll need an OpenAI API key set as an environment variable OPENAI_API_KEY.
- LiteLLM: judges also integrates with litellm to allow access to most other models. Run pip install judges[litellm], and set the appropriate API keys based on the LiteLLM Docs.
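As a quick illustration, the snippet below shows one way to set the key from Python before constructing a judge. The commented-out LiteLLM model string is only an assumption about how non-OpenAI models are selected; verify the exact identifiers against the LiteLLM docs and the judges repository.
import os

# The OpenAI client reads the key from the environment; replace the placeholder
# below, or simply export OPENAI_API_KEY in your shell instead.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"

from judges.classifiers.correctness import PollMultihopCorrectness

# Default path: pass an OpenAI model name directly.
judge = PollMultihopCorrectness(model='gpt-4o-mini')

# With the litellm extra installed, other providers should be reachable by
# passing a LiteLLM-style model identifier instead (an assumption to verify
# against the repository and the LiteLLM docs):
# judge = PollMultihopCorrectness(model='claude-3-5-sonnet-20240620')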
Send data to an LLM:
from openai import OpenAI
client = OpenAI()
question = "What is the name of the rabbit in the following story? Respond with 'I don't know' if you don't know."
story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.
One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.
But instead of chasing it, Fig barked in excitement, as if saying, “Nice to meet you!” The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.
From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""
# set up the input prompt
input = f'{story}\n\nQuestion:{question}'
# write down what the model is expected to respond with
# NOTE: not all judges require an expected answer. refer to the implementations
expected = "I don't know"
# get the model output
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {
            'role': 'user',
            'content': input,
        },
    ],
).choices[0].message.content
Use a judges classifier LLM as an evaluator model:
from judges.classifiers.correctness import PollMultihopCorrectness
# use the correctness classifier to determine if the first model
# answered correctly
correctness = PollMultihopCorrectness(model='gpt-4o-mini')
judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.
print(judgment.score)
# True
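The Juries feature listed above combines several judges into a single verdict. Below is a minimal sketch that reuses the input, output, and expected values from this walkthrough; the Jury class, its voting_method argument, and the vote method are written as assumptions based on the library’s description, so verify the exact names against the repository.
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness

# Two correctness judges backed by different models, pooled into one verdict.
panel = [
    PollMultihopCorrectness(model='gpt-4o'),
    PollMultihopCorrectness(model='gpt-4o-mini'),
]

jury = Jury(judges=panel, voting_method='average')  # assumed constructor signature
verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)
print(verdict.score)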
## autojudge: Automating Human-Aligned Evaluations
While judges provides ready-to-use evaluators, autojudge extends this functionality by automating evaluator creation. Given a labeled dataset with feedback and a natural language description of an evaluation task, it generates grading notes for an evaluator prompt, streamlining the process of building new evaluators.
How autojudge Works:
Install the library extension:
pip install "judges[auto]"
Prepare your dataset: Your dataset can be either a list of dictionaries or a path to a CSV file with the following fields:
- input: The input provided to your model
- output: The model’s response
- label: 1 for correct, 0 for incorrect
- feedback: Feedback explaining why the response is correct or incorrect
dataset = [
    {
        "input": "What's the best time to visit Paris?",
        "output": "The best time to visit Paris is during the spring or fall.",
        "label": 1,
        "feedback": "Provides accurate and detailed advice."
    },
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional."
    }
]
Initialize your autojudge:
from judges.classifiers.auto import AutoJudge
dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
    {
        "input": "Can I drink tap water in London?",
        "output": "Yes, tap water in London is safe to drink and meets high quality standards.",
        "label": 1,
        "feedback": "Gives clear and reassuring information.",
    },
    {
        "input": "What's the boiling point of water on the moon?",
        "output": "The boiling point of water on the moon is 100°C, the same as on Earth.",
        "label": 0,
        "feedback": "Boiling point varies with pressure; the moon's vacuum affects it.",
    }
]
# Task description
task = "Evaluate responses for accuracy, clarity, and helpfulness."
# Initialize autojudge
autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
    # increase workers for speed ⚡
    # max_workers=2,
    # generated prompts are automatically saved to disk
    # save_to_disk=False,
)
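As noted above, the dataset can also be supplied as a path to a CSV file with the same four columns. The sketch below writes the in-memory examples to disk with the standard library and passes the file path to from_dataset; whether the path should be a plain string or a pathlib.Path is an assumption worth checking against the repository.
import csv

# Persist the same examples as a CSV with the documented columns.
with open("autojudge_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output", "label", "feedback"])
    writer.writeheader()
    writer.writerows(dataset)

# Build the judge from the CSV file instead of the list of dictionaries.
autojudge_from_csv = AutoJudge.from_dataset(
    dataset="autojudge_dataset.csv",
    task=task,
    model="gpt-4-turbo-2024-04-09",
)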
Use your judge to evaluate new input-output pairs:
# Input-output pair to evaluate
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."
# Get the judgment
judgment = autojudge.judge(input=input_, output=output)
# Print the judgment
print(judgment.reasoning)
# The response accurately lists popular attractions like the Statue of Liberty and Central Park, which are well-known and relevant to the user's query.
print(judgment.score)
# True (correct)
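Because the judge is distilled from human feedback, it is worth spot-checking its agreement with a few held-out, human-labeled pairs before trusting it at scale. A minimal sketch (the held-out examples below are invented purely for illustration):
# Hypothetical held-out examples with human labels (1 = correct, 0 = incorrect).
held_out = [
    {
        "input": "Is the Eiffel Tower in Berlin?",
        "output": "No, the Eiffel Tower is in Paris, France.",
        "label": 1,
    },
    {
        "input": "How long is a flight from London to New York?",
        "output": "It takes about 45 minutes by high-speed train.",
        "label": 0,
    },
]

agreements = 0
for example in held_out:
    judgment = autojudge.judge(input=example["input"], output=example["output"])
    # judgment.score is the judge's verdict; compare it to the human label.
    agreements += int(bool(judgment.score) == bool(example["label"]))

print(f"Agreement with human labels: {agreements}/{len(held_out)}")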
Why autojudge Matters:
Human-aligned evaluations are essential for developing models that meet user expectations. autojudge provides a seamless way to automate high-quality evaluations and integrate them into development pipelines.
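For example, one way to wire a judge into a development pipeline is as a regression test that fails whenever the judge rejects a model response. The sketch below assumes pytest and an autojudge built as shown earlier; the module name, test layout, and example pair are all illustrative.
# test_responses.py: gating model outputs in CI with pytest (illustrative).
import pytest

# Hypothetical module that builds the judge via AutoJudge.from_dataset(...);
# replace with wherever your judge is constructed.
from my_eval_judges import autojudge

EXAMPLES = [
    (
        "What are the top attractions in New York City?",
        "Some top attractions in NYC include the Statue of Liberty and Central Park.",
    ),
]

@pytest.mark.parametrize("prompt, response", EXAMPLES)
def test_response_passes_judge(prompt, response):
    judgment = autojudge.judge(input=prompt, output=response)
    assert judgment.score, f"Judge rejected the response: {judgment.reasoning}"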
## How judges and autojudge Work Together
Together, judges and autojudge provide a comprehensive framework for bootstrapping AI evaluation:
- judges provides a research-backed foundation of ready-to-use LLM-as-a-judge evaluators.
- autojudge automates the creation of new human-aligned evaluators, enabling scalable and consistent assessments across diverse tasks.
This combination helps AI developers and researchers quickly kick-off their evaluation process and scale it for real-world applications.
Getting Started
Ready? Here’s how to begin:
- Explore judges: Visit the GitHub repository to learn more about using LLM-as-a-judge evaluators.
- Experiment with autojudge: Use autojudge to create scalable, human-aligned evaluations that fit your workflow.
- Join the community: Have LLM-as-a-judge evaluators you’d like to contribute? Consider making a pull request and helping expand our collection of evaluation tools.