LLM as a judge
How to create and evaluate an LLM judge.
In this tutorial, we'll show you how to build an evaluator for text outputs using another LLM as the judge. This lets you automatically assess the quality of your system's responses based on your custom criteria.
We'll explore two ways to use an LLM as a judge:
Reference-based. Compare new responses against a reference. This is for regression testing workflows or whenever you have a "ground truth" or approved responses to compare against.
Open-ended. Evaluate responses based on custom criteria, which helps evaluate new outputs when there's no reference available.
By the end, you'll know how to create custom LLM judges and apply them to your data. Our primary focus will be showing how to develop and tune the evaluator, which you can then apply in different contexts, like regression testing or prompt comparison.
Tutorial scope
Here's what we'll do:
Create an evaluation dataset. Create a toy Q&A dataset with two responses to each question, and add manual labels based on the criteria we want the LLM evaluator to follow later.
Create and run an LLM as a judge. Design an LLM evaluator prompt to determine whether the new response is correct compared to the reference.
Evaluate the judge. Compare the LLM judge's evaluations with manual labels to see if they meet the expectations or need tweaking.
We'll start with the reference-based evaluator, which is more complex because it requires passing two columns to the prompt. Then, we'll create a simpler judge focused on verbosity.
To complete the tutorial, you will need:
Basic Python knowledge.
An OpenAI API key to use for the LLM evaluator.
Use the provided code snippets or run a sample notebook.
Jupyter notebook:
Or click to open in Colab.
We recommend running this tutorial in Jupyter Notebook or Google Colab to render rich HTML objects with summary results directly in a notebook cell.
We will work with a toy dataset, which you can replace with your production data.
Installation and Imports
Install Evidently:
!pip install evidently[llm]
Import the required modules:
import pandas as pd
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import *
from evidently.metrics import *
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
Pass your OpenAI key:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY"
1. Create the Dataset
First, we'll create a toy Q&A dataset that includes:
Questions. The inputs our LLM system got.
Target responses. The "approved" responses. You can curate these from previous outputs that you consider accurate.
New responses. These are the responses generated by your system that we want to evaluate.
To make it more interesting, we created a synthetic dataset with 15 answers to customer support questions. We also manually labeled each new response as correct or incorrect, with brief comments explaining the decision.
Here's how you can create this dataset in one go:
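If you are not following along in the notebook, here is a minimal sketch of how the dataset could be assembled. The rows below are illustrative placeholders, not the actual 15 labeled examples; only the column names ("question", "target_response", "new_response", "label") match what the rest of the tutorial assumes.
data = [
    {
        "question": "How do I reset my password?",
        "target_response": "Go to Settings, open the Security tab, and click 'Reset password'. You will receive an email with a reset link.",
        "new_response": "Open Settings, go to the Security tab, and click 'Reset password'. A reset link will be sent to your email.",
        "label": "correct",
        "comment": "Same steps, different wording.",
    },
    {
        "question": "Can I get a refund after 30 days?",
        "target_response": "Refunds are only available within 30 days of purchase.",
        "new_response": "Yes, we offer refunds at any time, no questions asked.",
        "label": "incorrect",
        "comment": "Contradicts the refund policy.",
    },
    # ...add the remaining labeled examples in the same format
]
golden_dataset = pd.DataFrame(data)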
To preview it:
pd.set_option('display.max_colwidth', None)
golden_dataset.head(5)

Why start with manual labels? This process helps you:
Refine your criteria. Manually labeling data helps you clarify what you want the LLM judge to detect. It also reveals edge cases so that you can craft more effective evaluator prompts.
Evaluate the judge's quality. Manual labels serve as the ground truth. You can then compare the LLM's judgments with these labels to assess its accuracy.
Ultimately, an LLM judge is a small ML system, and it needs its own evals!
Here's the distribution of examples in our small dataset: we have both correct and incorrect responses.
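To check the label balance on your own data, you can simply count the values in the label column (assuming it is named "label", as in the dataset above):
golden_dataset["label"].value_counts()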

2. Correctness evaluator
Now that we have our labeled dataset, it's time to set up an LLM judge. We'll start with an evaluator that checks if responses are correct compared to the reference. The goal is to match the quality of our manual labels.
We'll use the LLMEval descriptor to create a custom binary evaluator. Here's how to define the prompt template for correctness:
correctness_eval = LLMEval(
    subcolumn="category",
    additional_columns={"target_response": "target_response"},
    template=BinaryClassificationPromptTemplate(
        criteria="""
An ANSWER is correct when it is the same as the REFERENCE in all facts and details, even if worded differently.
The ANSWER is incorrect if it contradicts the REFERENCE, adds additional claims, or omits or changes details.
REFERENCE:
=====
{target_response}
=====
""",
        target_category="incorrect",
        non_target_category="correct",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are an expert evaluator. You will be given an ANSWER and REFERENCE.")],
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="Correctness",
)
Explanation:
BinaryClassificationPromptTemplate: This template instructs the LLM to classify the input into two classes, explain its reasoning, and format everything neatly. You don't have to worry about asking for these details: they're built into the template.
target_category and non_target_category: The labels we're aiming for, "correct" and "incorrect" in our case.
criteria: This is where you describe what the LLM should look for when grading the responses.
include_reasoning: This asks the LLM to explain its choice.
additional_columns: This allows you to include not just the primary column (the "new_response") but also the "target_response" for comparison. You then add this column name as a placeholder in the grading criteria.
In this example, we've set up the prompt to be strict: it's preferable to err on the side of labeling a correct answer as incorrect rather than letting an incorrect one slip through. You can write it differently. This flexibility is one of the key benefits of creating a custom judge.
3. Run the evaluation
Now, let's run the evaluation. We'll apply it to the "new_response" column in our dataset and create a report that summarizes how the LLM judged the responses.
correctness_report = Report(metrics=[
    TextEvals(column_name="new_response", descriptors=[
        correctness_eval
    ])
])
correctness_report.run(reference_data=None, current_data=golden_dataset)
correctness_report
This will render an HTML report in the notebook cell. Or, use as_dict() for a Python dictionary output.

But since we're refining our LLM evaluator, we don't want just the label distribution: we want to see what the LLM got right and wrong!
4. Evaluate the LLM Eval quality
This part is a bit meta: we're going to evaluate the quality of our LLM evaluator itself.
To take a look at the raw outputs:
correctness_report.datasets().current
This will show a DataFrame with newly added scores and explanations.

Note: your results and explanations will vary since LLMs are non-deterministic.
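To review only the relevant columns, you can slice the resulting DataFrame. The judge's verdict appears in a column named after the descriptor's display name ("Correctness category" here); the other column names below assume the toy dataset from step 1, so check df.columns if yours differ.
df = pd.DataFrame(correctness_report.datasets().current)
df.columns  # confirm which columns the evaluator added
df[["question", "new_response", "label", "Correctness category"]].head()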
We can also quantify it! We'll treat this like a classification task to measure how accurately the LLM identifies incorrect responses. We'll look at metrics like precision and recall.
Let's create a DataFrame and map our data for classification: the original manual label is the target, and the category assigned by the LLM judge is the prediction.
df = pd.DataFrame(correctness_report.datasets().current)
column_mapping = ColumnMapping()
column_mapping.target = 'label'
column_mapping.prediction = 'Correctness category'
column_mapping.pos_label = 'incorrect'
classification_report = Report(metrics=[
    ClassificationQualityMetric(),
    ClassificationClassBalance(),
    ClassificationConfusionMatrix(),
])
classification_report.run(reference_data=None, current_data=df, column_mapping=column_mapping)
classification_report
Or use classification_report.as_dict() for a Python dictionary output.
Explanation:
ClassificationQualityMetric displays precision, recall, accuracy, etc.
ClassificationClassBalance shows the distribution of classes (correct vs. incorrect) in the dataset.
ClassificationConfusionMatrix illustrates the types of errors.
We have one error of each type, but overall, the results are pretty good! If you want to refine the judge, you can iterate on the prompt and continue improving it.
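For a quick tabular view of where the judge and the manual labels disagree, you can also cross-tabulate the two columns directly (the same columns used in the mapping above):
pd.crosstab(df["label"], df["Correctness category"])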

5. Verbosity evaluator
Next, let’s create a simpler LLM judge that evaluates the verbosity of the responses. This judge will check whether the responses are concise and to the point. This only requires evaluating one column with the output.
This is perfect for production evaluations where you don’t have a reference answer to compare against.
Here's how to set up the prompt template for verbosity:
verbosity_eval = LLMEval(
    subcolumn="category",
    template=BinaryClassificationPromptTemplate(
        criteria="""Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
A concise response should:
- Provide the necessary information without unnecessary details or repetition.
- Be brief yet comprehensive enough to address the query.
- Use simple and direct language to convey the message effectively.
""",
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are a judge which evaluates text.")],
    ),
    provider="openai",
    model="gpt-4o-mini",
    display_name="verbosity",
)
Run the Report and view the summary results:
verbosity_report = Report(metrics=[
    TextEvals(column_name="new_response", descriptors=[
        verbosity_eval
    ])
])
verbosity_report.run(reference_data=None, current_data=golden_dataset)
verbosity_report

Or use as_dict() for a Python dictionary output.
To access the raw results:
verbosity_report.datasets().current
Preview:

Don't fully agree with the results? Use these labels as a starting point and correct the decisions where you see fit - now you've got your golden dataset! Next, iterate on your judge prompt.
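For example, you could export the judged responses to a CSV file for manual review and relabeling (a minimal sketch; adjust the filename to your setup):
reviewed = pd.DataFrame(verbosity_report.datasets().current)
reviewed.to_csv("verbosity_labels_to_review.csv", index=False)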
The LLM judge itself is just one part of your overall evaluation framework. You can now integrate this evaluator into workflows, such as testing your LLM outputs after changing a prompt.