LLM as a judge

How to create and evaluate an LLM judge.

You are looking at the old Evidently documentation: this API is available with versions 0.6.7 or lower. Check the newer version here.

In this tutorial, we'll show you how to build an evaluator for text outputs using another LLM as the judge. This lets you automatically assess the quality of your system's responses based on your custom criteria.

This tutorial shows an open-source workflow that you can run locally using the Evidently Python library. You can also create and run LLM judges on the Evidently Platform using the no-code interface.

We'll explore two ways to use an LLM as a judge:

  • Reference-based. Compare new responses against a reference. This suits regression testing workflows or any case where you have "ground truth" or approved responses to compare against.

  • Open-ended. Evaluate responses based on custom criteria, which helps evaluate new outputs when there's no reference available.

By the end, you'll know how to create custom LLM judges and apply them to your data. Our primary focus will be showing how to develop and tune the evaluator, which you can then apply in different contexts, like regression testing or prompt comparison.

Tutorial scope

Here's what we'll do:

  • Create an evaluation dataset. Create a toy Q&A dataset with two responses to each question, and add manual labels based on the criteria we want the LLM evaluator to follow later.

  • Create and run an LLM as a judge. Design an LLM evaluator prompt to determine whether the new response is correct compared to the reference.

  • Evaluate the judge. Compare the LLM judge's evaluations with manual labels to see if they meet the expectations or need tweaking.

We'll start with the reference-based evaluator, which is more complex because it requires passing two columns to the prompt. Then, we'll create a simpler judge focused on verbosity.

To complete the tutorial, you will need:

  • Basic Python knowledge.

  • An OpenAI API key to use for the LLM evaluator.

Use the provided code snippets or run a sample notebook.

A sample Jupyter notebook is available, and you can also open it in Google Colab.

We recommend running this tutorial in Jupyter Notebook or Google Colab to render rich HTML objects with summary results directly in a notebook cell.

We will work with a toy dataset, which you can replace with your production data.

Installation and Imports

Install Evidently:
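A minimal sketch, assuming you install the package from PyPI with the LLM extra (check the install options for your Evidently version):

```python
# Install Evidently with LLM evaluation support (run in a notebook cell).
!pip install evidently[llm]
```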

Import the required modules:
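A sketch of the imports used in the rest of this tutorial; the module paths follow the pre-0.7 Evidently API described on this page, so adjust them if your version differs:

```python
import pandas as pd

from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
```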

Pass your OpenAI key:
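For example, by setting an environment variable (the value below is a placeholder, not a real key):

```python
import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # placeholder: use your own key
```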

1. Create the Dataset

First, we'll create a toy Q&A dataset that includes:

  • Questions. The inputs our LLM system got.

  • Target responses. The "approved" responses. You can curate these from previous outputs that you consider accurate.

  • New responses. These are the responses generated by your system that we want to evaluate.

To make it more interesting, we created a synthetic dataset with 15 answers to customer support questions. We also manually labeled each new response as correct or incorrect, with brief comments explaining the decision.

Here's how you can create this dataset in one go:

Create the DataFrame
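A sketch of the structure with two illustrative rows instead of the full 15; the column names (question, reference_response, new_response, label, comment) are assumptions chosen to match the placeholders used later in the prompt:

```python
# Illustrative rows only: replace with your own Q&A pairs and manual labels.
data = [
    ["How do I reset my password?",
     "Click 'Forgot password' on the login page and follow the link we email you.",
     "Go to the login page, click 'Forgot password', and follow the emailed link.",
     "correct",
     "Same procedure, different wording."],
    ["Can I get a refund after 30 days?",
     "Refunds are available within 30 days of purchase only.",
     "Yes, refunds are possible at any time on request.",
     "incorrect",
     "Contradicts the refund policy in the reference."],
]

golden_dataset = pd.DataFrame(
    data,
    columns=["question", "reference_response", "new_response", "label", "comment"],
)
```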

To preview it:
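For example:

```python
pd.set_option("display.max_colwidth", None)  # show full text instead of truncated cells
golden_dataset.head()
```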

How do you get the data in practice? You can pick examples from your experiments or production data, focusing on scenarios you want to evaluate. For instance, if you plan to use the LLM evaluator for regression testing, select texts that show different ways a question has been answered, both correctly and incorrectly. You can also use synthetic data.

Why start with manual labels? This process helps you:

  • Refine your criteria. Manually labeling data helps you clarify what you want the LLM judge to detect. It also reveals edge cases so that you can craft more effective evaluator prompts.

  • Evaluate the judge's quality. Manual labels serve as the ground truth. You can then compare the LLM's judgments with these labels to assess its accuracy.

Ultimately, an LLM judge is a small ML system, and it needs its own evals!

Here's the distribution of examples in our small dataset: we have both correct and incorrect responses.

2. Correctness evaluator

Now that we have our labeled dataset, it's time to set up an LLM judge. We'll start with an evaluator that checks if responses are correct compared to the reference. The goal is to match the quality of our manual labels.

We'll use the LLMEval Descriptor to create a custom binary evaluator. Here's how to define the prompt template for correctness:
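A sketch of a correctness template and descriptor. The criteria wording, label names, and model choice are illustrative, and the arguments follow the parameters explained below; check the LLMEval signature in your Evidently version:

```python
correctness_prompt = BinaryClassificationPromptTemplate(
    criteria="""A NEW RESPONSE is "correct" if it conveys the same facts and details
as the REFERENCE RESPONSE, even if it is worded differently.
It is "incorrect" if it contradicts the REFERENCE RESPONSE, adds unsupported claims,
or omits or changes important details. When in doubt, label it "incorrect".

REFERENCE RESPONSE:
=====
{reference_response}
=====""",
    target_category="incorrect",
    non_target_category="correct",
    include_reasoning=True,
    pre_messages=[("system", "You are an expert evaluator comparing a new response to a reference.")],
)

correctness_judge = LLMEval(
    template=correctness_prompt,
    provider="openai",
    model="gpt-4o-mini",  # illustrative model choice
    display_name="Correctness",
    additional_columns={"reference_response": "reference_response"},  # dataset column -> prompt placeholder
)
```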

Explanation:

  • BinaryClassificationPromptTemplate: This template instructs the LLM to classify the input into two classes, explain its reasoning, and format everything neatly. You don't have to worry about asking for these details: they're built into the template.

  • target_category and non_target_category: The labels to assign, "correct" and "incorrect" in our case.

  • criteria: This is where you describe what the LLM should look for when grading the responses.

  • include_reasoning: This asks the LLM to explain its choice.

  • additional_columns: This allows you to include not just the primary column (the "new_response") but also the "reference_response" for comparison. You then reference this column as a placeholder inside the grading criteria.

In this example, we've set up the prompt to be strict: erring on the side of labeling a correct answer as incorrect is preferable to missing a real error. You can write it differently. This flexibility is one of the key benefits of creating a custom judge.

What else is there? Check the docs on the LLM judge feature.

3. Run the evaluation

Now, let's run the evaluation. We'll apply it to the "new_response" column in our dataset and create a report that summarizes how the LLM judged the responses.
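A sketch of running the judge over the "new_response" column, reusing the descriptor defined above:

```python
report = Report(metrics=[
    TextEvals(column_name="new_response", descriptors=[
        correctness_judge,
    ]),
])

report.run(reference_data=None, current_data=golden_dataset)
report  # display the Report in the notebook
```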

This will render an HTML report in the notebook cell. Or, use as_dict() for a Python dictionary output.

But since we're refining our LLM evaluator, we don't want just the label distribution: we want to see what the LLM got right and wrong!

Tracking the evals. When running evaluations in production, upload your results to the Evidently Platform to store them and track them over time.

4. Evaluate the LLM Eval quality

This part is a bit meta: we're going to evaluate the quality of our LLM evaluator itself.

To take a look at the raw outputs:
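For example, by exporting the evaluated dataset from the Report (assuming the datasets() export is available in your Evidently version):

```python
# The current dataset now includes the judge's category and explanation columns.
eval_dataset = report.datasets().current
eval_dataset.head()
```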

This will show a DataFrame with newly added scores and explanations.

Note: your results and explanations will vary since LLMs are non-deterministic.

We can also quantify it! We'll treat this like a classification task to measure how accurately the LLM identifies incorrect responses. We'll look at metrics like precision and recall.

Let's create a DataFrame and map our data for classification: the original manual label is the target, and the LLM-provided response is the prediction.
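A sketch of the setup, reusing the hypothetical eval_dataset from the previous step. The prediction column name ("Correctness category") is an assumption: use whichever descriptor column appears in your exported dataset.

```python
from evidently import ColumnMapping
from evidently.metrics import (
    ClassificationQualityMetric,
    ClassificationClassBalance,
    ClassificationConfusionMatrix,
)

# Manual label = target, LLM verdict = prediction.
column_mapping = ColumnMapping(
    target="label",
    prediction="Correctness category",  # adjust to the actual descriptor column name
    pos_label="incorrect",              # catching incorrect answers is the positive class
)

classification_report = Report(metrics=[
    ClassificationQualityMetric(),
    ClassificationClassBalance(),
    ClassificationConfusionMatrix(),
])

classification_report.run(
    reference_data=None,
    current_data=eval_dataset,
    column_mapping=column_mapping,
)
classification_report
```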

Or use classification_report.as_dict().

Explanation:

  • ClassificationQualityMetric displays precision, recall, accuracy, etc.

  • ClassificationClassBalance shows the distribution of classes (correct vs. incorrect) in the dataset.

  • ClassificationConfusionMatrix illustrates the types of errors.

We got one error of each type, but overall, the results are pretty good! If you want to refine the judge, you can iterate on the prompt and continue improving it.

5. Verbosity evaluator

Next, let's create a simpler LLM judge that evaluates the verbosity of the responses. This judge will check whether the responses are concise and to the point. This only requires evaluating one column with the output.

This is perfect for production evaluations where you don't have a reference answer to compare against.

Here's how to set up the prompt template for verbosity:
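A sketch with illustrative criteria and labels ("concise" vs. "verbose"); adjust the wording to your own definition of acceptable length:

```python
verbosity_prompt = BinaryClassificationPromptTemplate(
    criteria="""A RESPONSE is "concise" if it directly answers the question without
unnecessary detail, repetition, or filler.
A RESPONSE is "verbose" if it is noticeably longer or more elaborate than needed.""",
    target_category="verbose",
    non_target_category="concise",
    include_reasoning=True,
    pre_messages=[("system", "You are an expert evaluator of response conciseness.")],
)

verbosity_judge = LLMEval(
    template=verbosity_prompt,
    provider="openai",
    model="gpt-4o-mini",
    display_name="Verbosity",
)
```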

Run the Report and view the summary results:
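For example, reusing the same dataset and the descriptor sketched above:

```python
verbosity_report = Report(metrics=[
    TextEvals(column_name="new_response", descriptors=[
        verbosity_judge,
    ]),
])

verbosity_report.run(reference_data=None, current_data=golden_dataset)
verbosity_report
```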

Or use as_dict() for a Python dictionary output.

To access the raw results:
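For example, exporting the evaluated dataset in the same way as before:

```python
verbosity_dataset = verbosity_report.datasets().current
```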

Preview:
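Using the hypothetical verbosity_dataset from the previous step:

```python
verbosity_dataset.head()
```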

Don't fully agree with the results? Use these labels as a starting point and correct the decisions where you see fit: now you've got your golden dataset! Next, iterate on your judge prompt.

The LLM judge itself is just one part of your overall evaluation framework. You can now integrate this evaluator into workflows, such as testing your LLM outputs after changing a prompt.
