LLM Regression Testing
How to run regression testing for LLM outputs.
In this tutorial, we'll show you how to do regression testing for LLM outputs. You'll learn how to compare new and old responses after changing a prompt, model, or anything else in your system. By re-running the same inputs, you can spot any significant changes. This helps you push updates with confidence or identify issues to fix.
Tutorial scope
Here's what we'll do:
Create a toy dataset. Build a small Q&A dataset with questions and reference responses.
Get new answers. Imitate generating new answers to the same questions, so we can compare them against the reference.
Create and run a Test Suite. Compare the answers using LLM-as-a-judge to evaluate length, correctness and style match.
Build a monitoring Dashboard. Get plots to track the results of Tests over time.
To complete the tutorial, you will need:
Basic Python knowledge.
An OpenAI API key to use for the LLM evaluator.
An Evidently Cloud account to track test results. If you don't have one yet, sign up for a free account.
Use the provided code snippets or run a sample notebook.
Jupyter notebook:
Or click to open in Colab.
1. Installation and Imports
Install Evidently:
Import the required modules:
To connect to Evidently Cloud:
To create monitoring panels as code:
Pass your OpenAI key:
2. Create a Project
Connect to Evidently Cloud. Replace with your actual token:
Create a Project:
3. Prepare the Dataset
Create a dataset with questions and reference answers. We'll later compare the new LLM responses against them:
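A minimal sketch of such a dataset, written with plain Python dictionaries so it runs without extra libraries. The questions, the answers, and the column names (`question`, `target_response`) are illustrative assumptions; in the tutorial flow the same data would sit in a pandas DataFrame.

```python
# Toy reference dataset: each row pairs a question with an approved answer.
# All rows are made up for illustration; the column names are assumptions.
reference_rows = [
    {
        "question": "Why is the sky blue?",
        "target_response": (
            "Air molecules scatter blue sunlight more strongly than red "
            "light, so the sky looks blue."
        ),
    },
    {
        "question": "Why do we have seasons?",
        "target_response": (
            "Earth's axis is tilted, so each hemisphere gets more direct "
            "sunlight at different times of the year."
        ),
    },
    {
        "question": "Why does ice float on water?",
        "target_response": (
            "Water expands when it freezes, so ice is less dense than "
            "liquid water and floats."
        ),
    },
]

# Quick preview: each question plus the length of its reference answer.
for row in reference_rows:
    print(f"{row['question']} ({len(row['target_response'])} chars)")
```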
Get a quick preview:
Here is how the data looks:

You might want to have a quick look at some data statistics to help you set conditions for Tests. Let's check the text length distribution. This will render a summary Report directly in the notebook cell.
If you work in a non-interactive Python environment, call report.as_dict() or report.json() instead.
Here is the distribution of text length:

4. Get new answers
Suppose you generate new responses using your LLM after changing a prompt. We will imitate it by adding a new column with new responses to the DataFrame:
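As a rough sketch of this step without a DataFrame, the new answers can be attached to each row under a `response` key. The rows and the hard-coded answers below are hypothetical stand-ins for real model output.

```python
# Reference rows (illustrative) keyed by the assumed column names.
eval_rows = [
    {"question": "Why is the sky blue?",
     "target_response": "Air molecules scatter blue light more than red light."},
    {"question": "Why does ice float on water?",
     "target_response": "Ice is less dense than liquid water."},
]

# Stand-in for real LLM output: in practice these come from your model.
new_responses = [
    "Blue light is scattered more by the air than red light.",
    "Because ice is heavier than water.",  # deliberately wrong, for the judges to catch
]

# Guard against misaligned data before pairing answers with questions.
if len(new_responses) != len(eval_rows):
    raise ValueError("Expected exactly one new response per question")

# Add the new "column" to every row.
for row, response in zip(eval_rows, new_responses):
    row["response"] = response
```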
Here is the resulting dataset with the added new column:

5. Design the Test suite
To compare new answers with old ones, we need evaluation metrics. You can use deterministic or embeddings-based metrics like SemanticSimilarity. However, you often need more custom criteria. Using LLM-as-a-judge is useful for this, letting you define what to detect.
Let's design our Tests:
Length check. All new responses must be between 80 and 200 symbols.
Correctness. All new responses should give the same answer without contradictions.
Style. All new responses should match the style of the reference.
Text length is easy to check, but for Correctness and Style checks, we'll write our custom LLM judges.
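To make the length check concrete, here is a library-free sketch of the quantity it measures: the share of responses whose character count falls outside the allowed range. The function name and the sample strings are assumptions for illustration, not Evidently's implementation.

```python
def share_out_of_range(texts, left=80, right=200):
    """Return the fraction of texts whose length is outside [left, right]."""
    if not texts:
        return 0.0
    outside = sum(1 for t in texts if not (left <= len(t) <= right))
    return outside / len(texts)

# Four synthetic responses with lengths 50, 120, 200 and 250 characters.
responses = ["x" * 50, "x" * 120, "x" * 200, "x" * 250]
share = share_out_of_range(responses)
print(share)  # 0.5: the 50- and 250-character responses are out of range
```

The bounds are inclusive here, so a response of exactly 80 or 200 characters passes.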
Correctness judge
We implement the correctness evaluator using an Evidently template for binary classification. We ask the LLM to classify each response as correct or incorrect based on the {target_response} column and provide reasoning for its decision.
We recommend splitting each evaluation criterion into separate judges and using a simple grading scale, like binary classifiers, for better reliability.
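The sketch below shows the general shape of a binary judge: a prompt that inserts the reference and the new response, and a parser that reduces the reply to one of two labels. The prompt wording, the function names, and the `UNKNOWN` fallback are assumptions for illustration; this is not the Evidently template, and the actual call to the LLM is left out.

```python
# Hypothetical prompt for a binary correctness judge.
CORRECTNESS_PROMPT = """You are comparing a NEW response to a REFERENCE response.
Label the NEW response CORRECT if it conveys the same facts as the REFERENCE
without contradictions. Otherwise label it INCORRECT.

REFERENCE:
{target_response}

NEW:
{response}

Give one line of reasoning, then a final line with only CORRECT or INCORRECT."""


def build_prompt(response, target_response):
    """Fill the template with one row of data."""
    return CORRECTNESS_PROMPT.format(
        response=response, target_response=target_response
    )


def parse_verdict(llm_reply):
    """Read the label from the last non-empty line of the judge's reply."""
    lines = [ln.strip() for ln in llm_reply.splitlines() if ln.strip()]
    if not lines:
        return "UNKNOWN"
    last = lines[-1].upper()
    return last if last in {"CORRECT", "INCORRECT"} else "UNKNOWN"


prompt = build_prompt(
    response="Ice is heavier than water.",
    target_response="Ice is less dense than liquid water.",
)
# A reply the judge might return for the prompt above (written by hand here).
print(parse_verdict("The new answer contradicts the reference.\nINCORRECT"))
```

Keeping the label set to two values makes the reply easy to validate, which is one practical reason a binary scale tends to be more reliable than a numeric score.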
Style judge
Using a similar approach, we'll create a judge for style. We'll also add clarifications to define what we mean by a style match.
Complete Test Suite
Now, we can create a Test Suite that includes checks for correctness, style matching, and text length.
Choose Tests. We select Evidently column-level Tests like TestCategoryCount and TestShareOfOutRangeValues. (You can pick other Tests, like TestColumnValueMin or TestColumnValueMean.)
Set Parameters and Conditions. Some Tests require parameters: for example, left and right to set the allowed range for Text Length. For Test fail conditions, use parameters like gt (greater than), lt (less than), eq (equal), etc.
Set non-critical Tests. Identify non-critical Tests, like the style match check, to trigger warnings instead of fails. This helps visually separate them on monitoring panels and set alerts only for critical failures.
We reference our two LLM judges, style_eval and correctness_eval, and apply them to the response column in our dataset. For text length, we use the built-in TextLength() descriptor for the same column.
In this example, we expect the share of failures to be zero using the eq=0 condition. You can adjust this, for example with lte=0.1, which means "less than or equal to 10%". The Test would then fail only if more than 10% of rows are out of the set length range.
Allowing some share of Tests to fail is convenient for real-world applications.
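The interplay between conditions and non-critical Tests can be sketched as a tiny runner. The condition names follow the ones used above; the `run_test` helper, its `is_critical` flag, and the sample values are hypothetical and only illustrate the logic, not how Evidently implements it.

```python
import operator

# Map condition names to comparisons against the threshold.
CONDITIONS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def run_test(name, value, is_critical=True, **condition):
    """Check `value` against exactly one condition, such as eq=0 or lte=0.1."""
    if len(condition) != 1:
        raise ValueError("Pass exactly one condition, for example eq=0")
    op_name, threshold = next(iter(condition.items()))
    if op_name not in CONDITIONS:
        raise ValueError(f"Unknown condition: {op_name}")
    if CONDITIONS[op_name](value, threshold):
        status = "SUCCESS"
    else:
        # Non-critical Tests are downgraded from FAIL to WARNING.
        status = "FAIL" if is_critical else "WARNING"
    return {"name": name, "status": status}


# Each value is a share of rows that violate the check (made-up numbers).
results = [
    run_test("length out of range", 0.0, eq=0),
    run_test("incorrect answers", 0.25, eq=0),
    run_test("style mismatch", 0.25, is_critical=False, eq=0),
    run_test("length out of range (relaxed)", 0.1, lte=0.1),
]
for r in results:
    print(r["name"], "->", r["status"])
```

With these inputs the same 25% violation rate produces a failure for the critical correctness check and only a warning for the style check.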
You can add additional Tests as you see fit for regular expressions, word presence, etc. and Tests for other columns in the same Test Suite.
6. Run the Test Suite
Now that our Test Suite is ready, let's run it!
To apply this Test Suite to the eval_data that we prepared earlier:
This computes the Test Suite, but how do you see the results? You can preview them in your Python notebook (call test_suite). However, we'll now send the results to Evidently Cloud along with the scored data:
Including data is optional but useful for most LLM use cases since you'd want to see not just the aggregate Test results but also the raw texts to debug when Tests fail.
To view the results, navigate to the Evidently Platform. Go to the Home Page, enter your Project, and find the "Test Suites" section in the left menu. Here, you'll see the Test Suite, which you can open and explore.
You'll find both the summary Test results and the Dataset with added scores and explanations. You can zoom in on specific evaluations, such as sorting the data by Text Length or finding rows labeled as "incorrect" or "style-mismatched".

Note: your explanations will vary since LLMs are non-deterministic.
7. Test again
Let's say you made yet another change to the prompt. Our reference dataset stays the same, but we generate a new set of answers that we want to compare to this reference.
Here is the toy eval_data_2 to imitate the result of the change.
Now, we can apply the same Test Suite to this data and send it to Evidently Cloud.
If you go and open the new Test Suite results, you can again explore the outcomes and explanations.

8. Get a Dashboard
You can continue running Test Suites in this manner. As you run multiple, you may want to track Test results over time.
You can easily add this to a Dashboard, either in the UI or programmatically. Let's create a couple of Panels using the dashboards-as-code approach.
The following code will add:
A counter panel to show the SUCCESS rate of the latest Test run.
A test monitoring panel to show all Test results over time.
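Conceptually, the two panels aggregate Test statuses per run. The library-free sketch below computes the same two views from hypothetical run names and statuses; it illustrates the aggregation only and does not use the Evidently dashboard API.

```python
from collections import Counter

# Hypothetical Test statuses from two Test Suite runs, oldest first.
runs = [
    {"run": "prompt_v1", "statuses": ["SUCCESS", "FAIL", "WARNING"]},
    {"run": "prompt_v2", "statuses": ["SUCCESS", "SUCCESS", "WARNING"]},
]


def success_rate(statuses):
    """Share of Tests with SUCCESS status; 0.0 for an empty run."""
    if not statuses:
        return 0.0
    return statuses.count("SUCCESS") / len(statuses)


# Counter view: success rate of the latest run.
latest_rate = success_rate(runs[-1]["statuses"])

# Monitoring view: status counts for every run over time.
history = [(r["run"], Counter(r["statuses"])) for r in runs]

print(f"Latest success rate: {latest_rate:.0%}")
for run_name, counts in history:
    print(run_name, dict(counts))
```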
When you navigate to the UI, you will now see a Panel which shows a summary of Test results (Success, Failure, and Warning) for each Test Suite we ran. As you add more Tests to the same Project, the Panels will be automatically updated to show new Test results.

If you hover over individual Test results, you will be able to see the specific Test and its conditions.
What's next? As you design a similar Test Suite for your use case, you can integrate it with CI/CD workflows to run on every change.
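One simple way to wire this into CI is to turn the Test statuses into a process exit code, so the pipeline stops on critical failures and lets warnings through. The helper name and the sample statuses below are assumptions; how you collect statuses from your Test Suite depends on your setup.

```python
def ci_exit_code(statuses):
    """Return 1 if any critical Test failed, otherwise 0.

    WARNING statuses (non-critical Tests) do not block the pipeline.
    """
    return 1 if "FAIL" in statuses else 0


# Hypothetical results from one Test Suite run.
statuses = ["SUCCESS", "WARNING", "SUCCESS"]
exit_code = ci_exit_code(statuses)
print(f"exit code: {exit_code}")

# In a real CI script, finish with: sys.exit(exit_code)
```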