# Tutorial - LLM Evaluation

{% hint style="info" %}
**You are looking at the old Evidently documentation**: this API is available with versions 0.6.7 or lower. Check the newer version [here](https://docs.evidentlyai.com/introduction).
{% endhint %}

Evaluating the quality of LLM outputs is essential for building a production-grade LLM application. During development, you need to compare quality with different prompts and detect regressions. Once your app is live, you need to ensure outputs are safe and accurate and understand user behavior.

Manually reviewing individual outputs doesn't scale. This tutorial shows you how to automate LLM evaluations from experiments to production.

You will learn both about the evaluation methods and the workflow to run and track them.

{% hint style="success" %}
**Want a very simple example first?** This ["Hello World"](https://docs-old.evidentlyai.com/get-started/hello-world/oss_quickstart_llm) will take a couple minutes.
{% endhint %}

In this tutorial, you will:

* Prepare a toy chatbot dataset
* Evaluate responses using different methods:
  * Text statistics
  * Text patterns
  * Model-based evaluations
  * LLM-as-a-judge
  * Metadata analysis
* Generate visual Reports to explore evaluation results
* Get a monitoring Dashboard to track metrics over time
* Build a custom Test Suite to run conditional checks

You can run this tutorial locally, with the option to use Evidently Cloud for monitoring. You will work with a Q\&A chatbot example, but the methods will apply to other use cases, such as RAGs and agents.

**Requirements:**

* Basic Python knowledge.
* The open-source Evidently Python library.

**Optional**:

* An OpenAI API key (to use LLM-as-a-judge).
* An Evidently Cloud account (for live monitoring).

Let's get started!

To complete the tutorial, use the provided code snippets or run a sample notebook.

Jupyter notebook:

{% embed url="<https://github.com/evidentlyai/evidently/blob/ad71e132d59ac3a84fce6cf27bd50b12b10d9137/examples/sample_notebooks/llm_evaluation_tutorial.ipynb>" %}

You can also follow the video version:

{% embed url="<https://youtu.be/qwn0UqXJptY>" %}

If you're having problems or getting stuck, reach out on [Discord](https://discord.com/invite/xZjKRaNp8b).

## 1. Installation and imports

Install Evidently in your Python environment:

```python
!pip install evidently[llm]
```

Run the imports. To work with toy data:

```python
import pandas as pd
import numpy as np
import requests
from datetime import datetime, timedelta
from io import BytesIO
```

To run the evals:

```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.test_suite import TestSuite
from evidently.metric_preset import TextEvals
from evidently.descriptors import *
from evidently.metrics import *
from evidently.tests import *
from evidently.features.llm_judge import BinaryClassificationPromptTemplate
```

To send results to Evidently Cloud:

```python
from evidently.ui.workspace.cloud import CloudWorkspace
```

**Optional**. To remotely manage the dashboard design in Evidently Cloud:

```python
from evidently.ui.dashboards import DashboardPanelTestSuite
from evidently.ui.dashboards import PanelValue
from evidently.ui.dashboards import ReportFilter
from evidently.ui.dashboards import TestFilter
from evidently.ui.dashboards import TestSuitePanelType
from evidently.renderers.html_widgets import WidgetSize
```

## 2. Prepare a dataset

We'll use a dialogue dataset that imitates a company Q\&A system where employees ask questions about HR, finance, etc. You can download the [example CSV file](https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/chat_df.csv) from source or import it using `requests`:

```python
response = requests.get("https://raw.githubusercontent.com/evidentlyai/evidently/main/examples/how_to_questions/chat_df.csv")
csv_content = BytesIO(response.content)
```

Convert it into the pandas DataFrame. Parse dates and set conversation "start\_time" as index:

```python
assistant_logs = pd.read_csv(csv_content, index_col=0, parse_dates=['start_time', 'end_time'])
assistant_logs.index = assistant_logs.start_time
assistant_logs.index.rename('index', inplace=True)
```

To get a preview:

```python
pd.set_option('display.max_colwidth', None)
assistant_logs.head(3)
```

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-3ad63731e6a1736cf290af0c45f400d18017d23d%2Fllm_data_preview-min.png?alt=media)

{% hint style="info" %}
**How to collect data?**: you can use the open-source `tracely` library to collect the inputs and outputs from your LLM app. Check the [Tracing Quickstart](https://github.com/evidentlyai/docs-old/blob/main/examples/cloud_quickstart_tracing.md). You can then download the traced Dataset for evaluation.
{% endhint %}

{% hint style="success" %}
**How to pass an existing dataset?** You can import a pandas DataFrame with flexible structure. Include any text columns (e.g., inputs and responses), DateTime, and optional metadata like ID, feedback, model type, etc. If you have multi-turn conversations, parse them into a table by session or input-output pairs.
{% endhint %}

## 3. Create a Project

{% hint style="info" %}
**This step is optional**. You can also run the evaluations locally without sending results to the Cloud.
{% endhint %}

To be able to save and share results and get a live monitoring dashboard, create a Project in Evidently Cloud. Here's how to set it up:

* **Sign up**. If you do not have one yet, create a free [Evidently Cloud account](https://app.evidently.cloud/signup) and name your Organization.
* **Get an Organization ID**. Get an ID of your organization on the [organizations page](https://app.evidently.cloud/organizations).
* **Get your API token**. Click the **Key** icon in the left menu to go. Generate and save the token. ([Token page](https://app.evidently.cloud/token)).
* **Connect to Evidently Cloud**. Pass your API key to connect.

```python
ws = CloudWorkspace(token="YOUR_TOKEN", 
                    url="https://app.evidently.cloud")
```

* **Create a Project**. Create a new Project inside your Organization, adding your title and description:

```python
project = ws.create_project("My project title", org_id="YOUR_ORG_ID")
project.description = "My project description"
project.save()
```

## 4. Run evaluations

You will now learn how to apply different methods to evaluate your text data.

* **Text statistics**. Evaluate simple properties like text length.
* **Text patterns**. Detect specific words or regular patterns.
* **Model-based evals**. Use ready-made ML models to score data (e.g., by sentiment).
* **LLM-as-a-judge**. Prompt LLMs to categorize or score texts by custom criteria.
* **Similarity metrics**. Measure semantic similarity between pairs of text.

To view the evaluation results, you will generate visual Reports in your Python environment. In the following sections of the tutorial, you'll also explore other formats like conditional Test Suites and live monitoring Dashboards.

It is recommended to map the data schema to make sure it is parsed correctly.

**Create column mapping**. Identify the type of columns in your data. Pointing to a "datetime" column will also add a time index to the plots.

```python
column_mapping = ColumnMapping(
    datetime='start_time',
    datetime_features=['end_time'],
    text_features=['question', 'response'],
    categorical_features=['organization', 'model_ID', 'region', 'environment', 'feedback'],
)
```

Now, let's run evaluations!

{% hint style="info" %}
**You can skip steps**. Each example below is self-contained, so you can skip any of them or head directly to Step 6 to see the monitoring flow.
{% endhint %}

### Text statistics

Let's run a simple evaluation to understand the basic flow.

**Evaluate text length**. Generate a Report to evaluate the length of texts in the "response" column. Run this check for the first 100 rows in the `assistant_logs` dataframe:

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:100],
                      column_mapping=column_mapping)
text_evals_report
```

This calculates the number of symbols in each text and shows a summary in your notebook cell. (You can also export it in other formats - see step 5).

You can see the distribution of text length across all responses and descriptive statistics like the mean or minimal text length.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-1767d7dc10d038ff87077cb9dd19ede80e05c4aa%2Fllm_tutorial_text_length-min.png?alt=media)

Click on "details" to see how the mean text length changes over time. The index comes from the `datetime` column you mapped earlier. This helps you notice any temporal patterns, such as if texts are longer or shorter during specific periods.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-27144445205e07aed689207a3b15da064c9f5e39%2Fllm_tutorial_text_length_plot-min.png?alt=media)

**Get a side-by-side comparison**. You can also generate statistics for two datasets at once. For example, compare the outputs of two different prompts or data from today against yesterday.

Pass one dataset as `reference` and another as `current`. For simplicity, let's compare the first and next 50 rows from the same dataframe:

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  TextLength(),
                  ]
              )
])

text_evals_report.run(reference_data=assistant_logs[:50],
                      current_data=assistant_logs[50:100],
                      column_mapping=column_mapping)
text_evals_report
```

You will now see the summary results for both datasets:

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-f52f526b83bcd38b512a17ad50d493b00a41dbac%2Fllm_tutorial_side_by_side-min.png?alt=media)

Each evaluation that computes a score for every text in the dataset is called a `descriptor`. Descriptors can be numerical (like the `TextLength()` you just used) or categorical.

Evidently has many built-in descriptors. For example, try other simple statistics like `SentenceCount()` or `WordCount()`. We'll show more complex examples below.

{% hint style="success" %}
**List of all descriptors** See all available descriptors in the "Descriptors" section of [All Metrics](https://docs.evidentlyai.com/reference/all-metrics) table.
{% endhint %}

### Text patterns

You can use regular expressions to identify text patterns. For example, check if the responses mention competitors, named company products, include emails, or specific topical words. These descriptors return a binary score ("True" or "False") for pattern matches.

Let's check if responses contain words related to compensation (such as salary, benefits, or payroll). Pass this word list to the `IncludesWords` descriptor. This will also check for word variants.

Add an optional display name for this eval:

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response",
              descriptors=[
                  IncludesWords(
                      words_list=['salary', 'benefits', 'payroll'],
                      display_name="Mention Compensation")
            ]
        ),
        ]
)

text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:100],
                      column_mapping=column_mapping)
text_evals_report
```

Here is an example result. You can see that 10 responses out of 100 relate to the topic of compensation as defined by this word list. "Details" show occurrences in time.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-33da8a022bfbee16d8df97d63a74235b1138ed19%2Fllm_tutorial_mentions_compensation-min.png?alt=media)

Such pattern evals are fast and cheap to compute at scale. You can try other descriptors like:

* `Contains(items=[])` for non-vocabulary words like competitor names or longer expressions,
* `BeginsWith(prefix="")` for specific starting sequence,
* Custom `RegEx(reg_exp=r"")`, etc.

### Model-based scoring

You can use pre-trained machine learning models to score your texts. Evidently has:

* Built-in model-based descriptors like `Sentiment`.
* Wrappers to call external models published on HuggingFace.

Let's start with a **Sentiment** check. This returns a sentiment score from -1 (very negative) to 1 (very positive).

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            Sentiment(),
        ]
    ),
])

text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:100],
                      column_mapping=column_mapping)
text_evals_report
```

You will see the distribution of response sentiment. Most are positive or neutral, but there are a few chats with a negative sentiment.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-d0674b9ef83bc6654079a8842a59f33306a99232%2Fllm_tutorial_sentiment-min.png?alt=media)

In "details", you can look at specific times when the average sentiment of responses dipped:

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-c7970f2ccf3568fb6f0786c9b694d4770674a1d1%2Fllm_tutorial_sentiment_2-min.png?alt=media)

To review specific responses with sentiment below zero, you can also export the dataset with scores. We'll show this later on.

Let's first see how to use external models from HuggingFace. There are two options:

* **Pre-selected models**, like **Toxicity**. Pass the `HuggingFaceToxicityModel()` descriptor. This [model](https://huggingface.co/spaces/evaluate-measurement/toxicity) returns a predicted toxicity score between 0 to 1.
* **Custom models**, where you specify the model name and output to use. For example, let's call the `SamLowe/roberta-base-go_emotions` [model](https://huggingface.co/SamLowe/roberta-base-go_emotions) using the general `HuggingFaceModel` descriptor. This model classifies text into 28 emotions. If you pick the "neutral" label, the descriptor will return the predicted score from 0 to 1 on whether responses convey neutral emotion.

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            HuggingFaceToxicityModel(),
            HuggingFaceModel(
                model="SamLowe/roberta-base-go_emotions",
                params={"label": "neutral"},
                display_name="Response Neutrality"),
        ]
    ),
])

text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:100],
                      column_mapping=column_mapping)
text_evals_report
```

In each case, the descriptor first downloads the model from HuggingFace to your environment and then uses it to score the data. It takes a few moments to load the model.

**How to interpret the results?** It's typical to use a predicted score above 0.5 as a "positive" label. The toxicity score is near 0 for all responses - nothing to worry about! For neutrality, most responses have predicted scores above the 0.5 threshold, but a few are below. You can review them individually.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-37a876d0d74ad86c1da6b078dbb8daa5d68c2c89%2Fllm_tutorial_neutrality-min.png?alt=media)

{% hint style="info" %}
**Choosing other models**. You can choose other models, e.g. to score texts by topic. See [docs](https://docs-old.evidentlyai.com/user-guide/customization/huggingface_descriptor).
{% endhint %}

### LLM as a judge

For more complex or nuanced checks, you can use LLMs as a judge. This requires creating an evaluation prompt asking LLMs to assess the text by specific criteria, such as tone or conciseness.

{% hint style="info" %}
**This step is optional**. You'll need an OpenAI API key and will incur costs by running the evaluation. Skip if you don't want to use external LLMs.
{% endhint %}

**Pass the OpenAI key**. It is recommended to pass the key as an environment variable. [See Open AI docs](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety) for best practices.

```python
## import os
## os.environ["OPENAI_API_KEY"] = "YOUR KEY"
```

**Run template evals**. Let's start with built-in prompt templates.

* `DeclineLLMEval()` checks if the response contains a denial.
* `PIILLMEval()` checks if the response contains personally identifiable information. You can also ask to provide for a reasoning of the score.

To minimize API calls, we will pass only 10 data rows.

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        DeclineLLMEval(),
        PIILLMEval(include_reasoning=True), 
    ])
])

report.run(reference_data= None,
           current_data= assistant_logs[:10],
           column_mapping=column_mapping)
report 
```

**Create a custom judge**. You can also define your own LLM judge with a custom prompt. To illustrate, let's ask the LLM to judge whether the provided responses are concise and return a `Concise` or `Verbose` label with an explanation. (Or `Unknown` if not sure).

```python
custom_judge = LLMEval(
    subcolumn="category",
    template = BinaryClassificationPromptTemplate(      
        criteria = """Conciseness refers to the quality of being brief and to the point, while still providing all necessary information.
            A concise response should:
            - Provide the necessary information without unnecessary details or repetition.
            - Be brief yet comprehensive enough to address the query.
            - Use simple and direct language to convey the message effectively.
        """,
        target_category="concise",
        non_target_category="verbose",
        uncertainty="unknown",
        include_reasoning=True,
        pre_messages=[("system", "You are a judge which evaluates text.")],
        ),
    provider = "openai",
    model = "gpt-4o-mini",
    display_name="Conciseness",
)
```

Include the `custom_judge` descriptor to the Report:

```python
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        custom_judge
    ])
])

report.run(reference_data= None,
           current_data= assistant_logs[:10],
           column_mapping=column_mapping)
report 
```

All our responses are concise - great! To see the individual scores, you can publish a dataframe (see Step 5), or send the results to Evidently Cloud.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-beaba1309ba323abd8c3bf1510e8b1c76d06edee%2Fllm_tutorial_conciseness.png?alt=media)

{% hint style="info" %}
**How to create your own judge**. You can create custom prompts, and optionally pass the context or reference answer alongside the response. See [docs](https://docs-old.evidentlyai.com/user-guide/customization/llm_as_a_judge)
{% endhint %}

### Metadata summary

Our dataset also includes user upvotes and downvotes in a categorical `feedback` column. You can easily add summaries for any numerical or categorical column to the Report.

To add a summary on the “feedback” column, use `ColumnSummaryMetric()`:

```python
data_report = Report(metrics=[
   ColumnSummaryMetric(column_name="feedback"),
   ]
)

data_report.run(reference_data=None, current_data=assistant_logs[:100], column_mapping=column_mapping)
data_report
```

You will see a distribution of upvotes and downvotes.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-a062ce72a292bb7644c810ec07ea294ac20862ac%2Fllm_feedback_one-min.png?alt=media)

### Semantic Similarity

You can evaluate how closely two texts are in meaning using an embedding model. This descriptor requires you to define two columns. In our example, we can compare Responses and Questions to see if the chatbot answers are semantically relevant to the question.

This descriptor converts all texts into embeddings, measures Cosine Similarity between them, and returns a score from 0 to 1:

* 0 means that texts are opposite in meaning;
* 0.5 means that texts are unrelated;
* 1 means that texts are semantically close.

To compute the Semantic Similarity:

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        SemanticSimilarity(with_column="question", 
                           display_name="Response-Question Similarity"),
    ])
])

text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:100],
                      column_mapping=column_mapping)
text_evals_report
```

In our examples, the semantic similarity always stays above 0.81, which means that answers generally relate to the question.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-981ef19fa4098f763db95cfd474ba537f7622029%2Fllm_tutorial_semantic_similarity-min.png?alt=media)

## 5. Export results

{% hint style="info" %}
**This is optional**. You can proceed without exporting the results.
{% endhint %}

You can export the evaluation results beyond viewing the visual Reports in Python. Here are some options.

**Publish a DataFrame**. Add computed scores (like semantic similarity, or LLM-based scores with an explanation) directly to your original dataset. This will let you further analyze the data, like identifying examples with the lowest scores.

```python
text_evals_report.datasets().current
```

**Python dictionary**. Get summary scores as a dictionary. Use it to export specific values for further pipeline actions:

```python
text_evals_report.as_dict()
```

**JSON**. Export summary scores as JSON:

```python
text_evals_report.json()
```

**HTML**. Save a visual HTML report as a file:

```python
text_evals_report.save_html("report.html")
```

You can also send the results to Evidently Cloud for monitoring!

## 6. Monitor results over time

In this section, you will learn how to monitor evaluations using Evidently Cloud. This allows you to:

* **Track offline experiment results**. Keep records of evaluation scores from different experiments, like comparing output quality using different prompts.
* **Run evaluations in production**. Periodically evaluate batches or samples of production data, such as hourly or daily.

Here's how you can set this up.

**Define the evaluations**. First, let's design a Report. This will specify what you want to evaluate.

Say, you want to compute summaries for metadata columns and evaluate text length, sentiment, and mentions of compensation in chatbot responses.

```python
text_evals_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
            Sentiment(),
            TextLength(),
            IncludesWords(words_list=['salary', 'benefits', 'payroll'],
                          display_name="Mention Compensation")

        ],
    ),
    ColumnSummaryMetric(column_name="feedback"),
    ColumnSummaryMetric(column_name="region"),
    ColumnSummaryMetric(column_name="organization"),
    ColumnSummaryMetric(column_name="model_ID"),
    ColumnSummaryMetric(column_name="environment"),
])
```

You can include more complex checks like LLM-as-a-judge in the same way: just list the corresponding descriptor.

**Run the Report**. Compute the Report for the first 50 rows:

```python
text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[:50],
                      column_mapping=column_mapping)
```

**Upload the results**. Send the Report to the Evidently Cloud Project you created earlier:

```python
ws.add_report(project.id, text_evals_report)
```

**View the Report**. Go to the Project and open the Reports section using the menu on the left.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-329a5b10a9ce7ff327659efeb9a8a34029a81c34%2Fview_report-min.gif?alt=media)

A single Report gives us all the information right there. But as you run more checks, you will want to see how values change over time. Let's imitate a few consecutive runs to evaluate more batches of data.

**Imitate ongoing evaluations**. Run and send several Reports, each time taking the next 50 rows of data. For illustration, we repeat the runs. In practice, you would compute each Report after new experiments or as you get a new batch of production data to evaluate.

Run the Report for the next 50 rows of data:

```python
text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[50:100],
                      column_mapping=column_mapping)
ws.add_report(project.id, text_evals_report)
```

<details>

<summary>And a few more times!</summary>

Run 3:

```python
text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[100:150],
                      column_mapping=column_mapping)
ws.add_report(project.id, text_evals_report)
```

Run 4:

```python
text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[150:200],
                      column_mapping=column_mapping)
ws.add_report(project.id, text_evals_report)
```

Run 5:

```python
text_evals_report.run(reference_data=None,
                      current_data=assistant_logs[200:250],
                      column_mapping=column_mapping)
ws.add_report(project.id, text_evals_report)
```

</details>

Now you will have 5 Reports in the Project. Let's get a dashboard!

**Get a Monitoring Dashboard**. You can start with pre-built templates.

* Go to Project Dashboard.
* Enter the edit mode by clicking on the "Edit" button in the top right corner.
* Choose "Add Tab",
* Add a "Descriptors" Tab and then a "Columns" Tab.
* Use the "Show in Order" toggle above the dashboard to ignore the time gaps.

You will instantly get a dashboard with evaluation results over time.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-90ba6bd5855c98d142a2d0be94ed14627dc558ff%2Fcreate_tabs-min.gif?alt=media)

In the "Descriptors" tab, you will see how the distributions of the text evaluation results. For example, you can notice a dip in mean Sentiment in the fourth evaluation run.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-b9ac021e5442c29c246f1024014d06fb80f0c020%2Fllm_tutorial_sentiment_over_time-min.png?alt=media)

In the "Columns" tab, you can see all the metadata summaries over time. For example, you can notice that all responses in the last run were generated with gpt-3.5.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-3a2ab30a76109be50d85474f87d5d85542274e83%2Fllm_tutorial_modelID_distribution-min.png?alt=media)

You can also add alerting conditions for specific values.

{% hint style="success" %}
**Monitoring Panel types**. In addition to Tabs, you can choose monitoring panels one by one. You can choose panel title, type (bar, line chart), etc. Read more on [available Panels](https://docs.evidentlyai.com/user-guide/monitoring/design_dashboard).
{% endhint %}

## 7. Run conditional tests

So far, you've used Reports to summarize evaluation outcomes. However, you often want to set specific conditions for the metric values. For example, check if all texts fall within the expected length range and review results only if something goes wrong.

This is where you can use an alternative interface called `TestSuites`. It will look like this:

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-23aadf8d56b7d752876cca4486c1cff708677f29%2Fllm_tutorial_test_results-min.png?alt=media)

Test Suites work similarly to `Reports`, but instead of listing `metrics`, you define `tests` and set conditions using parameters like `gt` (greater than), `lt` (less than), `eq` (equal), etc.

**Define a Test Suite**. Let’s create a simple example:

```python
test_suite = TestSuite(tests=[
    TestColumnValueMean(column_name = Sentiment().on("response"), gte=0),
    TestColumnValueMin(column_name = TextLength().on("response"), gt=0),
    TestColumnValueMax(column_name = TextLength().on("response"), lte=2000),
    TestColumnValueMean(column_name = TextLength().on("response"), gt=500),
])
```

This test checks the following conditions:

* Average response sentiment is positive.
* Response length is always non-zero.
* Maximum response length does not exceed 2000 symbols (e.g., due to chat window constraints).
* Mean response length is above 500 symbols (e.g., this is a known pattern).

{% hint style="success" %}
**How to test set test conditions**. [Read more about Tests](https://docs.evidentlyai.com/user-guide/tests-and-reports/custom-test-suite). You can use other descriptors and tests. For example, use `TestCategoryShare` to check if the share of responses labeled "Concise" by the LLM judge is above a certain threshold. You can also automatically generate conditions from a reference dataset (e.g. expect +/- 10% of the reference values).
{% endhint %}

**Compute multiple Test Suites**. Let's simulate running 5 Test Suites sequentially, each on 50 rows of data, with timestamps spaced hourly:

```python
for i in range(5):
    test_suite.run(
        reference_data=None,
        current_data=assistant_logs.iloc[50 * i : 50 * (i + 1), :],
        column_mapping=column_mapping,
        timestamp=datetime.now() + timedelta(hours=i)
    )
    ws.add_test_suite(project.id, test_suite)
```

We use a cycle for demonstration. In production, you would run these checks sequentially.

**Add a test monitoring Panel**. Now, let's add a simple panel to display Test results over time. You can manage dashboards in the UI (like you did before) or programmatically. Let's now explore how to do it from Python.

Load the latest dashboard configuration to Python. If you skip this step, the new Test panels will override the Tabs you added earlier.

Copy the Project ID from above the dashboard:

```python
project = ws.get_project("PROJECT_ID")
```

Next, create a Test panel within the "Tests" tab to display detailed test results:

```python
project.dashboard.add_panel(
    DashboardPanelTestSuite(
        title="Test results",
        filter=ReportFilter(metadata_values={}, tag_values=[], include_test_suites=True),
        size=WidgetSize.FULL,
        panel_type=TestSuitePanelType.DETAILED,
        time_agg="1D",
    ),
    tab="Tests"
)
project.save()
```

**View the test results in time**. Go to the Evidently Cloud dashboard to see the history of all tests. You can notice that a single test failed in the last run. If you hover on the specific test, you can see that we failed the mean text length condition.

**View the individual Test Suite**. To debug, open the latest Test Suite. In "Details," you will see the distribution of text length and the current mean value, which is just slightly below the set threshold.

![](https://256125905-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FeE67gM4508ESQxkbpOxj%2Fuploads%2Fgit-blob-85b32d8d54c257e0248590907ef4df090c2ffe5e%2Fview_test_suites-min.gif?alt=media)

When can you use these Test Suites? Here are two ideas:

* **Regression testing**. Run Test Suites whenever you change prompt or app parameters to compare new responses with references or against set criteria.
* **Continuous testing**. Run Test Suites periodically over production logs to check that the output quality stays within expectations.

You can also set up alerts to get a notification if your Tests contain failures.

{% hint style="success" %}
**What is regression testing?**. Check a separate tutorial on the [regression testing workflow](https://www.evidentlyai.com/blog/llm-testing-tutorial).
{% endhint %}

## What's next?

Here are some of the things you might want to explore next:

* **Explore other Reports**. For example, if your LLM solves a classification or retrieval task, you can evaluate classification or ranking quality. See available [Presets](https://docs.evidentlyai.com/presets), [Metrics](https://docs.evidentlyai.com/reference/all-metrics), and [Tests](https://docs.evidentlyai.com/reference/all-tests) to see other checks you can run.
* **Design the monitoring**. Read more about how to add monitoring panels, configure alerts, or send data in near real-time in the [Monitoring User Guide](https://docs.evidentlyai.com/user-guide/monitoring/monitoring_overview).

Need help? Ask in our [Discord community](https://discord.com/invite/xZjKRaNp8b).
