Quickstart - LLM evaluations

LLM evaluation "Hello world."

You are looking at the old Evidently documentation: this API is available with versions 0.6.7 or lower and Evidently Cloud v1. For later versions, check the newer docs.

This quickstart shows how to evaluate text data, such as inputs and outputs from your LLM system.

You will run evals locally in Python and send results to Evidently Cloud for analysis and monitoring.

Need help? Ask on Discord.

1. Set up Evidently Cloud

Set up your Evidently Cloud workspace. For the steps below, you will need an account, an Organization, and an API token.

Now, switch to your Python environment.

2. Installation

Install the Evidently Python library:

!pip install "evidently[llm]<=0.6.7"

Import the components to run the evals:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import *

Import the components to connect with Evidently Cloud:

3. Create a Project

Connect to Evidently Cloud using your API token:

Create a Project within your Organization:
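The import, connection, and Project creation steps can be sketched together as follows. This is a minimal sketch, not the official snippet. It assumes the 0.6.x-era `CloudWorkspace` class, the `https://app.evidently.cloud` endpoint, and an `org_id` keyword on `create_project`; check all three against your installed version. The `read_token` helper and the `EVIDENTLY_API_TOKEN` variable name are hypothetical.

```python
import os

# Assumed Evidently Cloud endpoint; confirm it against your account settings.
EVIDENTLY_CLOUD_URL = "https://app.evidently.cloud"


def read_token(env_var: str = "EVIDENTLY_API_TOKEN") -> str:
    """Return the API token stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(f"Set {env_var} to your Evidently Cloud API token.")
    return token


def create_cloud_project(token: str, org_id: str, name: str, description: str = ""):
    """Connect to Evidently Cloud and create a Project in an Organization."""
    # Imported inside the function so the module loads without Evidently installed.
    from evidently.ui.workspace.cloud import CloudWorkspace

    ws = CloudWorkspace(token=token, url=EVIDENTLY_CLOUD_URL)
    # The org_id keyword is an assumption; older releases may expect team_id.
    project = ws.create_project(name, org_id=org_id)
    project.description = description
    project.save()
    return ws, project
```

A typical call would be `ws, project = create_cloud_project(read_token(), "YOUR_ORG_ID", "My first LLM evals")`. Keeping the token in the environment avoids pasting it into a notebook cell.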

4. Import the toy dataset

Prepare your data as a pandas DataFrame with text and metadata columns. Here’s a toy chatbot dataset with "Questions" and "Answers".
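A toy dataset of this shape could look like the sketch below. The rows and the lowercase column names `question` and `answer` are made up for illustration; swap in your own data.

```python
import pandas as pd

# Hypothetical rows for illustration; replace them with your own data.
data = [
    ["What is the chemical symbol for gold?", "The chemical symbol for gold is Au."],
    ["What is the capital of Japan?", "The capital of Japan is Tokyo."],
    ["Tell me a joke.", "Why don't programmers like nature? It has too many bugs!"],
    ["Can you book a flight for me?", "I'm sorry, but I can't book flights."],
]

# Column names are assumptions chosen for this sketch.
evals = pd.DataFrame(data, columns=["question", "answer"])
print(evals.head())
```

The last answer deliberately contains "sorry", so a word-presence check has at least one positive row to find.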

Collecting live data: use the open-source tracely library to collect the inputs and outputs from your LLM app. Check the Tracing Quickstart. You can then download the traced dataset for evaluation.

5. Run your first eval

You have two options:

  • Run evals that work locally.

  • Use LLM-as-a-judge (requires an OpenAI token).

Define your evals. This example uses checks that run locally. You will evaluate all "Answers" for:

  • Sentiment: from -1 for negative to 1 for positive.

  • Text length: character count.

  • Presence of "sorry" or "apologize": True/False.

Each evaluation is a descriptor. You can choose from multiple built-in evaluations or create custom ones, including LLM-as-a-judge.
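The three checks listed above can be sketched as a Report with one `TextEvals` preset. This assumes the 0.6.x-era API, where `Sentiment`, `TextLength`, and `IncludesWords` live in `evidently.descriptors`, and a text column named `answer`; the function names and the `DENIAL_WORDS` constant are hypothetical. Verify the descriptor names and parameters against your installed version.

```python
# Words whose presence marks an answer as a denial.
DENIAL_WORDS = ["sorry", "apologize"]


def build_text_evals_report(column_name: str = "answer"):
    """Build a Report with three descriptors for one text column."""
    # Imported inside the function so the module loads without Evidently installed.
    from evidently.descriptors import IncludesWords, Sentiment, TextLength
    from evidently.metric_preset import TextEvals
    from evidently.report import Report

    return Report(metrics=[
        TextEvals(column_name=column_name, descriptors=[
            Sentiment(),   # score from -1 (negative) to 1 (positive)
            TextLength(),  # character count
            # True/False: does the text contain any of the listed words?
            IncludesWords(words_list=DENIAL_WORDS, display_name="Denials"),
        ]),
    ])


def run_text_evals(evals, column_name: str = "answer"):
    """Compute the descriptors on a DataFrame and return the Report."""
    report = build_text_evals_report(column_name)
    # No reference dataset is needed for a single-batch evaluation.
    report.run(reference_data=None, current_data=evals)
    return report
```

With a DataFrame at hand, `report = run_text_evals(df)` computes the scores. The `display_name` controls how the descriptor column is labeled in the results.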

6. Send results to Evidently Cloud

Upload the Report and include raw data for detailed analysis:
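A minimal sketch of the upload, assuming a connected workspace object that exposes `add_report` with an `include_data` flag, as in the 0.6.x-era API. The `upload_report` wrapper is hypothetical and only makes the three inputs explicit.

```python
def upload_report(ws, project, report) -> None:
    """Upload a computed Report to a Project, keeping the raw rows.

    `ws` is assumed to be a connected CloudWorkspace, and `project` the
    object returned when the Project was created.
    """
    # include_data=True stores the dataset alongside the summary scores.
    ws.add_report(project.id, report, include_data=True)
```

Without `include_data=True`, only the summary would be sent, and the row-level descriptor columns would not be available to browse.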

View the Report. Go to Evidently Cloud, open your Project, and navigate to "Reports" in the left menu.

You will see the scores summary and the dataset with new descriptor columns. For example, you can sort by the "Denials" column to find all answers that contain "sorry" or "apologize".

7. Get a dashboard

Go to the "Dashboard" tab and enter "Edit" mode. Add a new tab, and select the "Descriptors" template.

You'll see a set of panels that show descriptor values. For now, each panel has a single data point, because you have logged only one Report. As you log ongoing evaluation results, you can track trends and set up alerts.

What's next?

Explore the full tutorial for advanced workflows: custom LLM judges, conditional test suites, monitoring, and more.

Tutorial - LLM Evaluation
