OSS Quickstart - LLM evals

Run your first LLM evaluation using Evidently open-source.

You are looking at the old Evidently documentation: this API is available with versions 0.6.7 or lower. Check the newer version of the docs for the current API.

This quickstart shows how to evaluate text data, such as inputs and outputs from your LLM system.

It's best to run this example in Jupyter Notebook or Google Colab so that you can render HTML Reports directly in a notebook cell.

1. Installation

Install the Evidently library.

!pip install evidently[llm]
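
If you install from a regular terminal instead of a notebook, drop the leading ! and quote the package name so that shells like zsh do not expand the square brackets:

pip install "evidently[llm]"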

Import the required modules:

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import *

2. Create a toy dataset

Prepare your data as a pandas DataFrame with any text and metadata columns. Here’s a toy example with chatbot "Questions" and "Answers":

data = [
    ["What's the capital of France?", "The capital of France is Paris."],
    ["Who wrote 1984?", "George Orwell wrote 1984."], 
    ["How does photosynthesis work?", "Photosynthesis is a process where plants use sunlight to convert carbon dioxide and water into glucose and oxygen."],
    ["Can you give me the recipe for making pancakes?", "Sure! Here's a simple recipe: mix 1 cup flour, 1 cup milk, 1 egg, and a pinch of salt. Cook on a hot griddle until golden brown."],
    ["What is the largest planet in our solar system?", "Jupiter is the largest planet in our solar system."],
    ["Tell me a joke.", "Why don't scientists trust atoms? Because they make up everything!"],
    ["Can you translate 'Hello' into Spanish?", "'Hello' in Spanish is 'Hola'."],
    ["What's the code to the universe?", "I'm sorry, I can't provide that information."],
    ["What's the distance between Earth and the Sun?", "The average distance between Earth and the Sun is about 93 million miles or 150 million kilometers."],
    ["How do I fix a flat tire?", "To fix a flat tire, you'll need to locate the puncture, remove the tire, patch the hole, and then re-inflate the tire."]
]

columns = ["question", "answer"]

eval_dataset = pd.DataFrame(data, columns=columns)
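
You can preview the resulting DataFrame to check that the columns look right:

eval_dataset.head()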

Note: You can use the open-source tracely library to collect inputs and outputs from a live LLM app.

3. Run your first eval

Run evaluations for the "Answer" column:

  • Sentiment (from -1 for negative to 1 for positive)

  • Text length (number of symbols)

  • Presence of "sorry" or "apologize" (True/False)

text_evals_report = Report(metrics=[
    TextEvals(column_name="answer", descriptors=[
        Sentiment(),
        TextLength(),
        IncludesWords(words_list=['sorry', 'apologize'], display_name="Denials"),        
        ]
    ),
])

text_evals_report.run(reference_data=None, current_data=eval_dataset)

Each evaluation is a descriptor. You can choose from many built-in evaluations or create custom ones.
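
For example, here are a few more built-in descriptors you could add to the same list. This is a sketch: OOV, SentenceCount, WordCount and RegExp come from evidently.descriptors (already imported above), but double-check the exact names and parameters against your installed version.

extra_descriptors = [
    OOV(),            # share of out-of-vocabulary words
    SentenceCount(),  # number of sentences in the answer
    WordCount(),      # number of words in the answer
    RegExp(reg_exp=r"^I'm sorry", display_name="Starts with apology"),
]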

View the Report in Python to see the distribution of scores:

text_evals_report
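
If you are not working in a notebook, you can save the same Report to an HTML file and open it in a browser:

text_evals_report.save_html("text_evals_report.html")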

You can also export the dataset with added descriptors for each row.

text_evals_report.datasets().current

Or get a dictionary with results:

text_evals_report.as_dict()
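
You can also get the same output as a JSON string:

text_evals_report.json()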

4. Use LLM as a judge (Optional)

To run this, you'll need an OpenAI key. Set it as an environment variable (see the OpenAI docs for best practices on handling keys):

import os
os.environ["OPENAI_API_KEY"] = "YOUR KEY"

Run a Report with the new DeclineLLMEval. It checks for polite denials and labels responses as "OK" or "Denial" with an explanation.

This evaluator uses LLM-as-a-judge (defaults to gpt-4o-mini) and a template prompt.

text_evals_report = Report(metrics=[
    TextEvals(column_name="answer", descriptors=[
        Sentiment(),
        TextLength(),
        DeclineLLMEval(),
        ]
    ),
])

text_evals_report.run(reference_data=None, current_data=eval_dataset)

View the Report in Python:

text_evals_report

View the dataset with scores and explanations:

text_evals_report.datasets().current

Or get a dictionary with results:

text_evals_report.as_dict()

What's next?

Explore the full Tutorial - LLM Evaluation for advanced workflows: custom LLM-as-a-judge, conditional Test Suites, monitoring, and more.

You can also send evaluation results to Evidently Cloud to analyze and track them. See the Quickstart - LLM evaluations.

Need help? Ask in our Discord community.
