Text evals with HuggingFace

How to use models available on HuggingFace as text Descriptors.

You are looking at the old Evidently documentation: this API is available with versions 0.6.7 or lower. Check the newer docs version.

Pre-requisites:

  • You know how to generate Reports or Test Suites for text data using Descriptors.

  • You know how to pass custom parameters for Reports or Test Suites.

  • You know how to specify text data in column mapping, as in the sketch below.
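
For reference, here is a minimal sketch of declaring a text column before running an evaluation. The "response" column name and the toy data are illustrative assumptions, not part of this guide.

import pandas as pd

from evidently import ColumnMapping

# Toy data: the "response" column holds raw text to evaluate (illustrative).
eval_data = pd.DataFrame({
    "response": [
        "Thank you, this was very helpful!",
        "I am not sure this answers my question.",
    ]
})

# Declare which columns contain raw text so Descriptors can score them.
column_mapping = ColumnMapping(text_features=["response"])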

You can use an external machine learning model to score text data. This method lets you evaluate texts by any criteria available in the source model, for example, classify them into a set of predefined labels.

The model you use must return a numerical score or a category for each text in a column. You will then be able to view scores, analyze their distribution or run conditional tests through the usual Descriptor interface.

Evidently supports using HuggingFace models: use the general HuggingFaceModel() descriptor to select any suitable model on your own, or use simplified interfaces like HuggingFaceToxicityModel() for specific pre-configured checks.

Code example

You can refer to an end-to-end example with different Descriptors in the example notebook evidently/examples/how_to_questions/how_to_evaluate_llm_with_text_descriptors.ipynb in the Evidently GitHub repository.

To import the Descriptor:

from evidently.descriptors import HuggingFaceModel, HuggingFaceToxicityModel

To get a Report with a Toxicity score for the response column:

from evidently.report import Report
from evidently.metric_preset import TextEvals

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        HuggingFaceToxicityModel(toxic_label="hate"),
    ])
])

To get a Report with several different scores using the general HuggingFaceModel() descriptor:

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        HuggingFaceModel(model="DaNLP/da-electra-hatespeech-detection", display_name="Response Toxicity"),
        HuggingFaceModel(model="SamLowe/roberta-base-go_emotions", params={"label": "disappointment"}, 
                         display_name="Disappointments in Response"), 
        HuggingFaceModel(model="SamLowe/roberta-base-go_emotions", params={"label": "optimism"}, 
                         display_name="Optimism in Response"),     
    ])
])

You can do the same for Test Suites.
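
Below is a minimal sketch of computing the Report above and an equivalent Test Suite check. It assumes the eval_data and column_mapping objects from the column mapping sketch earlier, the descriptor .on() helper described in the Tests and Reports guides, and an illustrative toxicity threshold.

from evidently.test_suite import TestSuite
from evidently.tests import TestColumnValueMean
from evidently.descriptors import HuggingFaceToxicityModel

# Compute the Report on the current data and export or render it.
report.run(reference_data=None, current_data=eval_data, column_mapping=column_mapping)
report.save_html("toxicity_report.html")  # or view `report` directly in a notebook

# A Test Suite with a conditional check on the same descriptor:
# fail if the mean predicted toxicity of "response" exceeds 0.1 (illustrative threshold).
test_suite = TestSuite(tests=[
    TestColumnValueMean(
        column_name=HuggingFaceToxicityModel(toxic_label="hate").on("response"),
        lte=0.1,
    ),
])
test_suite.run(reference_data=None, current_data=eval_data, column_mapping=column_mapping)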

Sample models

Here are some example models you can call using the HuggingFaceModel() descriptor.

Emotion classification: SamLowe/roberta-base-go_emotions

  • Scores texts by 28 emotions.
  • Returns the predicted probability for the chosen emotion label.
  • Scale: 0 to 1.

Required:

  • params={"label": "label"}

Available labels:

  • admiration
  • amusement
  • anger
  • annoyance
  • approval
  • caring
  • confusion
  • curiosity
  • desire
  • disappointment
  • disapproval
  • disgust
  • embarrassment
  • excitement
  • fear
  • gratitude
  • grief
  • joy
  • love
  • nervousness
  • optimism
  • pride
  • realization
  • relief
  • remorse
  • sadness
  • surprise
  • neutral

Optional:

  • display_name="display name"

Example use:

HuggingFaceModel(model="SamLowe/roberta-base-go_emotions", params={"label": "disappointment"})

Toxicity detection: facebook/roberta-hate-speech-dynabench-r4-target

  • Detects hate speech.
  • Returns the predicted probability for the "hate" label.
  • Scale: 0 to 1.

Optional:

  • toxic_label="hate" (default)
  • display_name="display name"

Example use:

HuggingFaceModel(model="facebook/roberta-hate-speech-dynabench-r4-target", display_name="Toxicity")

Zero-shot classification: MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli

  • A natural language inference model.
  • Use it for zero-shot classification by user-provided topics.
  • List candidate topics as labels. You can provide one or several topics.
  • You can set a classification threshold: if the predicted probability is below the threshold, an "unknown" label is assigned.
  • Returns a label.

Required:

  • params={"labels": ["label"]}

Optional:

  • params={"score_threshold": 0.7} (default: 0.5)
  • display_name="display name"

Example use:

HuggingFaceModel(model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli", params={"labels": ["HR", "finance"], "threshold": 0.5}, display_name="Topic")

GPT-2 text detection: openai-community/roberta-base-openai-detector

  • Predicts whether a text is Real or Fake (generated by a GPT-2 model).
  • You can set a classification threshold: if the predicted probability is below the threshold, an "unknown" label is assigned.
  • Note that it is not usable as a detector for more advanced models like ChatGPT.
  • Returns a label.

Optional:

  • params={"score_threshold": 0.7} (default: 0.5)
  • display_name="display name"

Example use:

HuggingFaceModel(model="openai-community/roberta-base-openai-detector", params={"score_threshold": 0.7})

This list is not exhaustive, and the Descriptor may support other models published on Hugging Face. The implemented interface generally works for models that:

  • Output a single number (e.g., predicted score for a label) or a label, not an array of values.

  • Can process raw text input directly.

  • Name labels using label or labels fields.

  • Use methods named predict or predict_proba for scoring.

However, since each model is implemented differently, we cannot provide a complete list of models with a compatible interface. We suggest testing the implementation on your own using trial and error. If you discover useful models, feel free to share them with the community in Discord. You can also open an issue on GitHub to request support for a specific model.
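
For instance, a quick way to check whether an unlisted model works is to score a handful of texts and inspect the raw output before scaling up. A minimal sketch, assuming one of the models from the table above and toy data:

import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TextEvals
from evidently.descriptors import HuggingFaceModel

# A few sample texts to trial the model on (illustrative).
sample = pd.DataFrame({"response": [
    "The delivery was late again.",
    "Great support, thank you!",
]})

trial_report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        HuggingFaceModel(model="SamLowe/roberta-base-go_emotions",
                         params={"label": "annoyance"},
                         display_name="Annoyance (trial)"),
    ])
])
trial_report.run(reference_data=None, current_data=sample,
                 column_mapping=ColumnMapping(text_features=["response"]))

# Inspect the computed values: if per-text scores appear, the model interface is compatible.
print(trial_report.as_dict())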

Which descriptors are there? See the list of available built-in descriptors on the All Metrics reference page.
