Classification Performance
TL;DR: You can use the pre-built Reports and Test Suites to analyze the performance of a classification model. The Presets work for binary and multi-class classification, both probabilistic and non-probabilistic.
Report: for visual analysis or metrics export, use the ClassificationPreset.
Test Suite: for pipeline checks, use the MulticlassClassificationTestPreset, BinaryClassificationTopKTestPreset, or BinaryClassificationTestPreset.
These presets help evaluate and test the quality of classification models. You can use them:
1. To monitor the performance of a classification model in production. You can run the test suite as a regular job (e.g., weekly or when you get the labels) to contrast the model performance against the expectation. You can generate visual reports for documentation and sharing with stakeholders.
2. To trigger or decide on the model retraining. You can use the test suite to check if the model performance is below the threshold to initiate a model update.
3. To debug or improve model performance. If you detect a quality drop, you can use the visual report to explore the model errors and underperforming segments. By manipulating the input data frame, you can explore how the model performs on different data segments (e.g., users from a specific region). You can also combine it with the report.
4. To analyze the results of the model test. You can explore the results of an online or offline test and contrast it to the performance in training. You can also use this report to compare the model performance in an A/B test or during a shadow model deployment.
To run performance checks as part of the pipeline, use the Test Suite. To explore and debug, use the Report.
If you want to visually explore the model performance, create a new Report object and include the ClassificationPreset.
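For example, here is a minimal sketch assuming the pre-0.4 Python API (`evidently.report.Report`) and toy pandas DataFrames with `target` and `prediction` columns in place of your own data:

```python
import pandas as pd

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset

# Toy data: true labels and hard class predictions (replace with your own frames)
reference = pd.DataFrame({"target": [0, 1, 1, 0, 1], "prediction": [0, 1, 0, 0, 1]})
current = pd.DataFrame({"target": [1, 0, 1, 1, 0], "prediction": [1, 0, 0, 1, 1]})

# Tell Evidently which columns hold the labels and the model output
column_mapping = ColumnMapping(target="target", prediction="prediction")

report = Report(metrics=[ClassificationPreset()])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)

report.show(mode="inline")  # render in a notebook
# report.save_html("classification_report.html")  # or save to a file
```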
This report evaluates the quality of a classification model.
Can be generated for a single dataset or used to compare it against a reference (e.g., past performance or an alternative model).
Works for binary and multi-class, probabilistic and non-probabilistic classification.
Displays a variety of metrics and plots related to the model performance.
Helps explore regions where the model makes different types of errors.
To run this report, you need to have both target and prediction columns available. Input features are optional. Pass them if you want to explore the relations between features and target.
The tool does not yet work for multi-label classification. It expects a single true label.
To generate a comparative report, you will need two datasets.
You can also run this report for a single dataset, with no comparison performed.
The report includes multiple components. The composition might vary based on problem type (there are more plots in the case of probabilistic classification). All plots are interactive.
Evidently calculates a few standard model quality metrics: Accuracy, Precision, Recall, F1-score, ROC AUC, and LogLoss.
To support the model performance analysis, Evidently also generates interactive visualizations. They help analyze where the model makes mistakes and come up with improvement ideas.
Shows the number of objects of each class.
Visualizes the classification errors and their type.
Shows the model quality metrics for the individual classes. In the case of multi-class problems, it will also include ROC AUC.
A scatter plot of the predicted probabilities shows correct and incorrect predictions for each class.
It serves as a representation of both model accuracy and the quality of its calibration. It also helps visually choose the best probability threshold for each class.
Similar to the view above, it shows the distribution of predicted probabilities.
The ROC curve (receiver operating characteristic curve) shows the true positive rate against the false positive rate at different classification thresholds.
The precision-recall curve shows the trade-off between precision and recall for different classification thresholds.
The table shows possible outcomes for different classification thresholds and prediction coverage. If you have two datasets, the table is generated for both.
Each line in the table defines a case when only top-X% predictions are considered, with a 5% step. It shows the absolute number of predictions (Count) and the probability threshold (Prob) that correspond to this combination.
The table then shows the quality metrics for a given combination. It includes Precision, Recall, the share of True Positives (TP), and False Positives (FP).
This helps explore the quality of the model if you choose to act only on some of the predictions.
In this table, we show a number of plots for each feature. To expand the plots, click on the feature name.
In the tab “ALL”, you can see the distribution of classes against the values of the feature. If you compare the two datasets, it visually shows the changes in the feature distribution and in the relationship between the values of the feature and the target.
For each class, you can see the predicted probabilities alongside the values of the feature.
It visualizes the regions where the model makes errors of each type and reveals the low-performance segments. You can compare the distributions and see if the errors are sensitive to the values of a given feature.
You can get the report output as a JSON or a Python dictionary:
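For example, continuing with the `report` object from the sketch above:

```python
metrics_json = report.json()      # metrics as a JSON string
metrics_dict = report.as_dict()   # metrics as a Python dictionary
# report.save_json("classification_report.json")  # or write straight to a file
```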
You can perform the analysis of relations between features and target only for selected columns.
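A sketch of how this might look, assuming the preset accepts a `columns` argument that limits the per-feature analysis; the column names here are placeholders (check the preset parameters in your Evidently version):

```python
# Assumed parameter: restrict the feature-vs-target plots to selected columns
report = Report(metrics=[ClassificationPreset(columns=["feature_1", "feature_2"])])
report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
```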
If you want to run classification performance checks as part of a pipeline, you can create a Test Suite and use one of the classification presets. There are several presets for different classification tasks: MulticlassClassificationTestPreset for multi-class classification, BinaryClassificationTestPreset for binary classification, and BinaryClassificationTopKTestPreset for binary classification at top-K:
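A minimal sketch, assuming the pre-0.4 Test Suite API and the same `reference`, `current`, and `column_mapping` objects as in the report sketch above:

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import (
    BinaryClassificationTestPreset,
    BinaryClassificationTopKTestPreset,
    MulticlassClassificationTestPreset,
)

# Pick the preset that matches your task; here, binary classification
tests = TestSuite(tests=[BinaryClassificationTestPreset()])
tests.run(reference_data=reference, current_data=current, column_mapping=column_mapping)

tests.show(mode="inline")   # visual summary in a notebook
results = tests.as_dict()   # or inspect pass/fail results programmatically
```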
You can use the test presets to evaluate the quality of a classification model when you have the ground truth labels.
Each preset compares relevant quality metrics for the model type and against the defined expectation.
They also test for target drift to detect a shift in the distribution of classes and/or probabilities, which might indicate emerging concept drift.
For Evidently to generate the test conditions automatically, you should pass the reference dataset (e.g., performance during model validation or a previous period). You can also set the performance expectations manually by passing a custom test condition.
If you neither pass the reference dataset nor set custom test conditions, Evidently will compare the model performance to a dummy model.
You can set custom test conditions.
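For example, here is a sketch that combines individual tests with explicit conditions instead of a preset; the test names come from `evidently.tests`, and the threshold values are purely illustrative:

```python
from evidently.test_suite import TestSuite
from evidently.tests import TestAccuracyScore, TestF1Score, TestRocAuc

# Fail the check if quality drops below the chosen bounds
tests = TestSuite(tests=[
    TestAccuracyScore(gte=0.8),   # accuracy must be >= 0.8
    TestF1Score(gte=0.75),        # F1 must be >= 0.75
    TestRocAuc(gte=0.85),         # ROC AUC must be >= 0.85
])
tests.run(reference_data=None, current_data=current, column_mapping=column_mapping)
```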
Refer to the input data and column mapping documentation to see how to pass model predictions and labels in different cases.
Aggregated visuals in plots. Starting from v0.3.2, all visuals in Evidently Reports are aggregated by default. This helps decrease the load time and report size for larger datasets. If you work with smaller datasets or samples, you can pass an option to generate plots with raw data. You can choose whether to enable it based on the size of your dataset.
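A sketch of how this could look, assuming a `render` option with a `raw_data` flag on the `Report` constructor (an assumption; check the render options available in your Evidently version):

```python
# Assumed render option: generate non-aggregated plots from raw data
raw_report = Report(
    metrics=[ClassificationPreset()],
    options={"render": {"raw_data": True}},
)
raw_report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
```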
You can pass relevant parameters to change the way some of the metrics are calculated, such as the decision threshold or K to evaluate precision@K. See the available parameters.
You can apply a custom color scheme to the report.
If you want to exclude some of the metrics, you can create a custom report by combining the chosen metrics. See the complete list of metrics.
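A sketch of both customization routes, assuming a `probas_threshold` parameter on the preset and individual metric classes named `ClassificationQualityMetric` and `ClassificationConfusionMatrix` (verify both against the metrics list for your Evidently version):

```python
from evidently.report import Report
from evidently.metric_preset import ClassificationPreset
from evidently.metrics import ClassificationQualityMetric, ClassificationConfusionMatrix

# Preset with a custom decision threshold (assumed parameter name)
report_with_threshold = Report(metrics=[ClassificationPreset(probas_threshold=0.7)])

# Or a custom report that includes only the chosen metrics
custom_report = Report(metrics=[
    ClassificationQualityMetric(),
    ClassificationConfusionMatrix(),
])
custom_report.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
```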
Refer to the test presets table to see the composition of each preset and the default parameters.
You can pass relevant parameters to change how some of the metrics are calculated, such as the classification decision threshold or K to evaluate precision@K. See the available parameters.
If you want to exclude some tests or add additional ones, you can create a custom test suite by combining the chosen tests. See the complete list of tests.
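A sketch along the same lines for the Test Suite, assuming a `probas_threshold` parameter on the preset and individual tests named `TestPrecisionScore` and `TestRecallScore`; the condition values are illustrative:

```python
from evidently.test_suite import TestSuite
from evidently.test_preset import BinaryClassificationTestPreset
from evidently.tests import TestPrecisionScore, TestRecallScore

# Preset with a custom decision threshold (assumed parameter name)
tests_with_threshold = TestSuite(tests=[BinaryClassificationTestPreset(probas_threshold=0.7)])

# Or a custom suite that combines only the chosen tests
custom_tests = TestSuite(tests=[
    TestPrecisionScore(gte=0.7),
    TestRecallScore(gte=0.6),
])
custom_tests.run(reference_data=reference, current_data=current, column_mapping=column_mapping)
```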
Browse the examples section for sample Jupyter notebooks and Colabs.
See the blog post and tutorial where we analyze the performance of two models with identical ROC AUC to choose between the two.