# Input data overview

{% hint style="info" %}
**You are looking at the old Evidently documentation**: this API is available with versions 0.6.7 or lower. Check the newer version [here](https://docs.evidentlyai.com/introduction).
{% endhint %}

To run evaluations on your datasets with the Evidently Python library, you need to prepare your data in a specific format. This section explains the requirements.

{% hint style="success" %}
This applies to `Evidently OSS`, `Evidently Cloud` and `Evidently Enterprise`.
{% endhint %}

{% hint style="info" %}
**Looking for something else?** Check [Tracing](/user-guide/tracing/tracing_overview.md) to instrument your app. Check [Datasets](/user-guide/datasets/datasets_overview.md) to work with datasets in the user interface. To run evaluations after you prepare the data, see [Reports and Test Suites](/user-guide/tests-and-reports/introduction.md).
{% endhint %}

## Input data format

Evidently works with Pandas DataFrames, with some metrics also supported on [Spark](/user-guide/tests-and-reports/spark.md).

Your input data should be in **tabular** format. All column names must be strings. The data can include any numerical, categorical, text, DateTime, and ID columns. You can pass embeddings as numerical features.
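As a minimal sketch of the string column name requirement, the snippet below converts integer column labels to strings before the data is used. The sample values and the column names `age`, `plan`, and `score` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical raw data: building a DataFrame from a list of rows
# produces integer column labels (0, 1, 2).
raw = pd.DataFrame([[25, "basic", 0.1], [41, "premium", 0.7]])

# Convert every column label to a string.
raw.columns = [str(col) for col in raw.columns]

# Optionally give the columns readable names (illustrative names only).
data = raw.rename(columns={"0": "age", "1": "plan", "2": "score"})

# Confirm that all column names are now strings.
all_strings = all(isinstance(col, str) for col in data.columns)
print(list(data.columns), all_strings)
```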

The structure is flexible. For example, you can pass:

* **Any tabular dataset**. You can run checks for data quality and drift for any dataset.
* **Logs of a generative LLM application**. Include text inputs, outputs, and metadata.
* **ML model inferences**. You can analyze prediction logs that include model features (numerical, categorical, embeddings), predictions, and optional target values.

To run certain evaluations, you must include specific columns. For instance, to evaluate classification quality, you need columns with predicted and actual labels. Either name these columns `prediction` and `target`, or point Evidently to the columns that contain them. This process is called **Column Mapping**.
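One way to satisfy the naming convention without any mapping is to rename the columns in pandas. This is only a sketch: the source column names `model_output` and `true_label` and the sample values are hypothetical. The alternative, pointing Evidently to the existing columns, is covered in the Column Mapping section.

```python
import pandas as pd

# Hypothetical prediction log with custom column names.
logs = pd.DataFrame(
    {
        "feature_a": [0.2, 0.8, 0.5],
        "model_output": [0, 1, 1],  # predicted labels
        "true_label": [0, 1, 0],    # actual labels
    }
)

# Rename to the default names described above.
prepared = logs.rename(
    columns={"model_output": "prediction", "true_label": "target"}
)

print(list(prepared.columns))
```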

Learn more in the next section:

{% content-ref url="/pages/kFm7UkwawZ1ULAt8md8k" %}
[Column mapping](/user-guide/input-data/column-mapping.md)
{% endcontent-ref %}

## Reference and current data

Usually, you evaluate a single dataset, which we call the **current** dataset. In some cases, you might also use a second dataset, known as the **reference** dataset. You pass them both when running an evaluation.

![](/files/zNtz5m7gdjdDe5nuPV3Z)

You may need two datasets in the following cases:

* **Side-by-side comparison**. If you want to compare model performance or data quality over two different periods or between model versions, you can do this inside one Report. Pass one dataset as `current`, and another as `reference`.
* **Data drift detection**. To detect distribution shifts, you compare two datasets using methods like distance metrics. You always need two datasets. Use your latest production batch as `current`, and choose a `reference` dataset to compare against, such as your validation data or an earlier production batch.
* **Automatic Test generation**. If you provide a `reference` dataset, Evidently can automatically set up Test conditions, like expected min-max values for specific columns. This way, you don’t have to write each test condition manually.
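The sketch below shows one possible way to derive the two datasets from a single log by splitting on a date. The `timestamp` and `value` columns, the sample values, and the cutoff date are assumptions made for illustration. How you choose the reference period depends on your use case.

```python
import pandas as pd

# Hypothetical production log with a timestamp column.
logs = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-18"]
        ),
        "value": [1.0, 1.2, 3.4, 3.1],
    }
)

# Illustrative cutoff: earlier rows form the reference dataset,
# later rows form the current dataset.
cutoff = pd.Timestamp("2024-02-01")
reference = logs[logs["timestamp"] < cutoff].reset_index(drop=True)
current = logs[logs["timestamp"] >= cutoff].reset_index(drop=True)

print(len(reference), len(current))
```

Both resulting DataFrames keep the same columns, so they can be passed together as `reference` and `current` when you run an evaluation.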

If you pass two datasets, the structure of both datasets should be identical.
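Before running an evaluation, you may want to check the two datasets yourself. The helper below is a hypothetical sketch, not part of Evidently. It assumes that "identical structure" means the same column names with the same pandas dtypes.

```python
import pandas as pd


def structure_mismatches(reference: pd.DataFrame, current: pd.DataFrame) -> list:
    """Return human-readable differences in column names and dtypes."""
    problems = []
    missing = set(reference.columns) - set(current.columns)
    extra = set(current.columns) - set(reference.columns)
    if missing:
        problems.append(f"missing in current: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected in current: {sorted(extra)}")
    # Compare dtypes only for columns present in both datasets.
    for col in reference.columns:
        if col in current.columns and reference[col].dtype != current[col].dtype:
            problems.append(
                f"dtype of '{col}' differs: "
                f"{reference[col].dtype} vs {current[col].dtype}"
            )
    return problems


# Illustrative data: 'age' is stored as integers in one dataset
# and as floats in the other.
reference = pd.DataFrame({"age": [25, 41], "plan": ["basic", "premium"]})
current = pd.DataFrame({"age": [30.5, 22.0], "plan": ["basic", "basic"]})

for problem in structure_mismatches(reference, current):
    print(problem)
```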

## Data volume

Running computationally intensive evaluations on large datasets can take time. This depends on the specific evaluation as well as your infrastructure.

In many cases, such as for probabilistic data drift detection, it’s more efficient to work with **samples** of your data. For instance, instead of running drift detection on millions of rows, you can apply random or stratified sampling and then compare samples of your data.
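Both sampling approaches can be sketched with pandas. The dataset, the `segment` column used for stratification, and the sample sizes below are hypothetical. Choose sizes that suit your data and the evaluation you run.

```python
import pandas as pd

# Hypothetical dataset with an imbalanced categorical column.
data = pd.DataFrame(
    {
        "segment": ["a"] * 800 + ["b"] * 200,
        "value": range(1000),
    }
)

# Random sampling: draw a fixed number of rows.
random_sample = data.sample(n=100, random_state=42)

# Stratified sampling: draw the same fraction from each segment,
# which preserves the original segment proportions.
stratified_sample = data.groupby("segment", group_keys=False).sample(
    frac=0.1, random_state=42
)

print(len(random_sample))
print(stratified_sample["segment"].value_counts().to_dict())
```

Setting `random_state` makes the samples reproducible, which helps when you compare results across runs.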

For datasets that don’t fit in memory, you can run calculations using [Spark](/user-guide/tests-and-reports/spark.md).

