Load data to pandas
Evidently currently works only with datasets in Pandas DataFrame format, and a dataset must fit into memory to be processed correctly.
This tutorial shows how to load and sample data from other data sources into a Pandas DataFrame for further analysis with Evidently.
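As a quick check of the memory requirement, the sketch below measures how much memory a DataFrame occupies. The helper name dataframe_size_mb and the tiny sample_df are illustrative assumptions, not part of Evidently.

```python
import pandas as pd


def dataframe_size_mb(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    # deep=True also counts the contents of object columns such as strings
    return df.memory_usage(deep=True).sum() / 1024 ** 2


# Tiny illustrative DataFrame; replace it with your own sample
sample_df = pd.DataFrame({"text": ["first text", "second text"], "label": [0, 1]})
print(f"{dataframe_size_mb(sample_df):.4f} MB")
```

If the reported size of a sample is close to the available memory, reduce the sample size before passing the data to Evidently.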
Tensorflow Datasets
The tensorflow_datasets library can convert a Tensorflow Dataset to a Pandas DataFrame with the as_dataframe method.
For bigger datasets that do not fit into memory, use take to sample the data before conversion.
Check that the dataset is shuffled so that the sample is representative.
import tensorflow_datasets as tfds
MAXIMUM_DATASET_SIZE = 10000 # set up the maximum number of lines in your sample
# tensorflow_ds is a shuffled Tensorflow Dataset
pandas_df = tfds.as_dataframe(tensorflow_ds.take(MAXIMUM_DATASET_SIZE))
Note that the as_dataframe method loads everything into memory, so run it on a sample of your dataset to keep the size under control.
Pytorch Datapipes
To sample data from Pytorch Datapipes, shuffle the datapipe first with shuffle() and take the first batch of the chosen size.
This sample can then be converted to a Pandas DataFrame.
See the example with the AG News dataset:
import pandas as pd
from torchdata.datapipes.iter import HttpReader
MAXIMUM_DATASET_SIZE = 10000 # set up the maximum number of lines in your sample
# Load data to Pytorch Datapipe
URL = "https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv"
ag_news_train = HttpReader([URL]).parse_csv().map(lambda t: (int(t[0]), " ".join(t[1:])))
# Shuffle and sample data
batches = ag_news_train.shuffle().batch(MAXIMUM_DATASET_SIZE)
sample = next(iter(batches))
# Load sampled data to Pandas DataFrame
pandas_df = pd.DataFrame({'text': [el[1] for el in sample],
                          'label': [el[0] for el in sample]})
Note that the resulting Pandas DataFrame schema is arbitrary; just make sure to specify the text and target columns with column_mapping later.
PySpark DataFrames
PySpark supports conversion to a Pandas DataFrame with the toPandas() method.
For bigger DataFrames that do not fit into memory, use the sample method to reduce the data before conversion.
You can make sampling return the same result on each run by passing a fixed seed value to the sample method.
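A minimal sketch of this approach is shown below. The helper name spark_sample_to_pandas and its parameters are assumptions made for illustration; spark_df is expected to be an existing pyspark.sql.DataFrame. Because sample returns only approximately the requested fraction of rows, the sketch also caps the result with limit.

```python
MAXIMUM_DATASET_SIZE = 10000  # set up the maximum number of lines in your sample


def spark_sample_to_pandas(spark_df, max_rows, seed=42):
    """Sample up to max_rows rows from a PySpark DataFrame and convert them to pandas.

    Hypothetical helper: spark_df is assumed to be a pyspark.sql.DataFrame.
    """
    total_rows = spark_df.count()
    # Fraction of rows to keep, never above 1.0 and safe for empty DataFrames
    fraction = min(1.0, max_rows / max(total_rows, 1))
    # A fixed seed makes the sample reproducible between runs
    sampled = spark_df.sample(withReplacement=False, fraction=fraction, seed=seed)
    # sample() is approximate, so cap the row count before loading into memory
    return sampled.limit(max_rows).toPandas()


# Usage, assuming spark_df already exists:
# pandas_df = spark_sample_to_pandas(spark_df, MAXIMUM_DATASET_SIZE)
```

Note that count() triggers a full pass over the data; if the total number of rows is already known, the fraction can be passed to sample directly.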
Files in a directory
If your data is organized as separate files, one per text, with folder names corresponding to class labels, like so:
use the following steps to sample data while preserving the balance of classes:
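One possible way to do this is sketched below. It assumes a layout of one subfolder per class label inside a root folder, with one UTF-8 text file per document; the function name sample_text_files, the fraction parameter, and the path in the usage comment are hypothetical. Sampling the same fraction of files from every class keeps the class proportions of the sample close to those of the full dataset.

```python
import random
from pathlib import Path

import pandas as pd


def sample_text_files(root_dir, fraction, seed=42):
    """Sample the same fraction of files from each class folder (hypothetical helper).

    Assumes root_dir/<label>/<file> with one UTF-8 text per file.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    records = []
    # Sort folders and files so the result does not depend on filesystem order
    for class_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        if not files:
            continue
        # Keep at least one file per non-empty class
        n_chosen = max(1, round(len(files) * fraction))
        for file_path in rng.sample(files, min(n_chosen, len(files))):
            records.append({"text": file_path.read_text(encoding="utf-8"),
                            "label": class_dir.name})
    return pd.DataFrame(records, columns=["text", "label"])


# Usage, assuming a hypothetical folder "data/train" with one subfolder per class:
# pandas_df = sample_text_files("data/train", fraction=0.1)
```

Choose the fraction so that the resulting DataFrame fits into memory.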