Machine learning pipeline: Data validation

Why data validation

In machine learning we are trying to learn patterns from data sets and to generalise from them. This puts our data sets at the center of the process and makes them a sine qua non condition for the success of the machine learning project.
If our goal is to automate our machine learning model updates, validating our data is essential. We check for the following:

  • Check for data anomalies
  • Check that the data schema hasn’t changed
  • Check that the statistics of our new data sets still align with statistics from our previous training data sets

If a failure is detected we can stop the workflow and address the data issue by hand.

TensorFlow Data Validation

TFDV is part of the TensorFlow Extended (TFX) project. It allows us to perform the kind of analysis we discussed above, generating schemas and validating data against those schemas. It also offers visualizations based on Facets, from the Google PAIR project.

It accepts both CSV and TFRecord files, and behind the scenes it distributes the analysis on Apache Beam.

Installation

pip install tensorflow-data-validation

Generating statistics

import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv(
    data_location='reviews_Collectibles_and_Fine_Art.csv',
    delimiter='|')

This will generate the statistics. To display them:

tfdv.visualize_statistics(stats)
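
TFDV can also read TFRecord files directly, as mentioned above. A minimal sketch of the equivalent call, assuming a hypothetical reviews.tfrecord file containing serialized tf.Example records:

stats = tfdv.generate_statistics_from_tfrecord(
    data_location='reviews.tfrecord')  # illustrative file name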

TFDV generates the following statistics for numerical features:

  • The overall count of data records
  • The percentage of missing data records
  • The mean and standard deviation across the feature
  • The minimum and maximum value across the feature
  • The percentage of zero values across the feature

In addition it generates a histogram of the values for each feature.
For categorical features it generates the following:

  • The overall count of data records
  • The percentage of missing data records
  • The number of unique records
  • The average string length of all records of a feature
  • For each category, TFDV determines the sample count for each label and its rank

Generating a Schema from your Data

A data schema describes your data and the format to expect it in. It can be used to validate future data sets, and the generated schema can also be used in the next steps of the pipeline, for example when preprocessing your data sets to convert them to the expected feature types.

schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)
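
Since the schema is reused by later pipeline steps, it is worth persisting it. A minimal sketch using TFDV's text-format helpers, with an illustrative file path:

# Write the inferred schema to disk so later pipeline steps can reuse it
tfdv.write_schema_text(schema, 'schema.pbtxt')  # illustrative path

# Load it back in a later step
schema = tfdv.load_schema_text('schema.pbtxt')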

Validate newer data

Using the previously described schema and its domains, TFDV will parse your new data set and report outliers, missing values, or wrong data.

When trying to productionise a data ingestion pipeline for your models, this validation step is what we try to achieve: we load the statistics of both data sets and compare them, either visually or programmatically. Are there more missing features? Does the new data follow the same schema? Is there an issue with the schema itself?
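
To run the comparison we first compute statistics for the new data set. A minimal sketch, assuming the new data lives in a hypothetical validation_reviews.csv with the same delimiter as before:

# Illustrative file name; point this at your new data set
val_stats = tfdv.generate_statistics_from_csv(
    data_location='validation_reviews.csv',
    delimiter='|')

The new statistics can then be validated against the schema: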

anomalies = tfdv.validate_statistics(statistics=val_stats, schema=schema)
tfdv.display_anomalies(anomalies)
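
In an automated pipeline this result can be used to stop the workflow, as discussed earlier. A minimal sketch, assuming we want to abort the run as soon as any anomaly is reported:

# anomaly_info maps feature names to the anomalies found for them;
# an empty map means the new data passed validation
if anomalies.anomaly_info:
    raise ValueError(
        'Data validation failed for features: {}'.format(
            list(anomalies.anomaly_info.keys())))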

Data Skew Detection

We speak of data skew when there is a difference between two data sets of different types: for example, when newly collected data does not conform to the same schema, feature details, or data distribution as the training data.

Schema changes can be detected easily with TFDV, and feature skew can be detected with the feature comparison described above.

We can check for a change in the data distribution by comparing serving statistics against your training data; if the difference exceeds the threshold, TFDV highlights it as an anomaly.

# The skew comparator uses the L-infinity distance between the training and serving distributions of the feature
tfdv.get_feature(schema, 'Sex').skew_comparator.infinity_norm.threshold = 0.000000001
skew_anomalies = tfdv.validate_statistics(
        statistics=stats, schema=schema, serving_statistics=test_stats)
tfdv.display_anomalies(skew_anomalies)

Data Drift Detection

Data drift compares two data sets of the same type, for example collected on two different days, to see whether the data has drifted away from the original data set.

# The drift comparator also uses the L-infinity distance, here against the previous data set's statistics
tfdv.get_feature(schema, 'Sex').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(
        statistics=stats, schema=schema, previous_statistics=test_stats)
tfdv.display_anomalies(drift_anomalies)
