Machine learning pipeline: Data and model versioning


Versioning our dataset is one of the first and most crucial steps in any machine learning pipeline. Version control is a standard procedure in software engineering: it lets us keep track of changes over time and share code safely between developers. The same need has arisen in machine learning projects; for more background, see https://codegonewild.net/2020/01/11/machine-learning-pipeline:-introduction/

The tool: DVC

DVC is a tool that doesn't require any dedicated infrastructure or additional hardware to enable the versioning of large datasets. It has the following features:

Git-compatible: DVC runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.

Storage agnostic: Use Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or rsync to store data. The list of supported remote storage is constantly expanding.

Reproducible: The single ‘dvc repro’ command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.

Low friction branching: DVC fully supports instantaneous Git branching, even with large files. Branches beautifully reflect the non-linear structure and highly iterative nature of an ML process. Data is not duplicated — one file version can belong to dozens of experiments. Create as many experiments as you want, instantaneously switch back and forth, and save a history of all attempts.

Metric tracking: Metrics are first-class citizens in DVC. DVC includes a command to list all branches, along with metric values, to track the progress or pick the best version (see the example after this list).

ML pipeline framework: DVC has a built-in way to connect ML steps into a DAG and run the full pipeline end-to-end. DVC handles caching of intermediate results and does not run a step again if input data or code are the same.

Language & framework agnostic: No matter which programming language or libraries are in use or how code is structured, reproducibility and pipelines are based on input and output files or directories. Python, R, Julia, Scala Spark, custom binary, Notebooks, flatfiles/TensorFlow, PyTorch, etc. are all supported.

HDFS, Hive & Apache Spark: Include Spark and Hive jobs in the DVC data versioning cycle along with local ML modeling steps or manage Spark and Hive jobs with DVC end-to-end. Drastically decrease a feedback loop by decomposing a heavy cluster job into smaller DVC pipeline steps. Iterate on the steps independently with respect to dependencies.
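
As a quick illustration of the metric tracking mentioned above (we will only generate a metrics.csv file later in this walkthrough), DVC can list metric values across Git branches or tags:

# show metrics for all Git branches
dvc metrics show -a
# show metrics for all Git tags (e.g. v1.0, v2.0)
dvc metrics show -T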

Throughout this article we will try to demonstrate most of these features through a use case.

Our use case

Installation

Please follow the installation document at https://dvc.org/doc/install. For macOS, the simplest option is to use brew:

brew install dvc
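
DVC can also be installed with pip on any platform; if you plan to push data to S3 as we do later, the optional s3 extra pulls in the required dependencies:

pip install "dvc[s3]"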

We will be using the official DVC example repository for this tutorial.

Get the dataset and the model

We have a dataset and a model that we want to track. First, clone the repository and fetch the data:

git clone https://github.com/iterative/example-versioning.git
cd example-versioning
pip install -r requirements.txt
dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip
unzip -q data.zip
rm -f data.zip

The project contains a train.py script (the model) and a data folder (the dataset). Let's look closely at the commands above:
Line 3: installs the requirements for running the model
Line 4: dvc is able to download the dataset, given the correct .dvc file in the source repository

dvc add data
python train.py
dvc add model.h5
git add .gitignore model.h5.dvc data.dvc metrics.csv
git commit -m "First model, trained with 1000 images"
git tag -a "v1.0" -m "model v1.0, 1000 images"

Next step, we capture and version the actual state of the data:
Line 1: we version the data folder containing the training set
Line 2: we train the model and get model.h5 (and other files) as outputs
Line 3: we track the trained model with DVC
Line 4: we add all the outputs to git; note that the .gitignore is updated: it ignores the actual file that was added to DVC (model.h5 in this case)
Line 6: we tag this version so we can switch back to it later, and then push
More information about dvc add: https://dvc.org/doc/command-reference/add
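
For reference, the data.dvc file created by dvc add is a small YAML file that points to the real content in DVC's cache; the hash shown below is only a placeholder, the actual value will differ:

cat data.dvc
# outs:
# - md5: <md5-of-the-data-directory>.dir
#   path: data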

Configuring DVC for the remote

export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXX
export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXX
dvc remote add -d myremote s3://XXXXXXXXXXXXXXXXX
dvc push -v

To communicate with a remote object storage that supports an S3-compatible API, you must explicitly set the endpointurl in the configuration.
DVC expects AWS-style environment variables for the credentials in order to work.
Once the configuration is set up, we can push.
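
For example, assuming an S3-compatible server reachable at a placeholder URL, the endpoint is configured with dvc remote modify:

dvc remote modify myremote endpointurl https://object-storage.example.com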

Add a second dataset and model, and switch between versions

dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
unzip -q new-labels.zip
rm -f new-labels.zip
dvc add data
python train.py
dvc add model.h5
git add model.h5.dvc data.dvc metrics.csv
git commit -m "Second model, trained with 2000 images"
git tag -a "v2.0" -m "model v2.0, 2000 images"
git push --follow-tags
dvc push -v

Just like for the first version, we retrain, add the data and the model, commit, and push.

git checkout v1.0
dvc checkout
dvc pull

Now this is where the magic happens: we first check out the Git repository at the specific version, which checks out the .dvc files containing the references to our files.

dvc checkout then restores the files from the cache, and in case a file doesn't exist locally we can pull it from the remote with dvc pull. Easy!
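
As a variation, if we only want the old dataset back without touching the rest of the workspace, we can check out a single .dvc file and restore just that target:

git checkout v1.0 data.dvc
dvc checkout data.dvc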

What about the model?

Starting from here we will omit dvc push.
What we saw earlier is suitable for datasets, since a single .dvc file is created for each file or folder we add to track. Running a model, however, might produce multiple outputs, and adding and tracking a .dvc file for each of them would quickly become cumbersome. Instead, we can describe the whole run with dvc run:

dvc run -f Dvcfile \
        -d train.py -d data \
        -M metrics.csv \
        -o model.h5 -o bottleneck_features_train.npy -o bottleneck_features_validation.npy \
        python3 train.py
dvc repro

Let's see what these options mean:
-f: the name of the DVC file to generate
-d: a dependency; we list here all the files that would affect a new run
-M: marks this output as a metric, which enables us to compare its values across Git tags or branches (for example, representing different experiments)
-o: tracks the output (just like dvc add)
This guarantees that we have one single file tracking the whole model.
dvc run gives us the premises of a pipeline system: if the dependencies change and we run dvc repro, it detects the changes and reruns python3 train.py. We can imagine having preprocessing and validation steps that run each time their dependencies change; even better, dvc repro will update the cache and track the new version.
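
As a sketch of such a pipeline (prepare.py and the raw and prepared folders are hypothetical, they are not part of the example repository), a preprocessing stage could be declared the same way, and dvc repro would rerun it only when its dependencies change:

dvc run -f prepare.dvc \
        -d prepare.py -d raw \
        -o prepared \
        python3 prepare.py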

