Versioning our data set is one of the first and most crucial steps in any machine learning pipeline. Version control is standard practice in software engineering: it lets us track changes over time and share code safely between developers. The same need has arisen in machine learning projects; for more background, read https://codegonewild.net/2020/01/11/machine-learning-pipeline:-introduction/
The tool: DVC
DVC is a tool that requires no special setup or additional hardware to enable the versioning of large data sets. It has the following features:
Git-compatible: DVC runs on top of any Git repository and is compatible with any standard Git server or provider (GitHub, GitLab, etc). Data file contents can be shared by network-accessible storage or any supported cloud solution. DVC offers all the advantages of a distributed version control system — lock-free, local branching, and versioning.
Storage agnostic: Use Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or rsync to store data. The list of supported remote storage is constantly expanding.
Reproducible: The single ‘dvc repro’ command reproduces experiments end-to-end. DVC guarantees reproducibility by consistently maintaining a combination of input data, configuration, and the code that was initially used to run an experiment.
Low friction branching: DVC fully supports instantaneous Git branching, even with large files. Branches beautifully reflect the non-linear structure and highly iterative nature of an ML process. Data is not duplicated — one file version can belong to dozens of experiments. Create as many experiments as you want, instantaneously switch back and forth, and save a history of all attempts.
Metric tracking: Metrics are first-class citizens in DVC. DVC includes a command to list all branches, along with metric values, to track progress or pick the best version.
ML pipeline framework: DVC has a built-in way to connect ML steps into a DAG and run the full pipeline end-to-end. DVC handles caching of intermediate results and does not run a step again if input data or code are the same.
Language & framework agnostic: No matter which programming language or libraries are in use or how code is structured, reproducibility and pipelines are based on input and output files or directories. Python, R, Julia, Scala Spark, custom binaries, notebooks, flat files, TensorFlow, PyTorch, etc. are all supported.
HDFS, Hive & Apache Spark: Include Spark and Hive jobs in the DVC data versioning cycle along with local ML modeling steps or manage Spark and Hive jobs with DVC end-to-end. Drastically decrease a feedback loop by decomposing a heavy cluster job into smaller DVC pipeline steps. Iterate on the steps independently with respect to dependencies.
In this article we will demonstrate most of these features through a use case.
Our use case
To install DVC, follow the documentation at https://dvc.org/doc/install; on macOS the easiest option is to use brew:
brew install dvc
We will be using the official repository of DVC for this tutorial
Get the dataset and the model
We have a data set and a model that we want to track. First, clone the repository, install the requirements, and download the data set:
```shell
git clone https://github.com/iterative/example-versioning.git
cd example-versioning
pip install -r requirements.txt
dvc get https://github.com/iterative/dataset-registry tutorial/ver/data.zip
unzip -q data.zip
rm -f data.zip
```
The project contains train.py (the model) and a data folder (the data set). Let's look more closely at the commands:
Line 3: installs the requirements needed to run the model
Line 4: dvc downloads the data set, given the correct .dvc file
```shell
dvc add data
python train.py
dvc add model.h5
git add .gitignore model.h5.dvc data.dvc metrics.csv
git commit -m "First model, trained with 1000 images"
git tag -a "v1.0" -m "model v1.0, 1000 images"
```
The commands above capture and version the current state of the data:
Line 1: we version the data folder containing the training set
Line 2: we train the model, producing model.h5 (and other files) as output
Line 4: we add all the outputs to git; note that .gitignore is updated so that git ignores the file that was added to dvc (in this case model.h5)
Line 6: we tag the data set version so we can switch back to it later, and then push
More information about dvc add: https://dvc.org/doc/command-reference/add
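When `dvc add data` runs, it writes a small pointer file, data.dvc, and this pointer file is what git actually tracks. Its shape is roughly the following — here we write an example copy by hand, and the md5 value below is a placeholder, not a real hash:

```shell
# Recreate the rough shape of a .dvc pointer file
# (the md5 value is a placeholder, not a real hash)
cat > data.dvc.example <<'EOF'
outs:
- md5: 0123456789abcdef0123456789abcdef.dir
  path: data
EOF
cat data.dvc.example
```

Because only this tiny stub lives in git, the repository stays lightweight no matter how large the data folder grows.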
Configuring DVC for the remote
```shell
export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXX
export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXX
dvc remote add -d myremote s3://XXXXXXXXXXXXXXXXX
dvc push -v
```
To communicate with a remote object storage that supports an S3-compatible API, you must explicitly set the endpointurl in the configuration. DVC relies on the AWS-style environment variables above for its credentials.
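For an S3-compatible service such as MinIO, the endpoint can be set with `dvc remote modify myremote endpointurl <url>`, which leaves a .dvc/config looking roughly like the sketch below — the bucket name and endpoint URL are hypothetical, and here we just write an example copy by hand:

```shell
# Rough shape of .dvc/config after `dvc remote add` plus
# `dvc remote modify myremote endpointurl ...`
# (bucket name and endpoint URL are hypothetical)
cat > dvc-config.example <<'EOF'
['remote "myremote"']
    url = s3://mybucket/dvcstore
    endpointurl = https://minio.example.com
EOF
```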
Once the configuration is set up, we can push.
Adding a second data set and model, and switching between versions
```shell
dvc get https://github.com/iterative/dataset-registry tutorial/ver/new-labels.zip
unzip -q new-labels.zip
rm -f new-labels.zip
git add model.h5.dvc data.dvc metrics.csv
git commit -m "Second model, trained with 2000 images"
git tag -a "v2.0" -m "model v2.0, 2000 images"
git push --follow-tags
dvc push -v
```
Just as with the first version, we re-add the data and the retrained model before committing, then push.
```shell
git checkout v1.0
dvc checkout
dvc pull
```
Now this is where the magic happens: we first check out the git repository at the desired version, which restores the .dvc files containing the references to our data files.
dvc checkout then restores the files from the local cache; if a file is missing from the cache, dvc pull fetches it from the remote. Easy!
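Under the hood, the cache that dvc checkout reads from is content-addressed: a file is stored under .dvc/cache/ at a path derived from its md5 hash, which is why one file version can serve many experiments without duplication. A minimal sketch of the idea in plain shell (this is an illustration of the concept, not dvc itself):

```shell
# Content-addressed storage sketch: a file lands at
# cache/<first 2 hash chars>/<remaining hash chars>,
# so identical content is only ever stored once
mkdir -p demo/.dvc/cache
echo "some training data" > demo/data.txt
hash=$(md5sum demo/data.txt | cut -c1-32)
mkdir -p "demo/.dvc/cache/${hash:0:2}"
cp demo/data.txt "demo/.dvc/cache/${hash:0:2}/${hash:2}"
```

Switching versions then amounts to linking the right cached copy back into the workspace, which is why dvc checkout is nearly instantaneous.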
What about the model?
From here on, we will omit dvc push.
What we saw earlier is well suited to data sets, since a single .dvc file is created for each file or folder we track. Running a model, however, can produce multiple outputs, and adding and tracking a .dvc file for each of them quickly becomes tedious. Instead, we can describe the whole run in a single stage file:
```shell
dvc run -f Dvcfile \
        -d train.py -d data \
        -M metrics.csv \
        -o model.h5 \
        -o bottleneck_features_train.npy \
        -o bottleneck_features_validation.npy \
        python3 train.py
dvc repro
```
Let’s see what these options mean:
-f: the name of the dvc stage file to write
-d: dependencies; we list here every file that should trigger a new run when it changes
-M: Marking this output as a metric enables us to compare its values across Git tags or branches (for example, representing different experiments).
-o: tracks an output (just like dvc add)
This guarantees that we have one single file tracking all of the model's outputs.
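The Dvcfile produced by dvc run records the command, its dependencies, and its outputs in one place; its shape is roughly the following (hashes omitted, some outputs elided, and the metric flag shown is our assumption of how the -M output appears — here we write an example copy by hand):

```shell
# Rough shape of the Dvcfile generated by `dvc run`
# (hashes omitted, some outputs elided for brevity)
cat > Dvcfile.example <<'EOF'
cmd: python3 train.py
deps:
- path: train.py
- path: data
outs:
- path: model.h5
- path: metrics.csv
  metric: true
EOF
```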
dvc run gives us the beginnings of a pipeline system: if a dependency changes, dvc repro detects the change and re-runs python3 train.py. We can imagine preprocessing and validation steps that run each time their dependencies change; better still, dvc repro updates the cache and tracks the new version.