Machine learning pipeline: Introduction

Introduction to ML pipelines

Over the last few years the field of machine learning has seen great achievements: deep learning concepts, transformers and the widespread usage of GPUs.
Most people rush to get the most performant machine learning model, but lack a good approach and the tools to accelerate, reuse, manage and deploy their developments. Standards are needed.

Machine learning pipelines are processes to accelerate, reuse, manage and deploy machine learning models. Software engineering went through a similar process a decade ago with CI/CD, and data science can learn from it.

Data science and ML teams don't have the luxury of big teams to deploy models, which makes it difficult to build an entire pipeline in house from scratch. It often means that data science projects turn into one-off efforts where performance degrades over time, where the data scientist spends most of their time writing fixes when the underlying data changes, or where the model is never used widely.

We need to outline a process (sketched in code after this list) to:
- Version data effectively and kick off a new model training run
- Efficiently pre-process data for your model training and validation
- Version control your model checkpoints during training
- Track your model training experiments
- Analyse and validate the trained and tuned models
- Deploy the validated model
- Scale the deployed model
- Capture new training data and model performance metrics with feedback loops
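
Taken together, these steps form one cycle. As a rough illustration, here is a minimal Python skeleton in which every step is a stub standing in for real tooling; none of the names refer to any specific library.

def version_data():     print("1. snapshot and hash the new data")
def validate_data():    print("2. check statistics, reject abnormal data")
def preprocess():       print("3. encode labels, tokenise inputs")
def train_and_tune():   print("4. train models, search hyper-parameters")
def analyse():          print("5. evaluate on unseen data, check for bias")
def version_model():    print("6. record model, data and hyper-parameters")
def deploy():           print("7. push the validated model to a model server")
def collect_feedback(): print("8. capture new data and live metrics")

# A real pipeline tool adds caching, triggers and failure handling around
# each stage; here the stages simply run in order.
for step in (version_data, validate_data, preprocess, train_and_tune,
             analyse, version_model, deploy, collect_feedback):
    step()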

Overview of a machine learning pipeline

A machine learning pipeline starts with the collection of new training data and some kind of feedback on how your trained model is performing, and includes data pre-processing as well as model training and analysis. The goal is to automate the whole process and thus minimise human error.

A machine learning pipeline is actually a recurring cycle: data can be continuously collected, so models can be continuously updated. Automation is key, as doing this manually consumes a lot of time.
A model's life cycle contains:

Experiment tracking:

When data scientists optimise machine learning models, they evaluate various model types, model architectures, hyper-parameters and data sets.
Whether you optimise the model manually or tune it automatically, capturing and sharing the results of your optimisation process is essential.
Experiment tracking can also be used to track edge cases, model parameters and iterations.
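
As one concrete way to capture those results, here is a minimal sketch using the open-source MLflow tracking API; MLflow is one tool among many, and all parameter and metric values below are made up for illustration.

import mlflow

# Each run records the configuration that produced it and the resulting
# metrics, so experiments can be compared and shared across the team.
with mlflow.start_run(run_name="baseline-lstm"):
    mlflow.log_param("model_type", "lstm")
    mlflow.log_param("learning_rate", 0.001)
    mlflow.log_param("dataset_version", "a3f9c2e1b7d4")  # made-up snapshot id
    mlflow.log_metric("val_accuracy", 0.91)
    mlflow.log_metric("val_loss", 0.27)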

Data versioning:

This is the beginning of the model life cycle. When new data is available, a snapshot of the data is version controlled and can kick off a new cycle, just as in software versioning, except that we check in model training and validation data.
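
The idea can be sketched in a few lines of Python: identify each dataset snapshot by a content hash, which is roughly what dedicated tools such as DVC do for you with far more robustness. The file path below is a placeholder.

import hashlib

def snapshot_id(path: str) -> str:
    # Hash the file contents in chunks; an unchanged dataset always maps
    # to the same id, and any new data produces a new id.
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha.update(chunk)
    return sha.hexdigest()[:12]

# A new snapshot id signals new data and can kick off a new pipeline cycle:
# snapshot_id("data/train.csv")  # placeholder path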

Data validation:

Before training a new model version, we need to validate the new data: checking for abnormalities using summary statistics, and checking the split between training and validation data.
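
A minimal sketch of such a statistical check, comparing summary statistics of the new data against the previous snapshot; the 10% tolerance is an arbitrary placeholder, and dedicated tools (for example TensorFlow Data Validation) do this far more systematically.

import numpy as np

def validate(new: np.ndarray, reference: np.ndarray, tol: float = 0.10) -> None:
    # Flag the new data if its mean or standard deviation drifts by more
    # than `tol` relative to the reference snapshot.
    for name, fn in (("mean", np.mean), ("std", np.std)):
        old, cur = fn(reference), fn(new)
        if abs(cur - old) > tol * abs(old):
            raise ValueError(f"{name} drifted: {old:.3f} -> {cur:.3f}")

rng = np.random.default_rng(0)
validate(rng.normal(5, 1, 1000), rng.normal(5, 1, 1000))  # passes: same distribution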

Data pre-processing:

Most likely the data cannot be used raw to train the model; some modifications will be required before using it in your training run:
labels may need to be converted to one-hot or multi-hot vectors, and the same goes for model inputs; for NLP you might need tokenisation.
A variety of tools has been created for this. While data scientists prefer to focus on the processing capabilities of their preferred tools, it is important that any modification of the pre-processing steps can be linked to the processed data: if anyone modifies a processing step, the previously processed training data should be invalidated, forcing an update of the whole pipeline.
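
Both conversions can be sketched in a few lines; the whitespace tokeniser below is a toy stand-in for a real tokeniser, and the vocabulary is made up.

import numpy as np

def one_hot(labels: list, num_classes: int) -> np.ndarray:
    # Turn integer class labels into one-hot vectors.
    return np.eye(num_classes, dtype=np.float32)[labels]

def tokenise(text: str, vocab: dict) -> list:
    # Map each word to its vocabulary id; 0 stands for out-of-vocabulary.
    return [vocab.get(word, 0) for word in text.lower().split()]

print(one_hot([0, 2, 1], num_classes=3))
print(tokenise("The model predicts", {"the": 1, "model": 2, "predicts": 3}))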

Model training and tuning:

This is the core of the ML pipeline: we train the model to take inputs and predict an output with the lowest possible error. With large models and data sets this can become difficult to manage, since memory is a finite resource.
Model tuning has seen a great deal of attention because it can yield significant performance improvements. In the previous step we assumed a single training run, but how do we pick the optimal model architecture and hyper-parameters?
With today's DevOps tools we can replicate machine learning models and their training setups effortlessly, which gives us the opportunity to spin up a large number of models in parallel or in sequence.
This tuning can be automated (different hyper-parameters, number of layers), and the choice of parameter values can be based either on a grid search or on a probabilistic approach.
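
A minimal sketch of the grid-search variant using scikit-learn; the model type and parameter grid are arbitrary stand-ins for whatever is being tuned.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # toy dataset
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=3,  # every combination is trained and validated three times
)
search.fit(X, y)
print(search.best_params_, search.best_score_)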

Model analysis:

Once we have determined the optimal set of parameters granting the best performance, we need to analyse the model before deploying it to our environment.
During this step we validate the model against unseen analysis data that is not a subset of the previous training and validation datasets. We expose the model to small variations of the analysis dataset and measure how sensitive the model's predictions are to those variations. At the same time, analysis tools measure whether the model predicts dominantly one label for a subsection of the dataset.
The main argument in favour of a proper analysis step is that bias against a subsection of the data can get lost in the validation process while training the model: the model accuracy against the validation set is usually calculated as an average over the entire dataset, ignoring such bias.
Just like the tuning step, this requires review by a data scientist, though we can automate the rest of the process.
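
Both checks can be sketched as follows; `model` is assumed to be any trained classifier with a scikit-learn-style predict(), and the perturbation size is a placeholder.

import numpy as np

def sensitivity(model, X: np.ndarray, eps: float = 0.01) -> float:
    # Fraction of predictions that flip under a small random perturbation
    # of the inputs; a high value means the model is overly sensitive.
    noisy = X + np.random.default_rng(0).normal(0, eps, X.shape)
    return float(np.mean(model.predict(X) != model.predict(noisy)))

def dominant_label_share(model, X_slice: np.ndarray) -> float:
    # Share of the most frequent predicted label within a data subsection;
    # a value near 1.0 suggests the model predicts dominantly one label there.
    _, counts = np.unique(model.predict(X_slice), return_counts=True)
    return float(counts.max() / counts.sum())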

Model versioning:

The purpose of model versioning is to keep track of which model, set of hyper-parameters and dataset have been selected as the next version of the model.
Semantic versioning in software engineering requires increasing the major version number when you make an incompatible change to your API, and increasing the minor version otherwise. A model release has another degree of freedom: the dataset.
There are situations where you can improve performance without changing the parameters or the architecture, simply by providing more data for training. Does that warrant an increase of the major version?
It depends! But you must document it.
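
One lightweight way to make that documentation explicit is to store a small version record next to the model artefact; a hypothetical sketch, with all values made up.

import json

record = {
    "model_version": "2.1.0",  # bump the major version on incompatible changes
    "architecture": "lstm-2layer",
    "hyper_parameters": {"learning_rate": 0.001, "units": 128},
    "dataset_snapshot": "a3f9c2e1b7d4",  # hash from the data versioning step
    "note": "retrained on an enlarged dataset, architecture unchanged",
}
with open("model_version.json", "w") as f:
    json.dump(record, f, indent=2)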

Model deployment:

Once trained, tuned and analysed, the model is ready for prime time. Sadly, too many models are deployed with one-off implementations, which makes updating them a brittle process.
Several model servers have been open sourced in the last few years which allow efficient deployments. Modern model servers let you deploy your model without writing web app code; they often provide multiple API interfaces such as REST or RPC, and allow you to host multiple versions of the same model simultaneously to run A/B tests.
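
As an example, TensorFlow Serving (one of those open-source model servers) exposes each deployed model version over REST; a sketch of a client call, where the host, model name, version and input values are all placeholders.

import requests

# TensorFlow Serving URL scheme: /v1/models/<name>/versions/<n>:predict
response = requests.post(
    "http://localhost:8501/v1/models/my_model/versions/2:predict",
    json={"instances": [[5.1, 3.5, 1.4, 0.2]]},  # placeholder input
)
print(response.json())  # e.g. {"predictions": [...]}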

Feedback loop:

We need to close the loop and measure the effectiveness and performance of the newly deployed model.
We can capture valuable information about the performance of the model, and capture new training data to grow our datasets, update the model and create a new version.
Besides the two manual reviews, we can automate the whole lifecycle.
Data scientists should focus on the creation of new models, not on updating and maintaining existing ones.
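
A hedged sketch of the capture side of that loop: log each live prediction together with the eventual outcome, so the pairs can be folded into the next training set. The file name and schema are hypothetical.

import csv
from datetime import datetime, timezone

def log_feedback(features: list, prediction: int, outcome: int,
                 path: str = "feedback.csv") -> None:
    # Append one row per served prediction; `outcome` is the ground truth
    # observed later, which turns the row into fresh training data.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), *features, prediction, outcome]
        )

log_feedback([5.1, 3.5, 1.4, 0.2], prediction=1, outcome=1)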
