Data is the key differentiator between a Machine Learning project and a traditional software project: even if everything else stays stable, changing the data your models are trained on makes a huge difference.
The best tools for tracking changes are the version control systems (VCS) used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file, when, and by whom, and synchronize changes to a central server so that multiple contributors can manage the same set of files. But these traditional tools aren’t quite sufficient for Machine Learning, because you also need to track the datasets along with the code itself, and often some of the resulting models.
So versioning in Data Science projects can be pretty painful. There are six things you typically want to keep track of:
code
data
configurations
resulting models
performance metrics
environments / dependencies
Running a Data Science project is an iterative process, and you usually don’t want to commit every time you change one parameter or one performance metric moves. Instead, you'll run a variety of experiments and commit once you’re satisfied.
This usually means that during the experimentation process you can lose track of the experiments you ran (e.g. changes to the data or to the dependencies). And when you share your results with your colleagues, they won't have any idea of what you've already tried and will most likely end up redoing a bunch of work. Heck, after a couple of weeks you could end up doing the same yourself.
In this talk I will share some best practices to help you better version your ML project, and I will also show some existing tools such as DVC, nbdime and ReviewNB (to version Jupyter Notebooks).
This talk is aimed at PyData beginners: no specific Machine Learning expertise is required, although some knowledge of Git and the Data Science ecosystem will help you follow along.
2. Git Basics
Git is a free and open source distributed version control system.
3. Git Basics
Create a new repository
Create a new directory, open it, and run
git init
to create a new git repository
Checkout a repository
Create a working copy of a local repository by running
git clone /path/to/repository
5. Git Basics
Add, Commit & Push
You can stage changes (add them to the Index) using
git add <filename>
To commit these changes use
git commit -m "Commit message"
To send your changes to your remote repository, execute
git push origin master
6. Git Basics
Branching
Branches are used to develop features isolated from each other.
Create a new branch named "feature_x" and switch to it using
git checkout -b feature_x
and switch back to master
git checkout master
7. Git Basics
Update & Merge
To update your local repository to the newest commit, execute
git pull
To merge another branch into your active branch, use
git merge <branch>
To view the changes you made relative to the index, use
git diff [<path>…]
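The Git commands from the slides above can be chained into a minimal end-to-end sketch. The file names and commit messages below are invented purely for illustration:

```shell
# A scratch walk-through of the Git basics above; runs in a fresh temp dir.
workdir="$(mktemp -d)" && cd "$workdir"
git init -q                                 # create a new repository
git config user.name "Demo User"            # identity needed to commit
git config user.email demo@example.com
echo "print('hello')" > train.py
git add train.py                            # stage the change
git commit -q -m "Add training script"
git checkout -q -b feature_x                # create a branch and switch to it
echo "learning_rate = 0.01" > config.py
git add config.py
git commit -q -m "Add config"
git checkout -q -                           # switch back to the default branch
git merge -q feature_x                      # merge the feature branch
git log --oneline                           # both commits are now on this branch
```

After the merge, the default branch contains both commits, so the feature work made in isolation is now part of the main history.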
17. nbdime to the rescue
nbdime provides “content-aware” diffing and merging of Jupyter notebooks:
it understands the structure of notebook documents.
18. Install
pip install nbdime
Diff notebooks in your terminal with nbdiff:
nbdiff notebook_1.ipynb notebook_2.ipynb
Get a rich web-based rendering of the diff with nbdiff-web:
nbdiff-web notebook_1.ipynb notebook_2.ipynb
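nbdime can also hook into Git itself via its documented `config-git` command, so that a plain `git diff` becomes notebook-aware for `.ipynb` files:

```shell
# Register nbdime as Git's diff (and merge) driver for .ipynb files,
# so `git diff` shows a content-aware notebook diff instead of raw JSON.
pip install nbdime
nbdime config-git --enable --global
```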
23. What about the data?
data.csv: the original dataset
preprocessed_data.csv: rescaled dataset → [0, 1]
preprocessed_data_clean.csv: incorrect data removed
preprocessed_data_clean_1.csv: outliers removed
25. Hangar
version control for tensor data
https://github.com/tensorwerk/hangar-py
A Dataset is a collection of Arraysets:
Arrayset: image              Arrayset: filename   Arrayset: label
[[0,1,0,1], … ,[0,1,1,1]]    "image1.png"         0
[[1,1,1,1], … ,[0,1,1,1]]    "image2.png"
[[0,1,1,1], … ,[1,1,1,0]]    "image3.png"         1
26.
Arbitrary Backend Selection
Each Arrayset is stored in a backend optimised
for data of that particular shape / type / layout.
27. Working with data
>>> from hangar import Repository
>>> import numpy as np
>>> repo = Repository(path='path/to/repository')
>>> repo.init(user_name='Alessia Marcolini', user_email='amarcolini@fbk.eu',
...           remove_old=True)
>>> co = repo.checkout(write=True)
>>> train_images = np.load('mnist_training_images.npy')
>>> co.arraysets.init_arrayset(name='mnist_training_images', prototype=train_images[0])
>>> train_aset = co.arraysets['mnist_training_images']
>>> train_aset['0'] = train_images[0]
>>> train_aset.add(data=train_images[1], name='1')
>>> train_aset[51] = train_images[51]
>>> co.commit('Add mnist dataset')
>>> co.close()
35. Link to video: https://www.youtube.com/watch?v=4h6I9_xeYA4
36. To initialise the dvc repo, run
$ dvc init
Then choose a data remote:
• Local
• AWS S3
• Google Cloud Storage
• Azure Blob Storage
• SSH
• HDFS
• HTTP
Then run
$ dvc remote add -d myremote s3://YOUR_BUCKET_NAME/data
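If you just want to try DVC without cloud credentials, a plain directory on disk also works as a remote (the path and remote name below are arbitrary examples):

```shell
# Use a local directory as the DVC remote instead of a cloud bucket.
dvc remote add -d localremote /tmp/dvc-storage
```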
37. Add data to DVC
$ dvc add data/data.csv      # creates data/data.csv.dvc and adds data/data.csv to .gitignore
$ git add data/data.csv.dvc .gitignore
$ git commit -m "add data"
$ dvc push                   # upload data to the remote
38. Retrieve data
$ rm -f data/data.csv
$ dvc pull                   # restore all data from the remote
or, for a single file:
$ dvc pull data/data.csv.dvc