Data is the key differentiator between a Machine Learning project and a traditional software project: even if everything else stays stable, changing the data your models are trained on makes a huge difference.
The best tools for tracking changes are the version control systems (VCS) used in software development, such as Git, Mercurial, and Subversion. They keep track of what was changed in a file, when, and by whom, and synchronize changes to a central server so that multiple contributors can manage the same set of files. But these traditional tools aren’t quite sufficient for Machine Learning, because you also need to track the datasets along with the code itself, and often some of the resulting models.
So versioning in Data Science projects can be pretty painful. There are six things you typically want to keep track of:
code
data
configurations
resulting models
performance metrics
environments / dependencies
Running a Data Science project is an iterative process, and you usually don’t want to commit every time you change one parameter or one performance metric moves. Instead, you'll run a variety of experiments and commit once you’re satisfied.
This usually means that during the experimentation process you can lose track of the experiments you ran (e.g. changes to the data or to the dependencies). And when you share your results with your colleagues, they won't have any idea of what you've already tried and will most likely end up redoing a bunch of work. Heck, after a couple of weeks you could end up doing the same yourself.
In this talk I will share some best practices to help you better version your ML project, and I will also show some existing tools such as DVC, nbdime and ReviewNB (to version Jupyter Notebooks).
This talk is aimed at PyData beginners: no specific Machine Learning expertise is required, although some knowledge of Git and the Data Science ecosystem will help you follow along.
2. Git Basics
Git is a free and open source distributed version control system.
3. Git Basics
Create a new repository
Create a new directory, open it, and run
git init
to create a new git repository
Checkout a repository
Create a working copy of a local repository by running
git clone /path/to/repository
5. Git Basics
Add, Commit & Push
You can stage changes (add them to the Index) using
git add <filename>
To commit these changes use
git commit -m "Commit message"
To send your changes to your remote repository, execute
git push origin master
6. Git Basics
Branching
Branches are used to develop features isolated from each other.
Create a new branch named "feature_x" and switch to it using
git checkout -b feature_x
and switch back to master
git checkout master
7. Git Basics
Update & Merge
To update your local repository to the newest commit, execute
git pull
To merge another branch into your active branch, use
git merge <branch>
To view the changes you made relative to the index, use
git diff [<path>…]
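The Git commands from the slides above can be chained into a minimal end-to-end sketch. The file names and commit messages below are invented purely for illustration:

```shell
# A scratch walk-through of the Git basics above; runs in a fresh temp dir.
workdir="$(mktemp -d)" && cd "$workdir"
git init -q                                 # create a new repository
git config user.name "Demo User"            # identity needed to commit
git config user.email demo@example.com
echo "print('hello')" > train.py
git add train.py                            # stage the change
git commit -q -m "Add training script"
git checkout -q -b feature_x                # create a branch and switch to it
echo "learning_rate = 0.01" > config.py
git add config.py
git commit -q -m "Add config"
git checkout -q -                           # switch back to the default branch
git merge -q feature_x                      # merge the feature branch
git log --oneline                           # both commits are now on this branch
```

After the merge, the default branch contains both commits, so the feature work made in isolation is now part of the main history.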
17. nbdime to the rescue
nbdime provides “content-aware” diffing and merging of Jupyter notebooks:
it understands the structure of notebook documents.
18. Install
pip install nbdime
Diff notebooks in your terminal with nbdiff:
nbdiff notebook_1.ipynb notebook_2.ipynb
Get a rich web-based rendering of the diff with nbdiff-web:
nbdiff-web notebook_1.ipynb notebook_2.ipynb
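nbdime can also hook into Git itself via its documented `config-git` command, so that a plain `git diff` becomes notebook-aware for `.ipynb` files:

```shell
# Register nbdime as Git's diff (and merge) driver for .ipynb files,
# so `git diff` shows a content-aware notebook diff instead of raw JSON.
pip install nbdime
nbdime config-git --enable --global
```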
23. What about the data?
data.csv: the original dataset
preprocessed_data.csv: rescaled dataset → [0, 1]
preprocessed_data_clean.csv: incorrect data removed
preprocessed_data_clean_1.csv: outliers removed
25. Hangar
version control for tensor data
https://github.com/tensorwerk/hangar-py
A Dataset is a collection of Arraysets:
Arrayset: image              Arrayset: filename   Arrayset: label
[[0,1,0,1], … ,[0,1,1,1]]    "image1.png"         0
[[1,1,1,1], … ,[0,1,1,1]]    "image2.png"
[[0,1,1,1], … ,[1,1,1,0]]    "image3.png"         1
26.
Arbitrary Backend Selection
Each Arrayset is stored in a backend optimised
for data of that particular shape / type / layout.
27. Working with data
>>> from hangar import Repository
>>> import numpy as np
>>> repo = Repository(path='path/to/repository')
>>> repo.init(user_name='Alessia Marcolini', user_email='amarcolini@fbk.eu',
...           remove_old=True)
>>> co = repo.checkout(write=True)
>>> train_images = np.load('mnist_training_images.npy')
>>> co.arraysets.init_arrayset(name='mnist_training_images', prototype=train_images[0])
>>> train_aset = co.arraysets['mnist_training_images']
>>> train_aset['0'] = train_images[0]
>>> train_aset.add(data=train_images[1], name='1')
>>> train_aset[51] = train_images[51]
>>> co.commit('Add mnist dataset')
>>> co.close()
35. Link to video: https://www.youtube.com/watch?v=4h6I9_xeYA4
36. To initialise the dvc repo, run
$ dvc init
Then choose a data remote:
• Local
• AWS S3
• Google Cloud Storage
• Azure Blob Storage
• SSH
• HDFS
• HTTP
Then run
$ dvc remote add -d myremote s3://YOUR_BUCKET_NAME/data
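If you just want to try DVC without cloud credentials, a plain directory on disk also works as a remote (the path and remote name below are arbitrary examples):

```shell
# Use a local directory as the DVC remote instead of a cloud bucket.
dvc remote add -d localremote /tmp/dvc-storage
```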
37. Add data to DVC
$ dvc add data/data.csv      # creates data/data.csv.dvc and adds data/data.csv to .gitignore
$ git add data/data.csv.dvc .gitignore
$ git commit -m "add data"
$ dvc push                   # upload data to the remote
38. Retrieve data
$ rm -f data/data.csv
$ dvc pull                   # restore all data from the remote
or, for a single file:
$ dvc pull data/data.csv.dvc