Version Control for Data Science
Alessia Marcolini
amarcolini@fbk.eu @viperale
Git Basics
Git is a free and open source distributed version control system.
Create a new repository
Create a new directory, open it
and perform a
git init
to create a new git repository
Checkout a repository
Create a working copy
of a local repository
by running the command
git clone /path/to/repository
Workflow
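The local repository consists of three "trees" maintained by git: the working directory, which holds the actual files; the index, which acts as a staging area; and HEAD, which points to the last commit you made.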
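The workflow between the working directory, the index and HEAD can be sketched with a toy model. This is only an illustration of the idea: the class ToyRepo and its methods are hypothetical names, and git stores its data very differently.

```python
class ToyRepo:
    """Toy model of the three trees; not how git stores data internally."""

    def __init__(self):
        self.working_dir = {}  # path -> content of the files you edit
        self.index = {}        # staging area: content queued for the next commit
        self.commits = []      # list of snapshots; HEAD is the last one

    @property
    def head(self):
        """Files recorded in the last commit (empty before the first commit)."""
        return self.commits[-1]["files"] if self.commits else {}

    def add(self, path):
        """Roughly `git add <filename>`: copy a file into the index."""
        self.index[path] = self.working_dir[path]

    def commit(self, message):
        """Roughly `git commit -m`: record the index as a new snapshot."""
        self.commits.append({"message": message, "files": dict(self.index)})

    def unstaged(self):
        """Paths whose working copy differs from the index."""
        return sorted(p for p, c in self.working_dir.items() if self.index.get(p) != c)

    def staged(self):
        """Paths whose index entry differs from HEAD."""
        return sorted(p for p, c in self.index.items() if self.head.get(p) != c)


repo = ToyRepo()
repo.working_dir["train.py"] = "print('v1')"
print(repo.unstaged(), repo.staged())  # ['train.py'] []
repo.add("train.py")
print(repo.unstaged(), repo.staged())  # [] ['train.py']
repo.commit("first version")
print(repo.unstaged(), repo.staged())  # [] []
```

A change first shows up as unstaged, moves to staged after add, and disappears from both lists once it is committed.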
Add, Commit & Push
You can propose changes using
git add <filename>
To commit these changes use
git commit -m "Commit message"
To send your changes to your remote repository, execute
git push origin master
Branching
Branches are used to develop features isolated from each other.
Create a new branch named "feature_x" and switch to it using
git checkout -b feature_x
and switch back to master
git checkout master
Update & Merge
To update your local repository to the newest commit, execute
git pull
To merge another branch into your active branch, use
git merge <branch>
To view the changes you made relative to the index, use
git diff [<path>…]
A Jupyter notebook is nothing more than a JSON file!
Reordering import statements
Changed environment name
Deleted Cell
Moved a cell before
Same code deleted and added
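Why a plain text diff of a notebook gets noisy can be shown with the standard library alone. The sketch below builds a trimmed-down notebook dictionary (the helper names make_notebook and changed_lines are made up for this example) and re-runs a cell without touching its code: the JSON text still changes.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal notebook-like dict (nbformat 4 layout, trimmed for brevity)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": ["print(model.score(X, y))"],
            }
        ],
        "metadata": {"kernelspec": {"name": "python3"}},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def changed_lines(old, new):
    """Return the added/removed lines of a line-based diff of the JSON text."""
    old_text = json.dumps(old, indent=1, sort_keys=True).splitlines()
    new_text = json.dumps(new, indent=1, sort_keys=True).splitlines()
    diff = difflib.unified_diff(old_text, new_text, lineterm="", n=0)
    return [line for line in diff
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]


# Same source code, but the cell was executed again with a different result.
before = make_notebook(execution_count=1, output_text="0.91\n")
after = make_notebook(execution_count=2, output_text="0.93\n")

for line in changed_lines(before, after):
    print(line)
```

Every printed line concerns the execution counter or the stored output; none of them touches the source of the cell.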
nbdime to the rescue
nbdime provides “content-aware”
diffing and merging of Jupyter notebooks.
It understands the structure of
notebook documents.
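What "content-aware" can mean is sketched below: compare the notebooks cell by cell and look only at the source of each cell, so that outputs and execution counters are ignored. This is a simplified illustration built on difflib, not nbdime's actual algorithm; cell_sources, diff_cells and code_cell are hypothetical helper names.

```python
import difflib


def code_cell(source, execution_count=None):
    """Build a minimal code cell (outputs left empty for brevity)."""
    return {"cell_type": "code", "execution_count": execution_count,
            "metadata": {}, "outputs": [], "source": [source]}


def cell_sources(notebook):
    """Return the source of each cell as one string, ignoring outputs and counters."""
    return ["".join(cell["source"]) for cell in notebook["cells"]]


def diff_cells(old, new):
    """List cell-level changes as (operation, old cells, new cells) tuples."""
    a, b = cell_sources(old), cell_sources(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


old = {"cells": [code_cell("import numpy as np\n", 1),
                 code_cell("x = np.arange(3)\n", 2),
                 code_cell("print(x)\n", 3)]}
# Same cells, executed again: only the counters differ.
rerun = {"cells": [code_cell("import numpy as np\n", 7),
                   code_cell("x = np.arange(3)\n", 8),
                   code_cell("print(x)\n", 9)]}
# The middle cell was deleted.
edited = {"cells": [code_cell("import numpy as np\n", 1),
                    code_cell("print(x)\n", 2)]}

print(diff_cells(old, rerun))   # []
print(diff_cells(old, edited))  # [('delete', ['x = np.arange(3)\n'], [])]
```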
Install
pip install nbdime
Diff notebooks in your terminal with nbdiff
nbdiff notebook_1.ipynb notebook_2.ipynb
Get a rich web-based rendering of the diff with nbdiff-web
nbdiff-web notebook_1.ipynb notebook_2.ipynb
Git integration
nbdime config-git --enable --global
What about the data?
data.csv                        original dataset
preprocessed_data.csv           rescaled dataset → [0, 1]
preprocessed_data_clean.csv     remove incorrect data
preprocessed_data_clean_1.csv   remove outliers
Hangar
version control for tensor data
https://github.com/tensorwerk/hangar-py
Dataset
Arrayset                    Arrayset      Arrayset
image                       filename      label
[[0,1,0,1], … ,[0,1,1,1]]   "image1.png"  0
[[1,1,1,1], … ,[0,1,1,1]]   "image2.png"
[[0,1,1,1], … ,[1,1,1,0]]   "image3.png"  1
Arbitrary Backend Selection
Each arrayset is stored in a backend optimised
for data of that particular shape, type and layout.
Working with data
>>> from hangar import Repository
>>> import numpy as np
>>> # Create the repository on disk (remove_old=True replaces an existing one)
>>> repo = Repository(path='path/to/repository')
>>> repo.init(user_name='Alessia Marcolini', user_email='amarcolini@fbk.eu',
...           remove_old=True)
>>> # Only a write-enabled checkout can add data
>>> co = repo.checkout(write=True)
>>> train_images = np.load('mnist_training_images.npy')
>>> # The prototype sample defines the shape and dtype of the arrayset
>>> co.arraysets.init_arrayset(name='mnist_training_images', prototype=train_images[0])
>>> train_aset = co.arraysets['mnist_training_images']
>>> # Samples can be stored under string or integer keys
>>> train_aset['0'] = train_images[0]
>>> train_aset.add(data=train_images[1], name='1')
>>> train_aset[51] = train_images[51]
>>> co.commit('Add mnist dataset')
>>> co.close()
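The role of the prototype argument can be pictured with a small stand-alone sketch. It assumes, as the prototype suggests, that every sample in an arrayset must match the prototype's shape; ToyArrayset and shape_of are hypothetical names that use nested lists instead of NumPy arrays and are not part of the Hangar API.

```python
def shape_of(sample):
    """Shape of a rectangular nested list, e.g. [[0, 1], [1, 1]] -> (2, 2)."""
    shape = []
    while isinstance(sample, list):
        shape.append(len(sample))
        sample = sample[0] if sample else None
    return tuple(shape)


class ToyArrayset:
    """Named samples that must all match the shape of a prototype sample."""

    def __init__(self, name, prototype):
        self.name = name
        self.shape = shape_of(prototype)
        self._samples = {}

    def __setitem__(self, key, sample):
        # Reject samples that do not fit the layout fixed by the prototype.
        if shape_of(sample) != self.shape:
            raise ValueError(f"expected shape {self.shape}, got {shape_of(sample)}")
        self._samples[key] = sample

    def __getitem__(self, key):
        return self._samples[key]

    def __len__(self):
        return len(self._samples)


images = ToyArrayset("image", prototype=[[0, 1, 0, 1], [0, 1, 1, 1]])
images["0"] = [[1, 1, 1, 1], [0, 1, 1, 1]]
images[51] = [[0, 1, 1, 1], [1, 1, 1, 0]]
try:
    images["bad"] = [[0, 1], [1, 0]]
except ValueError as error:
    print(error)  # expected shape (2, 4), got (2, 2)
print(len(images))  # 2
```

Knowing the layout of every sample in advance is what lets a storage backend be chosen for that particular kind of data.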
Branching & Merging
>>> dummy = np.arange(10, dtype=np.uint16)
>>> aset = co.arraysets.init_arrayset(name='dummy_arrayset', prototype=dummy)
>>> aset['0'] = dummy
>>> initialCommitHash = co.commit('single sample added to a dummy arrayset')
>>> co.close()
>>> branch_1 = repo.create_branch(name='testbranch')
>>> co = repo.checkout(write=True, branch='testbranch')
>>> co.arraysets['dummy_arrayset']['0']
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint16)
>>> arr = np.arange(10, dtype=np.uint16)
>>> arr += 1
>>> co['dummy_arrayset', '1'] = arr
>>> co.commit('second sample')
>>> co.close()  # release the write checkout before opening another one
>>> co = repo.checkout(write=True, branch='master')
>>> co.merge(message='message for commit (not used for FF merge)',
...          dev_branch='testbranch')
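Why the merge message is not used for a fast-forward (FF) merge can be seen in a toy commit graph: when the target branch has no commits of its own since the branch point, merging only moves the branch pointer and no merge commit is created. The class ToyHistory below is a hypothetical illustration that tracks pointers only; it does not merge any data and is not Hangar's implementation.

```python
class ToyHistory:
    """Toy commit graph: parents, messages and branch heads."""

    def __init__(self):
        self.parents = {}   # commit id -> list of parent ids
        self.messages = {}  # commit id -> commit message
        self.heads = {}     # branch name -> commit id

    def commit(self, branch, commit_id, message):
        head = self.heads.get(branch)
        self.parents[commit_id] = [head] if head is not None else []
        self.messages[commit_id] = message
        self.heads[branch] = commit_id

    def create_branch(self, name, base):
        self.heads[name] = self.heads[base]

    def is_ancestor(self, ancestor, commit):
        """True if `ancestor` is reachable from `commit` via parent links."""
        stack, seen = [commit], set()
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents[current])
        return False

    def merge(self, message, target, source):
        ours, theirs = self.heads[target], self.heads[source]
        if self.is_ancestor(theirs, ours):
            return "already up to date"
        if self.is_ancestor(ours, theirs):
            self.heads[target] = theirs  # fast-forward: move the pointer only
            return "fast-forward"
        merge_id = f"merge-{ours}-{theirs}"
        self.parents[merge_id] = [ours, theirs]
        self.messages[merge_id] = message  # only now is the message needed
        self.heads[target] = merge_id
        return "merge commit"


history = ToyHistory()
history.commit("master", "c1", "single sample added to a dummy arrayset")
history.create_branch("testbranch", base="master")
history.commit("testbranch", "c2", "second sample")
print(history.merge("not used for FF merge", target="master", source="testbranch"))
print(history.heads["master"])

# Once both branches have moved on, a real merge commit is required.
history.commit("master", "c3", "change on master")
history.commit("testbranch", "c4", "change on testbranch")
print(history.merge("merge testbranch", target="master", source="testbranch"))
```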
Working with remotes
$ hangar server
>>> repo.remote.add('origin', 'localhost:50051')
>>> repo.remote.push('origin', 'master')
>>> cloneRepo = Repository('path/to/repository')
>>> cloneRepo.clone('Alessia Marcolini',
...                 'amarcolini@fbk.eu',
...                 'localhost:50051',
...                 remove_old=True)
>>> cloneRepo.remote.fetch_data('origin', 'master', 1024)
Machine Learning dataloaders
TensorFlow
>>> from hangar import Repository
>>> from hangar import make_tf_dataset
>>> import tensorflow as tf
>>> tf.compat.v1.enable_eager_execution()
>>> repo = Repository('.')
>>> co = repo.checkout()
>>> data = co.arraysets['mnist_data']
>>> target = co.arraysets['mnist_target']
>>> dataset = make_tf_dataset([data, target])
>>> dataset = dataset.batch(512)
>>> for b_data, b_target in dataset:
...     print(b_data.shape, b_target.shape)
PyTorch
>>> from hangar import Repository
>>> from hangar import make_torch_dataset
>>> from torch.utils.data import DataLoader
>>> repo = Repository('.')
>>> co = repo.checkout()
>>> aset = co.arraysets['dummy_aset']
>>> dataset = make_torch_dataset(aset,
...                              index_range=slice(1, 100))
>>> loader = DataLoader(dataset,
...                     batch_size=16)
>>> for batch in loader:
...     train_model(batch)  # train_model stands for your own training function
The whole story …
Data: data.csv, preprocessed_data.csv, preprocessed_data_clean.csv, preprocessed_data_clean_1.csv
Models: model, model_1, model_final, model_final_v2
(adapted from a graphic by @faviovaz)
How to keep track of changes?
How to link code, data, model, metrics?
Are you able to ensure replicability?
https://github.com/iterative/dvc
DVC is designed to be framework- and language-agnostic,
and it runs on top of Git repositories.
Link to video: https://www.youtube.com/watch?v=4h6I9_xeYA4
To initialise the dvc repo, run
$ dvc init
Then choose a data remote:
• Local
• AWS S3
• Google Cloud Storage
• Azure Blob Storage
• SSH
• HDFS
• HTTP
Then run
$ dvc remote add -d myremote s3://YOUR_BUCKET_NAME/data
Add data to DVC
$ dvc add data/data.csv     # creates data/data.csv.dvc and git-ignores data/data.csv via data/.gitignore
$ git add data/data.csv.dvc data/.gitignore
$ git commit -m "add data"
$ dvc push                  # uploads the data to the remote
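Git only versions the small .dvc file, which identifies the data by a checksum of its content rather than by its file name. The sketch below illustrates that idea with an MD5 hash from the standard library; file_md5 is a hypothetical helper, and DVC's actual file format and cache layout are not reproduced here.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("x,y\n0.1,0.9\n")
    original = file_md5(data)

    data.write_text("x,y\n0.1,0.9\n0.5,0.5\n")  # the dataset changes
    modified = file_md5(data)

    data.write_text("x,y\n0.1,0.9\n")           # back to the first version
    restored = file_md5(data)

print(original != modified)  # True: different content, different version
print(original == restored)  # True: same content, same version
```

With a checksum as the version identifier, names such as preprocessed_data_clean_1.csv are no longer needed to tell versions apart.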
Retrieve data
$ rm -f data/data.csv
$ dvc pull
or
$ dvc pull data/data.csv.dvc
Connect code and data
$ dvc run -f preprocess_data.dvc \
      -d src/prep.py -d data/data.csv \
      -o data/preprocessed_data.csv \
      python src/prep.py data/data.csv data/preprocessed_data.csv
$ git add preprocess_data.dvc data/.gitignore
$ git commit -m "add preprocessing stage"
$ dvc push
Pipelines
$ dvc run -f train.dvc \
      -d src/train.py -d data/preprocessed_data.csv \
      -o results/model.pkl \
      python src/train.py data/preprocessed_data.csv results/model.pkl
$ git add train.dvc results/.gitignore
$ git commit -m "add training stage"
$ dvc push
Reproduce
$ dvc repro train.dvc
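The idea behind reproducing a pipeline is that a stage only needs to run again when one of its dependencies has changed since the last run. The sketch below mimics this with in-memory "files" and checksums; ToyStage and repro are hypothetical names, the stages are assumed to be listed in dependency order, and DVC's real bookkeeping is more thorough.

```python
import hashlib


def checksum(content):
    return hashlib.md5(content.encode("utf-8")).hexdigest()


class ToyStage:
    """A command with dependencies (-d) and one output (-o)."""

    def __init__(self, name, deps, out, func):
        self.name, self.deps, self.out, self.func = name, deps, out, func
        self.recorded = None  # checksums of the dependencies at the last run

    def is_stale(self, files):
        current = {dep: checksum(files[dep]) for dep in self.deps}
        return self.out not in files or current != self.recorded

    def run(self, files):
        files[self.out] = self.func(*(files[dep] for dep in self.deps))
        self.recorded = {dep: checksum(files[dep]) for dep in self.deps}


def repro(stages, files):
    """Re-run only the stale stages and return their names."""
    reran = []
    for stage in stages:
        if stage.is_stale(files):
            stage.run(files)
            reran.append(stage.name)
    return reran


files = {
    "src/prep.py": "# preprocessing code",
    "data/data.csv": "a,b\n1,2\n",
    "src/train.py": "# training code",
}
stages = [
    ToyStage("preprocess_data.dvc", ["src/prep.py", "data/data.csv"],
             "data/preprocessed_data.csv", lambda code, data: data.upper()),
    ToyStage("train.dvc", ["src/train.py", "data/preprocessed_data.csv"],
             "results/model.pkl",
             lambda code, data: f"model fitted on {len(data)} characters"),
]

print(repro(stages, files))  # first run: both stages
print(repro(stages, files))  # nothing changed: no stage
files["src/train.py"] = "# training code, new learning rate"
print(repro(stages, files))  # only the training stage
files["data/data.csv"] = "a,b\n1,2\n3,4\n"
print(repro(stages, files))  # new data: both stages again
```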
Tag and go
$ git tag -a "baseline-experiment" -m "Baseline experiment"
$ git checkout baseline-experiment
$ dvc checkout
Thank you
amarcolini@fbk.eu @viperale

Version Control For Data Science @ PyCon DE & PyData Berlin // October 11th 2019
