Sharing 101
Code Reproducibility & Sharing
Series
Omnia Mohamed
Data Analytics Engineer, IBM
What will you NOT learn during this session?
How to code in R
How to be a professional Git user
What will you get from this session?
How to configure Git and R to play nicely together
How to organize your R projects
How to publish your first R project on GitHub
Some tips to make your code more shareable
Why do we need Git?
Version Control
Secure organized location for your code
Computer crashed?
Building a Career
● An essential skill for the job market
● Your GitHub account will be a portfolio of your data science projects
● Base for blogging
Team Collaboration
No need for shared folders
Easier tracking of changes
Code merging capabilities
Easy finger pointing
Who is who?
Git
Open source project for version control originally developed in 2005.
GitHub
Web-based Git repository hosting service that offers all of Git's distributed revision control and source code management (SCM) functionality.
Where do I start?
Install R & RStudio
https://cloud.r-project.org/
https://www.rstudio.com/products/rstudio/download/
Install Git
https://gitforwindows.org/
Configure your account in your local Git installation
Open Git Bash and run the following commands:
git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list # this should show the configuration you just set
Create your first repository
From the GitHub website:
Create a new repository
Get it local
Using RStudio:
1. File -> New Project -> Version Control -> Git
2. Paste the repository URL shown on the new repository's GitHub page
Or from the Git Bash command line:
git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
Write your first script
File -> New File -> R Script
Generate some random data
x <- rnorm(1000)
y <- x * 2 + rnorm(1000)
df <- data.frame(x, y)
Visualize it
library(ggplot2) # ggplot(), aes() and geom_point() come from the ggplot2 package
ggplot(data = df, mapping = aes(x, y)) + geom_point()
Save!
Let’s land on Git
Simple Git workflow
Commit to the local repository
Add a commit message
Push to the remote repository
Check it out on the web
Tips for new gities
Comment your commits
Commit frequently
Push only tested code
Pull frequently
Sharing data science projects
The IKEA Mode vs. the Plug & Play Mode
The keys to a plug & play project
● Has a README file
● Standard coding convention
● Organized project directory
● Reproducible code
● Executable outputs
Read Me File
Project Title
Project scope
Environment and version info
Prerequisite
Installation guide
Example of usage
Authors
Contribution
License
You don’t need to include all sections, only the ones that apply to your project.
Project directory organization
● Script files: kept in a folder also known as the “scripts” folder
● Markdown reports: each markdown report has its own folder inside
● Data: saved under 2 folders, “Raw” for original data and “Preprocessed” for manipulated and cleaned data
● Shiny apps: each Shiny app has its own folder under the apps folder
● You can add additional folders as you need, such as docs or figs
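A minimal sketch of how such a layout could be created from R. Only the "scripts", "Raw" and "Preprocessed" names come from the slide. The "data", "reports" and "shiny_apps" folder names, the "my_project" name, and the use of a temporary directory are assumptions made for illustration.

```r
# Build the skeleton inside a throwaway directory so the sketch is safe to run.
project_root <- file.path(tempdir(), "my_project")  # hypothetical project name

folders <- c(
  "scripts",                          # script files
  "reports",                          # assumed name: one sub-folder per markdown report
  file.path("data", "Raw"),           # original data
  file.path("data", "Preprocessed"),  # manipulated and cleaned data
  "shiny_apps"                        # assumed name: one sub-folder per Shiny app
)

for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Show what was created, relative to the project root.
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

In a real project, project_root would be the folder RStudio created when the repository was cloned.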
Standard coding convention
Tidyverse style guide
Google R style guide
Make it Readable
File names: meaningful, with no special characters, and prefixed with the run order if the files should run in sequence, e.g. 00_dataprep_functions.R
Attribute names: lowercase with underscores, e.g. expiry_date
Assignment: use <- instead of =, e.g. x <- 5 (RStudio shortcut: Alt + -)
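The file naming rule can be checked mechanically. The sketch below is one possible reading of it: the helper name is_ordered_script_name, the regular expression, and every candidate file name except 00_dataprep_functions.R are assumptions, not part of the original slides.

```r
# Hypothetical helper: TRUE when a file name has a two-digit order prefix,
# then only lowercase letters, digits and underscores, and an .R extension.
is_ordered_script_name <- function(file_names) {
  grepl("^[0-9]{2}_[a-z0-9_]+\\.R$", file_names)
}

candidates <- c("00_dataprep_functions.R", "Final Model (v2).R", "01_train_model.R")
is_ordered_script_name(candidates)

# With ordered prefixes, a plain sort gives the run order.
sort(candidates[is_ordered_script_name(candidates)])
```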
Functions naming and commenting
Same naming convention as objects, e.g.:
#' Drop last column of dataframe
#' @param data A dataframe.
#' @return dataframe after dropping last column.
#' @examples
#' drop_last_col(iris)
drop_last_col <- function(data) {
  # Negative indexing removes the column at position ncol(data)
  dropped_data <- data[-ncol(data)]
  return(dropped_data)
}
● The roxygen comments state the function objective and the function parameters
● Name is lowercase with no special characters; opening bracket right after the function definition
● Closing bracket at the end on a separate line
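A short usage sketch for drop_last_col, run on the built-in iris data frame that the roxygen example mentions. The function is repeated so the block runs on its own. The variable name trimmed is an assumption.

```r
drop_last_col <- function(data) {
  # Negative indexing removes the column at position ncol(data)
  dropped_data <- data[-ncol(data)]
  return(dropped_data)
}

# iris ships with R: 5 columns, the last one being Species.
trimmed <- drop_last_col(iris)

names(trimmed)              # the four measurement columns remain
ncol(iris) - ncol(trimmed)  # exactly one column was dropped
```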
Make it Reproducible - here
here():
library(here)
file_name <- here("data", "file.csv")
# file_name now holds the path "myprojectrootfolder/data/file.csv"
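A hedged sketch of using that path without hard-coding a machine-specific location. The fallback branch is an assumption added so the block still runs where the here package is not installed; it only gives the same result when the working directory is the project root.

```r
# Build the path relative to the project root instead of typing an absolute path.
if (requireNamespace("here", quietly = TRUE)) {
  csv_path <- here::here("data", "file.csv")
} else {
  # Assumed fallback: treats the current working directory as the project root.
  csv_path <- file.path(getwd(), "data", "file.csv")
}

csv_path

# Read the file only if it is really there, so the sketch never errors out.
if (file.exists(csv_path)) {
  my_data <- read.csv(csv_path)
}
```

Because the path is rebuilt from the project root on every machine, a collaborator who clones the repository into a different folder can run the script unchanged.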
Make it Reproducible - Seed
For reproducing data or results that depend on random number generation, use set.seed() to ensure the same results every time.
par(mfrow=c(2,2))
for(i in 1:4){
x <- rnorm(1000)
hist(x, main = paste0("fig",i))
}
Make it Reproducible - Seed
par(mfrow=c(2,2))
for(i in 1:4){
set.seed(123)
x <- rnorm(1000)
hist(x, main = paste0("fig",i))
}
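Calling set.seed(123) inside the loop, as above, makes all four histograms identical. If the goal is four different samples that are still the same on every run, one option is to set the seed once before the loop. The sketch below illustrates this under that assumption; the helper draw_four is hypothetical.

```r
# Hypothetical helper: seed once, then draw four samples in a row.
draw_four <- function(seed) {
  set.seed(seed)
  lapply(1:4, function(i) rnorm(1000))
}

run1 <- draw_four(123)
run2 <- draw_four(123)

identical(run1, run2)            # TRUE: the whole run is reproducible
identical(run1[[1]], run1[[2]])  # FALSE: samples within one run still differ
```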
Make it reproducible - pacman
Make sure that the packages you use are installed on the running machine:
# install pacman if it is not already installed
if (!require(pacman)) {
  install.packages("pacman")
}
# p_load() checks whether each package is installed, installs any that are missing, and loads them all
pacman::p_load("tidyverse", "caTools", "glmnet")
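Where installing packages automatically is not wanted (for example on a shared server), a base R check can report what is missing instead. This is a sketch of that alternative, not a replacement for pacman. The variable names are assumptions; the package list is the one from the slide.

```r
required_pkgs <- c("tidyverse", "caTools", "glmnet")

# requireNamespace() returns TRUE or FALSE; it loads a package's namespace
# if available but does not attach it and never installs anything.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```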
Make it Reproducible
Environment practices:
● Use Packrat for library management
● Use checkpoint
● Use Docker for full environment sharing
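None of the three tools above is shown here. As a lighter complement, the sketch below records the R version and loaded packages with base R's sessionInfo(), which can feed the "Environment and version info" section of a README. The file name session_info.txt and the temporary location are assumptions.

```r
# Write the session details to a text file; tempdir() keeps the sketch side-effect free.
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

R.version.string               # the R version used for this run
head(readLines(info_file), 3)  # first lines of the recorded session details
```

In a real project the file would sit next to the README and be committed with the code.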
Make it executable
● Use R Markdown for reporting analysis (there will be a session on it later ;) )
● Use Shiny apps for tools and interactive reports
● Use APIs for accessible models (plumber is your friend)
● Create packages
Now it’s your turn
Fork the repo of the world life expectancy dataset:
https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-07-03
Create your own project
Organize it your way
Find out:
● Top 3 countries with the highest life expectancy in 2015.
● Top 3 countries that improved the most over the past 20 years.
Share your repo with us on the meetup website
The first 3 to submit following the mentioned guidelines will win a voucher worth 50 LE
May the odds be ever in your favor!
References
Git Resources:
https://git-scm.com/book/en/v2
https://happygitwithr.com/install-git.html#install-git-windows
https://www.javaworld.com/article/2113465/git-smart-20-essential-tips-for-git-and-github-users.html
References - cont.
Reproducibility and project organization:
https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/
https://kbroman.org/steps2rr/pages/organize.html
https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf
README template:
https://gist.github.com/PurpleBooth/109311bb0361f32d87a2
References - Cont.
Style guides:
https://style.tidyverse.org/files.html#names
https://google.github.io/styleguide/Rguide.xml
Environment packaging :
https://rstudio.github.io/packrat/walkthrough.html
https://colinfay.me/docker-r-reproducibility/
Thank you!
Cairo
Meetup

Weitere ähnliche Inhalte

Was ist angesagt?

Overlay Technique | Pebble Developer Retreat 2014
Overlay Technique | Pebble Developer Retreat 2014Overlay Technique | Pebble Developer Retreat 2014
Overlay Technique | Pebble Developer Retreat 2014Pebble Technology
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformIMC Institute
 
Device-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsDevice-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsemBO_Conference
 
Time Series Data with InfluxDB
Time Series Data with InfluxDBTime Series Data with InfluxDB
Time Series Data with InfluxDBTuri, Inc.
 
Infrastructure as Code & Terraform 101
Infrastructure as Code & Terraform 101Infrastructure as Code & Terraform 101
Infrastructure as Code & Terraform 101Kristoffer Ahl
 
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...eCommConf
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROPJapneet Singh
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_countKohei KaiGai
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsemBO_Conference
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in RustInfluxData
 
HBase based map reduce job unit testing
HBase based map reduce job unit testingHBase based map reduce job unit testing
HBase based map reduce job unit testingAshok Agarwal
 
#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk
#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk
#PDR15 Creating Pebble Apps for Aplite, Basalt, and ChalkPebble Technology
 
Cypher for Gremlin
Cypher for GremlinCypher for Gremlin
Cypher for GremlinopenCypher
 
Scaling with Python: SF Python Meetup, September 2017
Scaling with Python: SF Python Meetup, September 2017Scaling with Python: SF Python Meetup, September 2017
Scaling with Python: SF Python Meetup, September 2017Varun Varma
 
Linux Kernel 개발참여방법과 문화 (Contribution)
Linux Kernel 개발참여방법과 문화 (Contribution)Linux Kernel 개발참여방법과 문화 (Contribution)
Linux Kernel 개발참여방법과 문화 (Contribution)Ubuntu Korea Community
 

Was ist angesagt? (20)

Overlay Technique | Pebble Developer Retreat 2014
Overlay Technique | Pebble Developer Retreat 2014Overlay Technique | Pebble Developer Retreat 2014
Overlay Technique | Pebble Developer Retreat 2014
 
Adding CF Attributes to an HDF5 File
Adding CF Attributes to an HDF5 FileAdding CF Attributes to an HDF5 File
Adding CF Attributes to an HDF5 File
 
Mahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud PlatformMahout Workshop on Google Cloud Platform
Mahout Workshop on Google Cloud Platform
 
PyHEP 2019: Python 3.8
PyHEP 2019: Python 3.8PyHEP 2019: Python 3.8
PyHEP 2019: Python 3.8
 
Device-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded SystemsDevice-specific Clang Tooling for Embedded Systems
Device-specific Clang Tooling for Embedded Systems
 
Time Series Data with InfluxDB
Time Series Data with InfluxDBTime Series Data with InfluxDB
Time Series Data with InfluxDB
 
Infrastructure as Code & Terraform 101
Infrastructure as Code & Terraform 101Infrastructure as Code & Terraform 101
Infrastructure as Code & Terraform 101
 
Sorter
SorterSorter
Sorter
 
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...
Tim Panton - Presentation at Emerging Communications Conference & Awards (eCo...
 
Limits Profiling
Limits ProfilingLimits Profiling
Limits Profiling
 
Bypassing DEP using ROP
Bypassing DEP using ROPBypassing DEP using ROP
Bypassing DEP using ROP
 
20210928_pgunconf_hll_count
20210928_pgunconf_hll_count20210928_pgunconf_hll_count
20210928_pgunconf_hll_count
 
Profiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf ToolsProfiling your Applications using the Linux Perf Tools
Profiling your Applications using the Linux Perf Tools
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
 
HBase based map reduce job unit testing
HBase based map reduce job unit testingHBase based map reduce job unit testing
HBase based map reduce job unit testing
 
#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk
#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk
#PDR15 Creating Pebble Apps for Aplite, Basalt, and Chalk
 
Cypher for Gremlin
Cypher for GremlinCypher for Gremlin
Cypher for Gremlin
 
Scaling with Python: SF Python Meetup, September 2017
Scaling with Python: SF Python Meetup, September 2017Scaling with Python: SF Python Meetup, September 2017
Scaling with Python: SF Python Meetup, September 2017
 
Linux Kernel 개발참여방법과 문화 (Contribution)
Linux Kernel 개발참여방법과 문화 (Contribution)Linux Kernel 개발참여방법과 문화 (Contribution)
Linux Kernel 개발참여방법과 문화 (Contribution)
 
Terraform 101
Terraform 101Terraform 101
Terraform 101
 

Ähnlich wie Sharing 101: Code Reproducibility & Sharing Series

Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in RSamuel Bosch
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaJoe Stein
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using SwiftDiego Freniche Brito
 
Dependencies Managers in C/C++. Using stdcpp 2014
Dependencies Managers in C/C++. Using stdcpp 2014Dependencies Managers in C/C++. Using stdcpp 2014
Dependencies Managers in C/C++. Using stdcpp 2014biicode
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R projectWLOG Solutions
 
Open source projects with python
Open source projects with pythonOpen source projects with python
Open source projects with pythonroskakori
 
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)Fabrice Bernhard
 
Advanced Malware Analysis Training Session 5 - Reversing Automation
Advanced Malware Analysis Training Session 5 - Reversing AutomationAdvanced Malware Analysis Training Session 5 - Reversing Automation
Advanced Malware Analysis Training Session 5 - Reversing Automationsecurityxploded
 
Django dev-env-my-way
Django dev-env-my-wayDjango dev-env-my-way
Django dev-env-my-wayRobert Lujo
 
PyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MorePyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MoreMatt Harrison
 
DevOps(4) : Ansible(2) - (MOSG)
DevOps(4) : Ansible(2) - (MOSG)DevOps(4) : Ansible(2) - (MOSG)
DevOps(4) : Ansible(2) - (MOSG)Soshi Nemoto
 
"I have a framework idea" - Repeat less, share more.
"I have a framework idea" - Repeat less, share more."I have a framework idea" - Repeat less, share more.
"I have a framework idea" - Repeat less, share more.Fabio Milano
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overviewprevota
 
Lean Drupal Repositories with Composer and Drush
Lean Drupal Repositories with Composer and DrushLean Drupal Repositories with Composer and Drush
Lean Drupal Repositories with Composer and DrushPantheon
 
C# Production Debugging Made Easy
 C# Production Debugging Made Easy C# Production Debugging Made Easy
C# Production Debugging Made EasyAlon Fliess
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdbRoman Podoliaka
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRANRevolution Analytics
 

Ähnlich wie Sharing 101: Code Reproducibility & Sharing Series (20)

Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
Developing with the Go client for Apache Kafka
Developing with the Go client for Apache KafkaDeveloping with the Go client for Apache Kafka
Developing with the Go client for Apache Kafka
 
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides:  Let's build macOS CLI Utilities using SwiftMobileConf 2021 Slides:  Let's build macOS CLI Utilities using Swift
MobileConf 2021 Slides: Let's build macOS CLI Utilities using Swift
 
Dependencies Managers in C/C++. Using stdcpp 2014
Dependencies Managers in C/C++. Using stdcpp 2014Dependencies Managers in C/C++. Using stdcpp 2014
Dependencies Managers in C/C++. Using stdcpp 2014
 
How to lock a Python in a cage? Managing Python environment inside an R project
How to lock a Python in a cage?  Managing Python environment inside an R projectHow to lock a Python in a cage?  Managing Python environment inside an R project
How to lock a Python in a cage? Managing Python environment inside an R project
 
Open source projects with python
Open source projects with pythonOpen source projects with python
Open source projects with python
 
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)
Adopt DevOps philosophy on your Symfony projects (Symfony Live 2011)
 
Ab initio training Ab-initio Architecture
Ab initio training Ab-initio ArchitectureAb initio training Ab-initio Architecture
Ab initio training Ab-initio Architecture
 
Advanced Malware Analysis Training Session 5 - Reversing Automation
Advanced Malware Analysis Training Session 5 - Reversing AutomationAdvanced Malware Analysis Training Session 5 - Reversing Automation
Advanced Malware Analysis Training Session 5 - Reversing Automation
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Devtools cheatsheet
Devtools cheatsheetDevtools cheatsheet
Devtools cheatsheet
 
Django dev-env-my-way
Django dev-env-my-wayDjango dev-env-my-way
Django dev-env-my-way
 
PyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and MorePyCon 2013 : Scripting to PyPi to GitHub and More
PyCon 2013 : Scripting to PyPi to GitHub and More
 
DevOps(4) : Ansible(2) - (MOSG)
DevOps(4) : Ansible(2) - (MOSG)DevOps(4) : Ansible(2) - (MOSG)
DevOps(4) : Ansible(2) - (MOSG)
 
"I have a framework idea" - Repeat less, share more.
"I have a framework idea" - Repeat less, share more."I have a framework idea" - Repeat less, share more.
"I have a framework idea" - Repeat less, share more.
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
 
Lean Drupal Repositories with Composer and Drush
Lean Drupal Repositories with Composer and DrushLean Drupal Repositories with Composer and Drush
Lean Drupal Repositories with Composer and Drush
 
C# Production Debugging Made Easy
 C# Production Debugging Made Easy C# Production Debugging Made Easy
C# Production Debugging Made Easy
 
Debugging Python with gdb
Debugging Python with gdbDebugging Python with gdb
Debugging Python with gdb
 
Through the firewall with miniCRAN
Through the firewall with miniCRANThrough the firewall with miniCRAN
Through the firewall with miniCRAN
 

Kürzlich hochgeladen

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 

Kürzlich hochgeladen (20)

Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 

Sharing 101: Code Reproducibility & Sharing Series

  • 1. Sharing 101 Code Reproducibility & Sharing Series Omnia Mohamed Data Analytics Engineer , IBM
  • 2. What you will NOT learn during this session? How to Code in R How to be professional Git users
  • 3. What you’ll get from this session? How to configure Git and R to play nice together How to organize your R projects How to publish your first R project on github Some tips to make your code more shareable
  • 4. Why we need git?
  • 6. Secure organized location for your code Computer crashed?
  • 7. Building a Career ● An essential skill for work market ● Your git account will be a portfolio of your data science projects ● Base for blogging
  • 8. Team Collaboration No need for shared folders Easier tracking of changes Code merging capabilities Easy finger pointing
  • 9. Who is who? Git Open source project for version control originally developed in 2005. Github Web-based Git repository hosting service, which offers all of the distributed revision control and source code management (SCM) functionality.
  • 10. Where do I start? Install R & Rstudio https://cloud.r-project.org/ https://www.rstudio.com/products/rstudio/download/ Install Git https://gitforwindows.org/
  • 11. Configure your account on git local Open Git bash and run the following commands: git config --global user.name 'Jane Doe' git config --global user.email 'jane@example.com' git config --global --list #this should show the configurations you just set
  • 12. Create your first repository From Git website : Create new repository
  • 13. Create your first repository Get it local Using R Studio : 1. File -> new project ->version control -> git 2. Insert repository url that u get from this screen Or from git bash command line: git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
  • 14. Write your first script File -> New ->R script Generate some random data x <- rnorm(1000) y <- x * 2 + rnorm(1000) df <- data.frame(x, y) Visualize it ggplot(data = df,mapping = aes(x,y))+geom_point() Save!
  • 17. Let’s land on git Commit to local repository Add comments Push to remote repository Check it out on the web
  • 18. Tips for new gities Comment your commits Commit frequently Push only tested code Pull frequently
  • 19. Sharing data science projects The Ikea Mode Plug & Play Mode
  • 20. The keys of a plug & play project ● Has readme file ● Standard coding convention ● Organized project directory ● Reproducible code ● Executable outputs
  • 21. Read Me File Project Title Project scope Environment and version info Prerequisites Installation guide Example of usage Authors Contribution License You don’t need to include all sections, only the ones that apply to your project
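One way to fill in the "Environment and version info" section is to let R report it rather than typing it by hand. The following is a minimal sketch using base R only; the output file name `environment_info.txt` and the temporary location are hypothetical choices, not something the slides prescribe.

```r
# Collect the R version and session details for the README.
# Assumption: base R only; the output file name is a hypothetical choice.
env_lines <- c(
  R.version.string,                       # e.g. the R release in use
  paste("Platform:", R.version$platform), # OS / architecture string
  "",
  capture.output(print(sessionInfo()))    # attached packages and versions
)

# Written to a temporary folder here; in a real project you would
# point this at the project root next to the README.
out_file <- file.path(tempdir(), "environment_info.txt")
writeLines(env_lines, out_file)
```

The resulting text can be pasted into the README or committed alongside it so readers can see which versions produced the results.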
  • 22. Project Directory organization Script files: kept in the “scripts” folder. Markdown reports: each markdown report has its own folder inside the reports folder. Data: saved under 2 folders, “Raw” for original data and “Preprocessed” for manipulated and cleaned data. Shiny apps: each Shiny app has its own folder under the apps folder. You can add further folders as you need, like docs or figs.
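The layout above can be created in one go from R. This is a sketch under stated assumptions: "scripts", "Raw", "Preprocessed", "docs" and "figs" come from the slide, while "data", "reports" and "apps" are hypothetical names for the parent folders, and the project is created in a temporary directory purely for illustration.

```r
# Create the project skeleton inside a temporary directory.
# Assumption: "data", "reports" and "apps" are hypothetical folder names;
# the slide does not name those parent folders.
project_root <- file.path(tempdir(), "my_project")

folders <- c(
  "scripts",                         # R script files
  "reports",                         # one sub-folder per markdown report
  file.path("data", "Raw"),          # original, untouched data
  file.path("data", "Preprocessed"), # cleaned / manipulated data
  "apps",                            # one sub-folder per Shiny app
  "docs",
  "figs"
)

for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root.
print(list.dirs(project_root, full.names = FALSE, recursive = TRUE))
```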
  • 23. Standard coding convention Tidyverse style guide Google R style guide
  • 24. Make it Readable File names: meaningful, with no special chars, and prefixed with the order of the file if they should run in sequence, e.g. 00_dataprep_functions.R Attribute names: lowercase with _, e.g. expiry_date Assignment: use <- instead of =, e.g. x <- 5 (RStudio shortcut: Alt + -)
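The point of the numeric prefix is that sorting the file names gives the run order. A small sketch of that idea follows; the two script names and their contents are hypothetical, written to a temporary folder only so the example is self-contained.

```r
# Show that numeric prefixes make the run order explicit.
# Assumption: the script names and contents below are hypothetical examples.
script_dir <- file.path(tempdir(), "scripts_demo")
dir.create(script_dir, showWarnings = FALSE)

writeLines("raw_values <- c(3, 1, 2)",
           file.path(script_dir, "00_dataprep.R"))
writeLines("sorted_values <- sort(raw_values)",
           file.path(script_dir, "01_analysis.R"))

# Sorting by name puts 00_ before 01_, so the analysis sees the prepared data.
script_files <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))

# Run every script in order inside one shared environment.
run_env <- new.env()
for (script_file in script_files) {
  source(script_file, local = run_env)
}

print(run_env$sorted_values)
```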
  • 25. Functions naming and commenting Same naming as objects, e.g.: #' Drop last column of dataframe #' @param data A dataframe. #' @return dataframe after dropping last column. #' @examples #' drop_last_col(iris) drop_last_col <- function(data){ dropped_data <- data[-c(length(data))] return(dropped_data) } Callouts: the #' comments state the function objective and the function parameters; the name is lowercase with no special characters; the opening bracket comes right after the function definition; the closing bracket goes at the end on a separate line.
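Laid out on separate lines, the documented function from this slide can be run as is. The call on the built-in `iris` data set follows the slide's own `@examples` line; the variable name `trimmed_iris` is introduced here only for illustration.

```r
#' Drop last column of dataframe
#' @param data A dataframe.
#' @return dataframe after dropping last column.
#' @examples
#' drop_last_col(iris)
drop_last_col <- function(data) {
  # length() of a data frame is its number of columns,
  # so a negative index on it removes the last column.
  dropped_data <- data[-c(length(data))]
  return(dropped_data)
}

# iris ships with R; its last column is Species.
trimmed_iris <- drop_last_col(iris)
print(names(trimmed_iris))
```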
  • 26. Make it Reproducible - here here(): library(here) file_name <- here("data", "file.csv") #The file_name string now holds the value of: "myprojectrootfolder/data/file.csv"
  • 27. Make it Reproducible - Seed For reproducing data or results that depend on random number generation, use set.seed() to ensure the same results every time. Without a seed, each of the four histograms differs: par(mfrow=c(2,2)) for(i in 1:4){ x <- rnorm(1000) hist(x, main = paste0("fig",i)) }
  • 28. Make it Reproducible - Seed par(mfrow=c(2,2)) for(i in 1:4){ set.seed(123) x <- rnorm(1000) hist(x, main = paste0("fig",i)) }
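The effect of the seed can also be checked without plots. The following minimal sketch uses base R only; the variable names are illustrative, and the seed value 123 is the one used on the slide.

```r
# Same seed, same random numbers.
set.seed(123)
first_draw <- rnorm(5)

set.seed(123)
second_draw <- rnorm(5)

# No reset here: this continues the random stream, so the values differ.
unseeded_draw <- rnorm(5)

print(identical(first_draw, second_draw))   # draws after the same seed match
print(identical(first_draw, unseeded_draw)) # draws without a reset do not
```

This is also why the slide's loop, which resets the seed on every iteration, produces four identical histograms.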
  • 29. Make it reproducible - pacman Make sure that the packages you use are installed on the running machine: #check if pacman package doesn’t exist then install it if(!require(pacman)){ install.packages("pacman") } #pacman will check the installation of packages , install them and load them into environment pacman::p_load("tidyverse", "caTools", "glmnet")
  • 30. Make it Reproducible Environment practices: ● Use Packrat for library management ● Use checkpoint ● Use Docker for full environment sharing
  • 31. Make it executable ● Use R Markdown for reporting analysis (will have a session on it later ;) ) ● Use Shiny apps for tools and interactive reports ● Use APIs for accessible models (plumber is your friend) ● Create packages
  • 32. Now it’s your turn Fork the repo of the world life expectancy dataset: https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-07-03 Create your own project Organize it your way Find out: ● Top 3 countries with the highest life expectancy in 2015. ● Top 3 countries that improved the most over the past 20 years. Share your repo with us on the meetup website The first 3 to submit following the mentioned guidelines will win a voucher worth 50 LE
  • 33. May the odds be ever in your favor!
  • 35. References - cont. Reproducibility and project organization: https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/ https://kbroman.org/steps2rr/pages/organize.html https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf Read me template: https://gist.github.com/PurpleBooth/109311bb0361f32d87a2
  • 36. References - Cont. Style guides: https://style.tidyverse.org/files.html#names https://google.github.io/styleguide/Rguide.xml Environment packaging : https://rstudio.github.io/packrat/walkthrough.html https://colinfay.me/docker-r-reproducibility/

Editor's notes

  1. https://happygitwithr.com/install-git.html#install-git-windows
  2. https://www.javaworld.com/article/2113465/git-smart-20-essential-tips-for-git-and-github-users.html
  3. https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/ https://kbroman.org/steps2rr/pages/organize.html https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf
  4. https://gist.github.com/PurpleBooth/109311bb0361f32d87a2
  5. https://google.github.io/styleguide/Rguide.xml
  6. https://rstudio.github.io/packrat/walkthrough.html