Sharing 101: Code Reproducibility & Sharing Series

Omnia Mohamed
Data Analytics Engineer, IBM

What you will NOT learn in this session
How to code in R
How to become a professional Git user

What you’ll get from this session
How to configure Git and R to play nicely together
How to organize your R projects
How to publish your first R project on GitHub
Some tips to make your code more shareable

Secure organized location for your code
Computer crashed?

Building a Career
● An essential skill in the job market
● Your GitHub account will be a portfolio of your data science projects
● Base for blogging

Team Collaboration
No need for shared folders
Easier tracking of changes
Code merging capabilities
Easy finger pointing

Who is who?
Git
Open-source version control system, originally developed in 2005.
GitHub
Web-based Git repository hosting service that offers all of Git's distributed revision control and source code management (SCM) functionality.

Where do I start?
Install R & RStudio
https://cloud.r-project.org/
https://www.rstudio.com/products/rstudio/download/
Install Git
https://gitforwindows.org/

Configure your local Git account
Open Git Bash and run the following commands:
git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list # this should show the settings you just made

Create your first repository
From the GitHub website:
Create a new repository

Create your first repository
Get it locally
Using RStudio:
1. File -> New Project -> Version Control -> Git
2. Paste the repository URL that GitHub shows for your new repository
Or from the Git Bash command line:
git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git

Write your first script
File -> New File -> R Script
Generate some random data
x <- rnorm(1000)
y <- x * 2 + rnorm(1000)
df <- data.frame(x, y)
Visualize it
library(ggplot2) #needed for ggplot()
ggplot(data = df, mapping = aes(x, y)) + geom_point()
Save!

Let’s land on git
Commit to the local repository
Add a commit message
Push to the remote repository
Check it out on the web

Tips for Git newbies
Write a meaningful message for every commit
Commit frequently
Push only tested code
Pull frequently

Sharing data science projects
The IKEA Mode vs. the Plug & Play Mode

The keys of a plug & play project
● Has a README file
● Follows a standard coding convention
● Has an organized project directory
● Contains reproducible code
● Produces executable outputs

README File
Project title
Project scope
Environment and version info
Prerequisites
Installation guide
Example of usage
Authors
Contribution
License
You don’t need to include all sections, only the ones that apply to your project.
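
As a starting point, the sections listed above could be scaffolded from R. This is only a sketch: the section names mirror the slide, while the helper name write_readme_skeleton, the "My Project" title, and the Markdown heading layout are assumptions made for illustration.

```r
# Hypothetical helper: writes a README.md skeleton with one heading per section.
write_readme_skeleton <- function(path, title, sections) {
  lines <- c(paste("#", title), "")
  for (section in sections) {
    lines <- c(lines, paste("##", section), "", "TODO", "")
  }
  writeLines(lines, path)
  invisible(path)
}

readme_sections <- c(
  "Project scope", "Environment and version info", "Prerequisites",
  "Installation guide", "Example of usage", "Authors", "Contribution", "License"
)

# Written to a temporary folder so the example does not touch a real project.
readme_path <- file.path(tempdir(), "README.md")
write_readme_skeleton(readme_path, "My Project", readme_sections)
```

Drop any section from readme_sections that does not apply to your project before generating the file.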

Project Directory organization
● Script files: also known as the “scripts” folder
● Markdown reports: each report has its own folder inside
● Data: your data is saved here under 2 folders, “Raw” for original data and “Preprocessed” for manipulated and cleaned data
● Shiny apps: each Shiny app has its own folder under this one
● You can add more folders as you need, like docs or figs
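
A layout like this could be created in one go from R. The sketch below is illustrative only: apart from "scripts", "raw", "preprocessed", "docs" and "figs", the folder names ("reports", "shiny", "data", "my_project") are assumptions, and a temporary folder stands in for the project root.

```r
# Folder names are assumptions; rename them to match your own project.
project_dirs <- c(
  "scripts",
  "reports",
  file.path("data", "raw"),
  file.path("data", "preprocessed"),
  "shiny",
  "docs",
  "figs"
)

# A temporary folder stands in for the project root in this sketch.
project_root <- file.path(tempdir(), "my_project")

for (dir_name in project_dirs) {
  # recursive = TRUE also creates missing parent folders such as "data"
  dir.create(file.path(project_root, dir_name),
             recursive = TRUE, showWarnings = FALSE)
}

# Show the folders that now exist under the project root
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```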

Standard coding convention
Tidyverse style guide
Google R style guide

Make it Readable
File names: meaningful, with no special characters, and prefixed with a sequence number if the files should run in order, e.g. 00_dataprep_functions.R
Attribute names: lowercase with _, e.g. expiry_date
Assignment: use <- instead of =, e.g. x <- 5 (RStudio shortcut: Alt + -)
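
One practical payoff of numbered file names is that scripts can be run in order without listing them by hand. The following is a minimal sketch under assumptions: the two script names and their contents are made up, and a temporary folder stands in for a real scripts folder.

```r
# A temporary folder with two tiny scripts stands in for a real "scripts" folder.
scripts_dir <- file.path(tempdir(), "ordered_scripts")
dir.create(scripts_dir, showWarnings = FALSE)
writeLines("expiry_days <- 30", file.path(scripts_dir, "00_settings.R"))
writeLines("expiry_weeks <- expiry_days / 7", file.path(scripts_dir, "01_derive.R"))

# list.files() returns names in alphabetical order, so zero-padded
# prefixes give the intended run order.
script_files <- list.files(scripts_dir, pattern = "\\.R$", full.names = TRUE)

# Run every script in a dedicated environment to keep the workspace clean.
run_env <- new.env()
for (script_file in script_files) {
  source(script_file, local = run_env)
}

basename(script_files)
```

The second script depends on a value defined by the first, which is exactly the situation the numeric prefix is meant to make explicit.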

Functions naming and commenting
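Use the same naming rules as for objects, and document each function with roxygen-style comments, e.g.:
#' Drop last column of dataframe
#' @param data A dataframe.
#' @return dataframe after dropping last column.
#' @examples
#' drop_last_col(iris)
drop_last_col <- function(data){
  dropped_data <- data[-length(data)] #length() of a dataframe is its number of columns
  return(dropped_data)
}
Notes on the example:
● The first comment line states the function objective
● The @param lines describe the function parameters
● The name is lowercase with no special characters
● The opening bracket comes right after the function definition
● The closing bracket goes at the end on a separate line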
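
To see the function in action, here is a short usage sketch. The function body is repeated (in condensed form) so the example runs on its own; the variable name trimmed_iris is just for illustration.

```r
# Same behavior as the function on the slide, repeated so this runs standalone.
drop_last_col <- function(data) {
  data[-length(data)]
}

trimmed_iris <- drop_last_col(iris)

ncol(iris)          # number of columns before
ncol(trimmed_iris)  # one fewer after dropping the last column
names(trimmed_iris) # the last column of iris (Species) is gone
```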

Make it Reproducible - here
here() builds file paths relative to the project root, so they work on any machine:
library(here)
file_name <- here("data", "file.csv")
#file_name now holds the path "myprojectrootfolder/data/file.csv"

Make it Reproducible - Seed
For data or results that depend on random number generation, use set.seed() to get the same results every time. Without a seed, each of the four histograms below comes from a different sample:
par(mfrow=c(2,2))
for(i in 1:4){
  x <- rnorm(1000)
  hist(x, main = paste0("fig",i))
}

Make it Reproducible - Seed
par(mfrow=c(2,2))
for(i in 1:4){
  set.seed(123) #resetting the seed before every draw makes all four histograms identical
  x <- rnorm(1000)
  hist(x, main = paste0("fig",i))
}
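
In everyday scripts the seed is usually set once at the top rather than before every draw. The sketch below illustrates that pattern; the helper name draw_samples is made up for this example.

```r
# Helper for this sketch only: sets the seed once, then draws four samples.
draw_samples <- function(seed) {
  set.seed(seed)
  lapply(1:4, function(i) rnorm(1000))
}

first_run  <- draw_samples(123)
second_run <- draw_samples(123)

# Same seed, same results across runs
identical(first_run, second_run)

# Within one run the four samples still differ from each other
identical(first_run[[1]], first_run[[2]])
```

Setting the seed once keeps the whole analysis repeatable while each random draw stays distinct.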

Make it reproducible - pacman
Make sure that the packages you use are installed on the running machine:
#install pacman if it is not already installed
if(!require(pacman)){
  install.packages("pacman")
}
#pacman checks whether each package is installed, installs the missing ones, and loads them all
pacman::p_load("tidyverse", "caTools", "glmnet")
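
If you only want to report what is missing without installing anything, a base R check is enough. This is a sketch, not a replacement for pacman: the helper name find_missing_packages and the made-up package name are assumptions for illustration.

```r
# Returns the packages from `pkgs` that are not installed; installs nothing.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" ships with R; the second name is a made-up package used to show a miss.
missing_pkgs <- find_missing_packages(c("stats", "notARealPackage123"))
missing_pkgs
```

A check like this can sit at the top of a script to fail early with a clear message instead of halfway through an analysis.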

Make it Reproducible
Environment practices:
● Use Packrat for library management
● Use checkpoint to pin package versions to a snapshot date
● Use Docker for full environment sharing
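
A lighter-weight complement to these tools is to record the session details alongside your results, which also covers the "Environment and version info" part of a README. The sketch below uses only base R; the file name session_info.txt is an assumption.

```r
# Capture the printed session information (R version, OS, loaded packages).
session_lines <- capture.output(sessionInfo())

# Saved to a temporary folder here; in a real project this could sit next to the README.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, session_file)

# The first line typically reports the R version in use
session_lines[1]
```

This does not restore an environment the way Packrat or Docker can; it only documents what was used.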

Make it executable
● Use R Markdown for reporting analysis (there will be a session on it later ;) )
● Use Shiny apps for tools and interactive reports
● Use APIs for accessible models (plumber is your friend)
● Create packages

Now it’s your turn
Fork the repo containing the world life expectancy dataset:
https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-07-03
Create your own project
Organize it your way
Find out:
● The top 3 countries with the highest life expectancy in 2015.
● The top 3 countries that improved the most over the past 20 years.
Share your repo with us on the meetup website
The first 3 to submit following the guidelines above will win a voucher worth 50 LE

May the odds be ever in your favor!

References
Git Resources:
https://git-scm.com/book/en/v2
https://happygitwithr.com/install-git.html#install-git-windows
https://www.javaworld.com/article/2113465/git-smart-20-essential-tips-for-git-and-github-users.html

References - cont.
Reproducibility and project organization:
https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/
https://kbroman.org/steps2rr/pages/organize.html
https://github.com/swcarpentry/good-enough-practices-in-scientific-computing/blob/gh-pages/good-enough-practices-for-scientific-computing.pdf
README template:
https://gist.github.com/PurpleBooth/109311bb0361f32d87a2

References - Cont.
Style guides:
https://style.tidyverse.org/files.html#names
https://google.github.io/styleguide/Rguide.xml
Environment packaging :
https://rstudio.github.io/packrat/walkthrough.html
https://colinfay.me/docker-r-reproducibility/
