2. What you will NOT learn during this session?
How to Code in R
How to become a professional Git user
3. What you’ll get from this session?
How to configure Git and R to play nice together
How to organize your R projects
How to publish your first R project on GitHub
Some tips to make your code more shareable
7. Building a Career
● An essential skill in the job market
● Your GitHub account will be a portfolio of your data science projects
● A base for blogging
8. Team Collaboration
No need for shared folders
Easier tracking of changes
Code merging capabilities
Easy finger pointing
9. Who is who?
Git
Open source project for version control originally developed in 2005.
GitHub
Web-based Git repository hosting service, which offers all of the distributed revision
control and source code management (SCM) functionality of Git.
10. Where do I start?
Install R & RStudio
https://cloud.r-project.org/
https://www.rstudio.com/products/rstudio/download/
Install Git
https://gitforwindows.org/
11. Configure your local Git account
Open Git Bash and run the following commands:
git config --global user.name 'Jane Doe'
git config --global user.email 'jane@example.com'
git config --global --list #this should show the configurations you just set
12. Create your first repository
From the GitHub website:
Create a new repository
13. Create your first repository
Get it local
Using R Studio :
1. File -> New Project -> Version Control -> Git
2. Paste the repository URL that you get from the repository page on GitHub
Or from git bash command line:
git clone https://github.com/YOUR-USERNAME/YOUR-REPOSITORY.git
14. Write your first script
File -> New File -> R Script
Generate some random data
x <- rnorm(1000)
y <- x * 2 + rnorm(1000)
df <- data.frame(x, y)
Visualize it
library(ggplot2)
ggplot(data = df, mapping = aes(x, y)) + geom_point()
Save!
20. The keys of a plug & play project
● Has a README file
● Standard coding convention
● Organized project directory
● Reproducible code
● Executable outputs
21. Read Me File
Project Title
Project scope
Environment and version info
Prerequisite
Installation guide
Example of usage
Authors
Contribution
License
You don’t need to include all sections, only the ones that apply to your project.
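The section list above can be turned into a starter file. This is a minimal sketch, assuming the README is written in Markdown as `README.md`; the helper name `write_readme_skeleton` is hypothetical and not part of any package.

```r
# Hypothetical helper: writes a README skeleton with the sections listed above.
# Markdown headings ("#", "##") are an assumption about the README format.
write_readme_skeleton <- function(path, title = "Project Title") {
  sections <- c("Project scope", "Environment and version info",
                "Prerequisites", "Installation guide", "Example of usage",
                "Authors", "Contribution", "License")
  # One "## Section" heading followed by a blank line per section
  body <- unlist(lapply(sections, function(s) c(paste("##", s), "")))
  writeLines(c(paste("#", title), "", body), path)
  invisible(path)
}

# Written to a temporary folder here; in a real project use the project root
readme_path <- file.path(tempdir(), "README.md")
write_readme_skeleton(readme_path, title = "My first R project")
readLines(readme_path)
```

Delete the headings that do not apply to your project after generating the file.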
22. Project Directory organization
● Script files: kept in one folder, also known as the “scripts” folder
● Markdown reports: each markdown has a folder inside
● Data: saved under 2 folders, “Raw” for original data and “Preprocessed” for manipulated and cleaned data
● Shiny apps: each shiny app has its own folder under a common parent folder
● You can have additional folders as you need, like docs or figs
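A layout like this can be created from R in one go. The sketch below is only an illustration: "scripts", "Raw", "Preprocessed", "docs" and "figs" come from the slide, while "markdown", "data", "shiny" and the project name "my_project" are assumed names.

```r
# Build the folder skeleton in a temporary location (swap in your project path).
# "markdown", "data", "shiny" and "my_project" are assumed names for illustration.
project_root <- file.path(tempdir(), "my_project")

folders <- c(
  "scripts",
  "markdown",
  file.path("data", "Raw"),
  file.path("data", "Preprocessed"),
  "shiny",
  "docs",
  "figs"
)

for (folder in folders) {
  # recursive = TRUE also creates missing parent folders such as "data"
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```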
24. Make it Readable
File names: meaningful, with no special characters, and prefixed with the order of the file if the
files should run in sequence, e.g. 00_dataprep_functions.R
Attribute names: lowercase with _, e.g. expiry_date
Assignment: use <- instead of =, e.g. x <- 5 (RStudio shortcut: Alt + -)
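If existing column names do not follow the lowercase-with-underscore convention, they can be converted with base R. The helper `to_snake_case` below is a hypothetical sketch, not a function from any package, and it only handles simple cases.

```r
# Hypothetical helper: convert names to lowercase_with_underscores
to_snake_case <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # split camelCase: expiryDate -> expiry_Date
  x <- gsub("[^A-Za-z0-9]+", "_", x)            # replace runs of special characters with _
  x <- gsub("^_+|_+$", "", x)                   # trim leading and trailing underscores
  tolower(x)
}

to_snake_case(c("Expiry Date", "expiryDate", "Sepal.Length"))
# "expiry_date" "expiry_date" "sepal_length"

# Applied to a data frame's column names
names(iris) <- to_snake_case(names(iris))
```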
25. Functions naming and commenting
Same naming as objects, e.g.:
#' Drop last column of dataframe
#' @param data A dataframe.
#' @return dataframe after dropping last column.
#' @examples
#' drop_last_col(iris)
drop_last_col <- function(data) {
  dropped_data <- data[-c(length(data))]
  return(dropped_data)
}
● First comment line: function objective
● @param lines: function parameters
● Name is lowercase with no special characters; opening bracket right after the function
definition
● Closing bracket at the end on a separate line
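A quick way to confirm that a documented function does what its comment promises is to run the documented example and inspect the result. The sketch below repeats the function so it runs on its own and uses the built-in iris dataset.

```r
# Same function as on the slide, repeated so this snippet is self-contained
drop_last_col <- function(data) {
  dropped_data <- data[-c(length(data))]
  return(dropped_data)
}

trimmed <- drop_last_col(iris)

ncol(iris)     # 5
ncol(trimmed)  # 4
names(trimmed) # the last column, Species, is gone
```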
26. Make it Reproducible - here
here():
library(here)
file_name <- here("data", "file.csv")
# file_name now holds the path "myprojectrootfolder/data/file.csv", built from the project root
27. Make it Reproducible - Seed
For reproducing data or results that depend on random number generation, use set.seed() to
ensure the same results every time.
par(mfrow=c(2,2))
for(i in 1:4){
x <- rnorm(1000)
hist(x, main = paste0("fig",i))
}
28. Make it Reproducible - Seed
par(mfrow=c(2,2))
for(i in 1:4){
set.seed(123)
x <- rnorm(1000)
hist(x, main = paste0("fig",i))
}
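The effect of the seed can also be checked without plots. This is a small sketch using only base R; the seed value 123 is arbitrary.

```r
# Same seed, same numbers
set.seed(123)
first_draw <- rnorm(5)

set.seed(123)
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE

# Drawing again without resetting the seed continues the random stream
third_draw <- rnorm(5)
identical(first_draw, third_draw)   # FALSE
```

Note that setting the seed inside the loop, as on the slide, makes all four histograms identical. In an analysis script it is more common to call set.seed() once at the top, so the whole run is repeatable while the individual draws still differ.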
29. Make it reproducible - pacman
Make sure that the packages you use are installed on the running machine:
#check if pacman package doesn’t exist then install it
if(!require(pacman)){
install.packages("pacman")
}
#pacman checks whether each package is installed, installs any that are missing, and loads them all
pacman::p_load("tidyverse", "caTools", "glmnet")
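To only report what is missing, without installing anything, a base R check is enough. This is a sketch of an alternative to the pacman call, not a replacement for it; the last package name in the list is made up on purpose to show a miss.

```r
# "aPackageThatDoesNotExist" is a hypothetical name used to demonstrate a missing package
required_pkgs <- c("stats", "utils", "aPackageThatDoesNotExist")

# requireNamespace() returns TRUE/FALSE without attaching the package
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required_pkgs[!is_installed]
missing_pkgs
# "aPackageThatDoesNotExist"
```

Printing `missing_pkgs` at the top of a script tells a collaborator what to install before anything fails halfway through.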
30. Make it Reproducible
Environment practices:
● Use Packrat for library management
● Use checkpoint
● Use Docker for full environment sharing
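A lighter complement to these tools is to record the R version and loaded packages alongside your results. The sketch below uses base R only; the file name `session_info.txt` is an assumption, and this records the environment rather than restoring it.

```r
# Save the current R version, platform, and attached packages to a text file.
# Written to a temporary folder here; in a project, a docs folder is one option.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Peek at the first lines of the saved record
head(readLines(session_file), 3)
```

The contents of this file can be pasted into the "Environment and version info" section of the README.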
31. Make it executable
● Use R Markdown for reporting analysis (will have a session on it later ;) )
● Use Shiny apps for tools and interactive reports
● Use APIs for accessible models (plumber is your friend)
● Create packages
32. Now it’s your turn
Fork the repo containing the world life expectancy dataset:
https://github.com/rfordatascience/tidytuesday/tree/master/data/2018/2018-07-03
Create your own project
Organize it your way
Find out :
● Top 3 countries with the highest life expectancy in 2015.
● Top 3 countries that improved the most over the past 20 years.
Share your repo with us on the meetup website
The first 3 to submit following the mentioned guidelines will win a voucher worth 50 LE.