This document provides an agenda for an R programming presentation. It includes an introduction to R, commonly used packages and datasets in R, basics of R like data structures and manipulation, looping concepts, data analysis techniques using dplyr and other packages, data visualization using ggplot2, and machine learning algorithms in R. Shortcuts for the R console and IDE are also listed.
2. AGENDA
• Introduction to R
• Packages Covered
• Datasets Covered
• Basics of R
• Looping in R
• Data Analysis in R
• Machine Learning Algorithms
3. Introduction to R
• A Programming Language & free software
environment for statistical computing.
• Widely used among statisticians and data
miners for developing statistical software
and for data analysis; RStudio is its most
popular graphical front end.
• Highly extensible through user-submitted
packages for specific functions.
4. RStudio Layout (screenshot panels) : R Script, R Console, Global Environment, and the Plots, Packages & Help Tab.
R differs from RStudio. One can use R without RStudio, but cannot use RStudio without R, so R comes first.
5. Shortcuts Used
• Ctrl + L : Clears R Console.
• Alt + - : Inserts the assignment operator (<-).
• Ctrl + Shift + M : Inserts the Pipe Operator (%>%).
• Ctrl + Shift + N : Opens a new R Script.
• Ctrl + O : Opens an existing R Script.
• Ctrl + S : Saves the current R Script.
• Ctrl + Q : Quits the current R Session.
7. • Amelia : A program for missing data.
• animation : A Gallery of animations in Statistics & Utilities to create animations.
• car : Companion to Applied Regression.
• caret : Classification and Regression Training.
• caTools : Tools : Moving window statistics, GIF, Base64, ROC AUC etc.
• class : Functions for Classification.
• corrplot : Visualization of a Correlation Matrix.
• cowplot : Streamlined Plot Theme and Plot Annotations for ‘ggplot2’.
• dplyr : A Grammar of Data Manipulation.
• e1071 : Misc Functions of the Department of Statistics, Probability Theory Group, TU Wien.
• ggplot2 : Create Elegant Data Visualisations using the Grammar of Graphics.
• ggplot2movies : Movies Data
• ggrepel : Automatically position non-overlapping text labels with ggplot2.
8. • hflights : Flights that departed Houston in 2011.
• leaflet : Create Interactive Web Maps with the JavaScript ‘Leaflet’ library.
• magrittr : A Forward-Pipe Operator for R.
• Metrics : Evaluation Metrics for Machine Learning.
• mlr : Machine Learning in R.
• partykit : A Toolkit for Recursive Partytioning.
• plyr : Tools for Splitting, Applying and Combining Data.
• randomForest : Breiman & Cutler’s Random Forests for classification & regression.
• rpart : Recursive Partitioning and Regression Trees.
• scales : Scale Functions for Visualization.
• tibble : Simple Data Frames.
• tidyr : Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions.
• VIM : Visualization and Imputation of Missing values.
10. • In-Built datasets like iris, mtcars, state.x77 are also used extensively apart from the following mentioned datasets :
1. Bike Data : A 121 x 9 dataset to understand basic indexing in R.
2. Future 500 : A 500 x 10 dataset to understand data visualization.
3. Movies Data : A 58788 x 24 dataset that comes under ggplot2movies package.
4. Weather (Australia) : A 144187 x 24 dataset to explore tools of dplyr package.
5. Flights Data : A 227496 x 21 dataset that comes under hflights package.
6. Meteorology Data : Weather datasets of 4 cities in US to understand looping in R.
7. Big Mart Sales Data : A dataset to understand Exploratory Data Analysis in R.
8. Basketball, Wine Quality Data : Datasets to understand Decision Tree mechanism.
9. Titanic, Home Credit Data : Datasets downloaded from Kaggle to understand ML Basics.
10. Income Data : A dataset to further practice Supervised Machine Learning Algorithms.
11. Social Network Ads Data : A 400 x 5 dataset to understand k-NN Algorithm.
12. Uber Data : A dataset showing Month-wise details of customers boarding cabs & their location.
12. • weekend <- c("Sat", "Sun") : Saves a variable named weekend in the environment containing 2 character
values : Sat, Sun
• data <- read.csv(file.choose(), header = T, sep = ",", na.strings = c(" ")) : Standard code for reading a csv file in R.
• class(data) : States the class type of the dataset.
• str(data) : Displays the structure/class of all the variables within the data.
• summary(data) : Shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each numeric variable.
• df <- as.data.frame(data) : Stores data into a data frame.
• matx <- as.matrix(data) : Stores data into a matrix.
• getwd() : Displays the working directory.
• rownames(data) : Shows all the row names of the data.
• colnames(data) : Shows all the column names of the data.
• nrow(data) : Shows the count of number of rows in the dataset.
• ncol(data) : Counts the number of columns in the dataset.
13. • length(data) : Shows the length (variables count) of the dataset.
• install.packages("packagename") : Installs a package in R.
• library(package) : Loads an installed package so its functions can be used.
• names(data) : Displays all the variables of the dataset.
• dim(data) : Shows the dimensions of the data (No. of rows & columns).
• sum(is.na(data)) : Shows the total NA values present in the dataset.
• attach(df) : Attaches the dataset df to the search path, so its columns can be referred to by name.
• detach(df) : Detaches the dataset df.
• head(data) : Prints first 6 rows of the dataset.
• tail(data) : Prints last 6 rows of the dataset.
• print(tibble, n=20) : Prints first 20 rows of the dataset.
14. • data$variable<-as.character(data$variable) : Saving the variable as character type.
• data[1,] : Prints 1st row of the dataset.
• data[,3] : Prints 3rd column of the dataset.
• table(data$variable) : Tabular view of the variable (Frequency Distribution).
• round(mean(data$variable, na.rm = T), 2) : Rounds off the mean of variable (removing NA values) up to 2 decimals.
• sort(prop.table(table(data$variable)), decreasing = T) : Tabular view of the proportions of each value of variable, in descending order.
• complete.cases(variable) : !is.na(variable) : Boolean output (TRUE/FALSE); shows TRUE for non-NA values.
• list1 <- list(mtcars, iris) : Stores a variable named list1 containing list of 2 mentioned datasets in the global
environment.
• which(iris$Species == "setosa") : Displays all the row numbers where the given condition is satisfied.
• rm(list = ls()) : Clears all the stored datasets and variables from the global environment.
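The inspection and indexing commands above can be tried together on the built-in iris dataset. This is a minimal sketch using base R only; the object name flowers is illustrative and not part of the original slides.

```r
# Work on a copy of the built-in iris data so the original stays untouched
flowers <- iris

class(flowers)          # "data.frame"
dim(flowers)            # rows and columns: 150 5
names(flowers)          # column (variable) names
str(flowers)            # type of every column
sum(is.na(flowers))     # total number of missing values

# Indexing: first row, third column, and rows matching a condition
first_row    <- flowers[1, ]
third_column <- flowers[, 3]
setosa_rows  <- which(flowers$Species == "setosa")

# Frequency table converted to proportions, largest first
species_share <- sort(prop.table(table(flowers$Species)), decreasing = TRUE)

# Mean of one variable, ignoring NA values, rounded to 2 decimals
mean_sepal <- round(mean(flowers$Sepal.Length, na.rm = TRUE), 2)
```

Note that prop.table() is applied to the result of table(), not to the raw column.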
16. A <- c("what","is","truth")
• COMMAND :
if("Truth" %in% A){
print("Truth is found")
}else{
print("Not Found")
}
• OUTPUT :
Not Found
A <- c(1:10)
• COMMAND :
for(i in 1:3){
print(A)
}
• OUTPUT :
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
apply(mtcars, 1, mean)
• OUTPUT :
Displays mean of all variables row-wise.
lapply(weather, t) [weather is a list holding the Meteorology Data, the 6th dataset listed earlier]
• OUTPUT :
Displays transpose of data in the form of a list.
lapply(weather, "[",1) : Access 1st column of all
datasets in the list.
• Creating a Function :
missingvalue <- function(x){
return(sum(is.na(x)))
}
sapply(weather, missingvalue)
• OUTPUT :
Displays the count of missing values for each dataset as a named vector.
• Alternative Code for the same task :
sapply(weather, function(x) sum(is.na(x)))
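The looping ideas on this slide can be compared side by side. The sketch below assumes a small made-up list, city_data, as a stand-in for the weather list (the real Meteorology Data is not reproduced here); count_missing is likewise an illustrative name.

```r
# Hypothetical stand-in for the `weather` list: a named list of data frames
city_data <- list(
  city_a = data.frame(temp = c(21, NA, 25), rain = c(0, 3, 1)),
  city_b = data.frame(temp = c(30, 28, 27), rain = c(NA, NA, 1))
)

# for loop: visit each data frame by name
for (city in names(city_data)) {
  cat(city, "has", nrow(city_data[[city]]), "rows\n")
}

# lapply returns a list: here, the first column of every data frame
first_columns <- lapply(city_data, "[", 1)

# sapply simplifies the result to a vector: NA count per data frame
count_missing <- function(x) sum(is.na(x))
missing_per_city <- sapply(city_data, count_missing)

# apply works over rows (1) or columns (2) of a matrix-like object
row_means <- apply(mtcars, 1, mean)
```

lapply() always returns a list, while sapply() simplifies to a vector when every result has length one.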
18. 1. The first step in tackling a typical business problem is to form a hypothesis; Exploratory
Data Analysis then runs through Step 6.
2. Data will be received - Extract all the Variables/Features from the data.
3. Understanding the Data in Detail - Finding Patterns, Outliers, Anomalies & Missing Values.
4. Performing Univariate Analysis - Single Variable Analysis.
5. Conduct Multivariate Analysis - Categorical Vs. Categorical, Numerical Vs. Categorical,
Numerical Vs. Numerical Variables.
6. Missing Values Imputation, Treatment of Outliers & Anomalies (if any).
7. Apply Feature Engineering - Variable Transformation.
8. Perform Scaling of the Data - Bringing the variables onto a comparable scale (e.g., mean 0 and standard deviation 1).
9. Apply ML Algorithm, make predictions & test the Accuracy.
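Several of the steps above can be walked through on the built-in airquality dataset. This is only a sketch under assumptions: the dataset, the median imputation and the log transform are illustrative choices, not the ones used in the original presentation.

```r
# Built-in airquality data: Ozone and Solar.R contain missing values
aq <- airquality

# Step 3: understand the data - missing values per variable
missing_counts <- colSums(is.na(aq))

# Step 4: univariate analysis of one variable
summary(aq$Ozone)

# Step 5: numerical vs. numerical relationship, using complete pairs only
ozone_temp_cor <- cor(aq$Ozone, aq$Temp, use = "complete.obs")

# Step 6: impute missing Ozone values with the median
aq$Ozone[is.na(aq$Ozone)] <- median(aq$Ozone, na.rm = TRUE)

# Step 7: feature engineering - log transform of a skewed variable
aq$Log_Ozone <- log(aq$Ozone)

# Step 8: scaling - centre to mean 0 and scale to standard deviation 1
aq$Temp_Scaled <- as.numeric(scale(aq$Temp))
```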
19. • Pipe Operator :
%>% : Passes object on left hand side as first argument of function on right hand side.
E.G : iris %>% names() : States all the variable names of the iris dataset.
• Tibble :
iris <- as_tibble(iris) : Converts data to the tibble class, which is easier to examine than a data frame (tbl_df() is the older, now deprecated name).
• Select() :
iris %>% select(Sepal.Length, Petal.Length, Species) : Select columns by name
• Filter() :
filter(mtcars, cyl>4 & gear>4) : Extract rows that meet logical criteria.
• Rename() :
iris <- iris %>% rename(Sepal_Length = Sepal.Length) : Renaming a Variable.
20. • Select_If() :
iris.num <- iris %>% select_if(is.numeric) : Keeps only the columns that satisfy a condition (here, the numeric ones).
• Match Operator ( %in% ) :
8 %in% c(1,2,9,5,3,6,7,4,5) : OUTPUT : FALSE
iris %>% filter(Species %in% c("setosa", "virginica")) %>% summary() : Summarises the data where
the condition is TRUE.
• Helper Functions :
select(iris, contains(".")) : Select columns whose name contains a character string.
select(iris, starts_with("Sepal")) : Select columns whose name starts with a character string.
select(iris, ends_with("Length")) : Select columns whose name ends with a character string.
select(iris, everything()) : Select every column.
select(iris, Sepal.Length:Petal.Width) : Select all columns between Sepal.Length and Petal.Width
(inclusive).
select(iris, -Species) : Select all columns except Species.
iris <- iris %>% select(Sepal_Length = Sepal.Length) : Renames a variable while selecting it; only the selected column is kept.
21. • Arrange() :
hflights %>% arrange(DepDelay) : Arranges the data in ascending order by DepDelay.
hflights %>% arrange(desc(DepDelay)) : Arranges the data in descending order by DepDelay.
• Group_By() :
iris %>% group_by(Species) : Groups rows that share the same value of Species.
• Summarise() :
iris %>% group_by(Species) %>% summarise(Count=n()) : Summarises the count of Species in
data.
hflights %>% group_by(Month) %>% summarise(dist = sum(Distance)) %>% summarise(
Minimum_Distance = min(dist), Maximum_Distance = max(dist), Mean_Distance = mean(dist),
Median_Distance = median(dist)) : Summarises the min., max., mean and median of total monthly
distances.
• Tally() :
hflights %>% group_by(Month) %>% tally(sort = T) : Counts the rows in each group and sorts the counts in descending order.
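The dplyr verbs covered so far are usually chained into one pipeline. A minimal sketch, assuming the dplyr package is installed and using the built-in mtcars data instead of hflights; the threshold hp > 100 and the column names Count and Mean_MPG are illustrative.

```r
library(dplyr)

# Select, filter, group, summarise and sort in one pipeline on mtcars
cyl_summary <- mtcars %>%
  select(mpg, cyl, hp) %>%            # keep three columns
  filter(hp > 100) %>%                # keep rows with hp above 100
  group_by(cyl) %>%                   # one group per cylinder count
  summarise(Count = n(),              # rows per group
            Mean_MPG = mean(mpg)) %>% # average mpg per group
  arrange(desc(Mean_MPG))             # highest average first

print(cyl_summary)
```

Each verb receives the result of the previous step as its first argument, which is what the pipe operator does.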
22. • Mutate() :
mutate(iris, Sepal = Sepal.Length + Sepal.Width) : Compute and append one or more new columns.
• Transmute() :
transmute(iris, sepal = Sepal.Length + Sepal.Width) : Computes one or more new columns and drops the
original columns.
• Slice() :
slice(hflights, 100:106) : Slice rows by position.
• Distinct() :
iris %>% select(Species) %>% distinct() : Removes duplicates.
• If_Else() :
df <- data.frame(x=c(1,NA,6,5))
df <- df %>% mutate(New_Variable = if_else(x<5, x+1, x+2, 0)) : Last argument is to replace
missing values (NA).
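The difference between mutate() and transmute(), and the role of the last if_else() argument, can be checked directly. A short sketch, assuming dplyr is installed; with_sum and only_sum are illustrative names.

```r
library(dplyr)

df <- data.frame(x = c(1, NA, 6, 5))

# if_else(condition, true, false, missing): `missing` replaces NA results
df <- df %>% mutate(New_Variable = if_else(x < 5, x + 1, x + 2, missing = 0))

# mutate keeps the original columns, transmute keeps only the new one
with_sum <- mutate(iris, Sepal = Sepal.Length + Sepal.Width)
only_sum <- transmute(iris, Sepal = Sepal.Length + Sepal.Width)

ncol(with_sum)   # 6
ncol(only_sum)   # 1
```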
23. • Union() :
union(y, z) : union(mtcars[1:16,], mtcars[17:32,]) : Rows that appear in either or both y and z.
• Intersect() :
intersect(y, z) : intersect(mtcars[1:16,], mtcars[16:32,]) : Rows that appear in both y and z.
• Between() :
hflights %>% filter(between(DepTime,600,605)) : Displays rows with variable value(DepTime) lying
between specified values.
• Count() :
hflights %>% count(Month, sort = T) : A shortcut that does grouping as well as creating a frequency
table.
• Bind_Rows() :
bind_rows(y, z) : bind_rows(mtcars[1:16,], mtcars[17:32,]) : Append z to y as new rows.
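The set operations differ from bind_rows() in how they treat duplicate rows. The sketch below assumes dplyr is installed and deliberately lets y and z share one row (row 16) so the difference is visible; storing the car names in a model column is an illustrative choice.

```r
library(dplyr)

# Keep the car names as a regular column (tibbles do not keep row names)
cars <- as_tibble(mtcars, rownames = "model")
y <- cars[1:16, ]
z <- cars[16:32, ]   # row 16 appears in both y and z

in_either <- union(y, z)        # duplicates removed: 32 distinct rows
in_both   <- intersect(y, z)    # only the shared row
stacked   <- bind_rows(y, z)    # plain append, shared row appears twice: 33 rows

# between() is inclusive on both ends
mid_mpg <- cars %>% filter(between(mpg, 20, 22))
```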
25. • Density Plot :
• ggplot(movies, aes(rating)) + geom_density(color = "red") + labs(title="Density Plot",
subtitle="Movies Data", x="Ratings", y="Density", caption="Source : Movies Data") + theme_grey()
• Heat Map (Bin2d Map) :
• ggplot(movies, aes(x=year, y=rating)) + geom_bin2d() + labs(title = "Heat Map", subtitle = "Year
Vs. Rating", x="Year", y="Rating", caption="Source : Movies Data") + theme_classic()
• Violin Plot :
• iris %>% ggplot(aes(Species, Sepal.Length)) + geom_violin(fill = "red", alpha = 0.75) + labs(title =
"Violin-Plot", subtitle = "Species Vs. Sepal.Length", caption = "Source : Iris Data")
• Correlation Plot :
• corrplot(cor(mtcars),method = "circle") : Works with data containing all numeric variables only.
• Cowplot :
• plots <- plot_grid(A, B, nrow = 1) : Here, A, B and C are 3 stored plots respectively.
• plot_grid(plots, C, ncol = 1) : This command will plot all 3 plots together.
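Every ggplot2 call above follows the same pattern: data, aesthetic mapping, one or more geoms, then labels and a theme. A minimal sketch, assuming ggplot2 is installed; it uses mtcars and iris rather than the movies data so it runs without extra packages.

```r
library(ggplot2)

# Density plot of one numeric variable
density_plot <- ggplot(mtcars, aes(mpg)) +
  geom_density(color = "red") +
  labs(title = "Density Plot", x = "Miles per Gallon", y = "Density") +
  theme_grey()

# Violin plot of a numeric variable split by a categorical one
violin_plot <- ggplot(iris, aes(Species, Sepal.Length)) +
  geom_violin(fill = "red", alpha = 0.75) +
  labs(title = "Violin Plot", subtitle = "Species Vs. Sepal.Length")

# Plots are ordinary objects: printing one draws it
print(density_plot)
```

Because plots are stored objects, they can be passed on to functions such as plot_grid().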
26. • GG Repel :
• ggplot(mtcars, aes(x=mpg, y=hp)) + geom_point(color='red', size=5) +
geom_text_repel(aes(label = rownames(mtcars)), color='blue') + theme_minimal()
• Faceting & Flipping :
• Facet divides a plot into sub-plots.
• ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))+ geom_point(aes(color=Species, size=Species))
+ facet_wrap(~Species)
• ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(aes(color = factor(cyl)), size = 5) +
facet_grid(. ~ cyl) + coord_flip() : A fixed size goes outside aes().
• Tidyr :
• future500$Revenue <- gsub("$", "=", future500$Revenue, fixed = TRUE) : Replaces the $ symbol with the = symbol in
all the Revenue entries (base R; fixed = TRUE is needed because $ is a regular-expression anchor).
• future500 <- separate(future500, State, into=c("State", "City"), sep = ",") : Separating “State” in 2
separate columns.
• Leaflet :
• leaflet() %>% addTiles() %>% addMarkers(lat = 40.7223, lng = -73.9887)
• OUTPUT : Displays a map view of the location at the latitude and longitude entered.
27. • VIM :
• aggr(future500) : Calculate or plot the amount of missing/imputed values in each variable of the
dataset.
• Substr() Command :
• substr(iris$Species, 1, 3) : Extracts the first 3 characters of each value in the Species variable.
• Ifelse() Command :
• iris$Species_Num <- ifelse(iris$Species=="setosa", 1, ifelse(iris$Species=="versicolor", 2, 3))
• Revalue() Command :
• iris$Species <- revalue(iris$Species, c("setosa"="1","versicolor"="2","virginica"="3")) [plyr Package]
• Recode() Command :
• iris$Species <- recode(iris$Species, "c('setosa')='Type-1'") [car Package syntax; dplyr's recode() takes setosa = "Type-1" instead]
• Writing a CSV File :
• write.csv(iris, file = "Iris Dataset.csv", row.names = F)
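The recoding commands above can also be done in base R alone, which avoids clashes between packages that export a function of the same name. A sketch on a copy of iris; the column names Species_Num, Species_Type and Species_Short are illustrative.

```r
# Work on a copy so the built-in iris data is not modified
flowers <- iris

# Nested ifelse(): map each species to a number
flowers$Species_Num <- ifelse(flowers$Species == "setosa", 1,
                       ifelse(flowers$Species == "versicolor", 2, 3))

# Base-R alternative to revalue(): relabel the factor levels directly
flowers$Species_Type <- factor(flowers$Species,
                               levels = c("setosa", "versicolor", "virginica"),
                               labels = c("Type-1", "Type-2", "Type-3"))

# First three characters of each species name
flowers$Species_Short <- substr(as.character(flowers$Species), 1, 3)

# Cross-check: each number should line up with exactly one label
table(flowers$Species_Num, flowers$Species_Type)
```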
28. • Strsplit() Command :
• strsplit(future500$State, ",") :Splits the data whenever it sees , separator. Variable must be
character type only.
• Regression :
• summary(lm(target_variable~., data=dataset)) : Target Variable must be converted to numeric
class. Higher the Adjusted R^2, better the model.
• Impute() : [mlr Package]
• imputeddata <- impute(future500, classes = list(factor = imputeMode(), numeric =
imputeMedian()))
• future500 <- imputeddata$data
• Na.Roughfix : [randomForest Package]
• future500 <- na.roughfix(future500) : Imputes missing values with the median (numeric) or mode (factor).
• Skewness and Kurtosis : [e1071 Package]
• skewness(iris$Sepal.Length) : Gives skewness of the numeric variable.
• kurtosis(iris$Sepal.Width) : Gives kurtosis of the numeric variable.
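Median/mode imputation, as performed by the commands above, is simple enough to sketch in base R. The helper impute_median_mode and the toy data frame are hypothetical and only illustrate the idea; they are not the internals of either package.

```r
# Hypothetical helper: median for numeric columns, most frequent level for factors
impute_median_mode <- function(df) {
  for (col in names(df)) {
    is_missing <- is.na(df[[col]])
    if (!any(is_missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][is_missing] <- median(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      most_common <- names(which.max(table(df[[col]])))
      df[[col]][is_missing] <- most_common
    }
  }
  df
}

# Toy data with one missing numeric value and one missing factor value
toy <- data.frame(
  revenue = c(10, NA, 30, 50),
  state   = factor(c("NY", "NY", NA, "TX"))
)
filled <- impute_median_mode(toy)
```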
29. • Splitting Iris Data : [caret Package]
index = createDataPartition(iris$Species, p=0.5, list = F)
train <- iris[index,]
test <- iris[-index,]
• Splitting Iris Data : [caTools Package]
split.data <- sample.split(iris$Species, SplitRatio = 0.75)
training_set <- subset(iris, split.data==T)
test_set <- subset(iris, split.data==F)
• Train Dataset – Data having predictor variables and the target variable.
• Test Dataset - Model is tested over this data for accuracy.
• combined<- bind_rows(train, test) : Dataset on which Feature Engineering is performed.
• train_new and test_new are 2 datasets extracted from combined having same dimensions as
train and test data with no missing values(except Target Variable in test_new) & additional
features.
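For comparison, a train/test split can be sketched in base R as well. Unlike createDataPartition() and sample.split(), this simple version is not stratified by Species; the 75% ratio and the name train_rows are illustrative.

```r
set.seed(123)  # any fixed number makes the random split reproducible

# Draw 75% of the row numbers at random for training
train_rows <- sample(nrow(iris), size = floor(0.75 * nrow(iris)))

train <- iris[train_rows, ]    # used to fit the model
test  <- iris[-train_rows, ]   # held back to measure accuracy

nrow(train)   # 112
nrow(test)    # 38
```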
31. • Setting Seed :
• set.seed(123) : Makes random sampling reproducible; any number can be used.
This line is run every time before sampling or before fitting a model.
• Decision Tree :
• model_dtree <- rpart(target_variable~., data = train_new, method = "class", control =
rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4)) [Method can be "class" or "anova"]
• Visualizing Decision Tree :
• plot(as.party(model_dtree))
• Predicting Decision Tree Outcomes :
• predict_dtree <- predict(model_dtree, test_new, type = "class")
• Creating Confusion Matrix :
• DtreeCM <- confusionMatrix(predict_dtree, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeCM$overall[1]))
• Altering the 3 parameters minsplit, minbucket & maxdepth of a Decision Tree : Hyperparameter
Tuning
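The decision tree workflow above can be run end to end on iris. This is a sketch under assumptions: iris replaces train_new/test_new, the control values are illustrative rather than tuned, and a plain table() stands in for confusionMatrix() so that only rpart is needed.

```r
library(rpart)   # recommended package, bundled with standard R installations

set.seed(123)
train_rows <- sample(nrow(iris), size = floor(0.75 * nrow(iris)))
train <- iris[train_rows, ]
test  <- iris[-train_rows, ]

# Classification tree; the control values are illustrative, not tuned
model_dtree <- rpart(Species ~ ., data = train, method = "class",
                     control = rpart.control(minsplit = 20, minbucket = 7,
                                             maxdepth = 4))

# Predicted classes for the held-out rows
predict_dtree <- predict(model_dtree, test, type = "class")

# Confusion matrix (rows: predicted, columns: actual) and overall accuracy
confusion <- table(Predicted = predict_dtree, Actual = test$Species)
accuracy  <- sum(diag(confusion)) / sum(confusion)
```

Changing minsplit, minbucket or maxdepth and re-checking accuracy on the test rows is the tuning loop described above.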
33. • Linear Regression :
• model_lm <- lm(target_variable~., data = train_new)
• Visualizing Linear Regression Model :
• par(mfrow=c(2,2)) ; plot(model_lm)
• Predicting Linear Regression Outcomes :
• predict_lm <- predict(model_lm, test_new, type = "response")
• Checking Root Mean Square Error (RMSE) :
• rmse(actual_data$target_variable, predict_lm) : Lower the RMSE, better the model. [Metrics Package]
• Logistic Regression : [Used for Classification Problem]
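• model_glm <- glm(target_variable~., data = train_new, family = binomial) : family = binomial makes this a logistic model; glm() defaults to gaussian (ordinary linear regression).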
• Visualizing Logistic Regression Model :
• par(mfrow=c(2,2)) ; plot(model_glm)
• Predicting Logistic Regression Outcomes :
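• predict_glm <- predict(model_glm, test_new, type = "response") : Returns probabilities, not classes.
• Checking Accuracy & Kappa Value:
• confusionMatrix(predicted_class, actual_data$target_variable) : predicted_class is predict_glm converted to class labels (e.g., with a 0.5 cutoff) as a factor with the same levels as the target; predictions go first, the reference second. Higher the Kappa, better the model.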
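Both regression models can be tried with base R alone. The sketch below assumes mtcars as the data, mpg as the numeric target and am (0/1) as the binary target; RMSE is computed by hand and table() stands in for confusionMatrix(), so no extra packages are required.

```r
set.seed(123)
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Linear regression: predict mpg from weight and horsepower
model_lm   <- lm(mpg ~ wt + hp, data = train)
predict_lm <- predict(model_lm, test)
rmse_lm    <- sqrt(mean((test$mpg - predict_lm)^2))   # lower is better

# Logistic regression: family = binomial is what makes glm() logistic
model_glm   <- glm(am ~ wt, data = train, family = binomial)
predict_glm <- predict(model_glm, test, type = "response")  # probabilities

# Convert probabilities to classes with a 0.5 cutoff, then tabulate
predicted_class <- factor(ifelse(predict_glm > 0.5, 1, 0), levels = c(0, 1))
confusion <- table(Predicted = predicted_class,
                   Actual = factor(test$am, levels = c(0, 1)))
```

The 0.5 cutoff is a common default, not a requirement; other thresholds trade off the two kinds of error differently.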
34. • k-NN Algorithm :
• k-NN model runs only for numerical variables, therefore we remove all categorical columns while
building the model.
• model.knn <- knn(train = training_set[,-5], test = test_set[,-5], cl = training_set[,5], k = 10)
• 5th column of Iris dataset being a factor type is excluded while building model.
• Tuning k-NN :
• summary(tune.knn(x=training_set[,-5], y=training_set[,5], k=1:20))
• k-Means Clustering :
• Forming Cluster :
• sepalclusters <- kmeans(iris[,1:2],3,nstart = 20)
• table(sepalclusters$cluster, iris$Species)
• Visualizing Clusters via Animation :
• kmeans.ani(iris[,1:2],centers = 3)
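The k-NN and k-means commands above can be run as one short script. A sketch assuming the class package (bundled with standard R installations) for knn(); the 75% split and the accuracy check are illustrative additions.

```r
library(class)   # provides knn()

set.seed(123)
train_rows   <- sample(nrow(iris), size = floor(0.75 * nrow(iris)))
training_set <- iris[train_rows, ]
test_set     <- iris[-train_rows, ]

# k-NN: numeric columns only; the 5th column (Species) supplies the labels
model.knn <- knn(train = training_set[, -5], test = test_set[, -5],
                 cl = training_set[, 5], k = 10)
knn_accuracy <- mean(model.knn == test_set[, 5])

# k-means: unsupervised, so Species is used only to inspect the clusters
sepalclusters <- kmeans(iris[, 1:2], centers = 3, nstart = 20)
table(sepalclusters$cluster, iris$Species)
```

k-NN predicts labels for the test rows directly, while k-means only assigns cluster numbers, which need not match the species order.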