R Programming
Presentation By :
Aman Bhalla
(+91 8700246920)
(amanbhalla017@gmail.com)
AGENDA
• Introduction to R
• Packages Covered
• Datasets Covered
• Basics of R
• Looping in R
• Data Analysis in R
• Machine Learning Algorithms
Introduction to R
• A programming language and free software environment for statistical computing.
• Widely used among statisticians and data miners for developing statistical software and for data analysis; RStudio is its most popular graphical front end.
• Highly extensible through user-submitted packages for specific functions.
[Figure: RStudio window with four panes: R Script, R Console, Global Environment, and the Plots, Packages & Help tab]
R differs from RStudio. One can use R without RStudio, but cannot use RStudio without R, so R comes first.
Shortcuts Used
• Ctrl + L : Clears R Console.
• Alt + - : Inserts the assignment operator (<-).
• Ctrl + Shift + M : Inserts the pipe operator (%>%).
• Ctrl + Shift + N : Opens a new R Script.
• Ctrl + O : Opens an existing R Script.
• Ctrl + S : Saves the current R Script.
• Ctrl + Q : Quits the current R Session.
Packages Covered
• Amelia : A program for missing data.
• animation : A Gallery of animations in Statistics & Utilities to create animations.
• car : Companion to Applied Regression.
• caret : Classification and Regression Training.
• caTools : Moving window statistics, GIF, Base64, ROC AUC etc.
• class : Functions for Classification.
• corrplot : Visualization of a Correlation Matrix.
• cowplot : Streamlined Plot Theme and Plot Annotations for ‘ggplot2’.
• dplyr : A Grammar of Data Manipulation.
• e1071 : Misc Functions of the Department of Statistics, Probability Theory Group, TU Wien.
• ggplot2 : Create Elegant Data Visualisations using the Grammar of Graphics.
• ggplot2movies : Movies Data
• ggrepel : Automatically position non-overlapping text labels with ggplot2.
• hflights : Flights that departed Houston in 2011.
• leaflet : Create Interactive Web Maps with the JavaScript ‘Leaflet’ library.
• magrittr : A Forward-Pipe Operator for R.
• Metrics : Evaluation Metrics for Machine Learning.
• mlr : Machine Learning in R.
• partykit : A Toolkit for Recursive Partytioning.
• plyr : Tools for Splitting, Applying and Combining Data.
• randomForest : Breiman & Cutler’s Random Forests for classification & regression.
• rpart : Recursive Partitioning and Regression Trees.
• scales : Scale Functions for Visualization.
• tibble : Simple Data Frames.
• tidyr : Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions.
• VIM : Visualization and Imputation of Missing values.
Datasets Covered
• Built-in datasets such as iris, mtcars and state.x77 are used extensively, along with the following datasets :
1. Bike Data : A 121 x 9 dataset to understand basic indexing in R.
2. Future 500 : A 500 x 10 dataset to understand data visualization.
3. Movies Data : A 58788 x 24 dataset that ships with the ggplot2movies package.
4. Weather (Australia) : A 144187 x 24 dataset to explore tools of the dplyr package.
5. Flights Data : A 227496 x 21 dataset that ships with the hflights package.
6. Meteorology Data : Weather datasets of 4 cities in the US to understand looping in R.
7. Big Mart Sales Data : A dataset to understand Exploratory Data Analysis in R.
8. Basketball, Wine Quality Data : Datasets to understand Decision Tree mechanism.
9. Titanic, Home Credit Data : Datasets downloaded from Kaggle to understand ML Basics.
10. Income Data : A dataset to further practice Supervised Machine Learning Algorithms.
11. Social Network Ads Data : A 400 x 5 dataset to understand k-NN Algorithm.
12. Uber Data : A dataset showing Month-wise details of customers boarding cabs & their location.
Basics
• weekend <- c("Sat", "Sun") : Saves a character vector named weekend, containing the 2 values Sat and Sun, in the environment.
• data <- read.csv(file.choose(), header = T, sep = ",", na.strings = c(" ")) : Standard code for reading a CSV file in R.
• class(data) : States the class of the dataset.
• str(data) : Displays the structure of the data, including the class of every variable.
• summary(data) : Shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each numeric variable.
• df <- as.data.frame(data) : Stores data into a data frame.
• matx <- as.matrix(data) : Stores data into a matrix.
• getwd() : Displays the working directory.
• rownames(data) : Shows all the row names of the data.
• colnames(data) : Shows all the column names of the data.
• nrow(data) : Shows the count of number of rows in the dataset.
• ncol(data) : Counts the number of columns in the dataset.
• length(data) : Shows the length of the dataset (for a data frame, the number of variables).
• install.packages("packagename") : Installs a package in R.
• library(packagename) : Loads an installed package so its functions can be used.
• names(data) : Displays all the variables of the dataset.
• dim(data) : Shows the dimensions of the data (No. of rows & columns).
• sum(is.na(data)) : Shows the total NA values present in the dataset.
• attach(df) : Attaching a dataset named df in R.
• detach(df) : Detaching a dataset named df.
• head(data) : Prints first 6 rows of the dataset.
• tail(data) : Prints last 6 rows of the dataset.
• print(tibble, n=20) : Prints the first 20 rows of a tibble.
• data$variable <- as.character(data$variable) : Converts the variable to character type.
• data[1,] : Prints 1st row of the dataset.
• data[,3] : Prints 3rd column of the dataset.
• table(data$variable) : Tabular view of the variable (Frequency Distribution).
• round(mean(data$variable, na.rm = T), 2) : Rounds the mean of the variable (ignoring NA values) to 2 decimals.
• sort(prop.table(table(data$variable)), decreasing = T) : Proportion of each value of the variable, in descending order.
• complete.cases(variable) or !is.na(variable) : Logical output (TRUE for non-NA values, FALSE for NA values).
• list1 <- list(mtcars, iris) : Stores a list named list1, containing the 2 datasets, in the global environment.
• which(iris$Species == "setosa") : Displays the row numbers where the condition is satisfied.
• rm(list = ls()) : Clears all stored datasets and variables from the global environment.
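The commands above can be tried end to end on a built-in dataset. The following is a minimal sketch in base R; the two NA values are injected purely for illustration (the original iris data has none), and the variable names dims, na_total, prop and setosa_rows are hypothetical.

```r
# Built-in iris data: 150 rows, 5 columns (4 numeric, 1 factor).
data <- iris

# Assumption for illustration only: inject two missing values so the
# NA helpers have something to find.
data$Sepal.Length[c(3, 10)] <- NA

dims <- dim(data)                                   # number of rows and columns
na_total <- sum(is.na(data))                        # total NA values in the dataset
mean_sl <- round(mean(data$Sepal.Length, na.rm = TRUE), 2)  # mean ignoring NAs

# Frequency table first, then proportions, largest share first.
freq <- table(data$Species)
prop <- sort(prop.table(freq), decreasing = TRUE)

# Row numbers where a condition holds.
setosa_rows <- which(data$Species == "setosa")

print(dims)
print(na_total)
print(mean_sl)
print(prop)
```

Note that prop.table() is applied to the output of table(), not to the raw column, so the result is the share of rows per category.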
Looping
A <- c("what","is","truth")
• COMMAND :
if("Truth" %in% A){
  print("Truth is found")
}else{
  print("Not Found")
}
• OUTPUT :
[1] "Not Found"
(%in% is case-sensitive, so "Truth" does not match "truth".)
A <- c(1:10)
• COMMAND :
for(i in 1:3){
  print(A)
}
• OUTPUT :
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10
 [1]  1  2  3  4  5  6  7  8  9 10
apply(mtcars, 1, mean)
• OUTPUT :
Displays the mean across all variables for each row.
lapply(weather, t) [weather is a list holding the Meteorology datasets, dataset 6 under Datasets Covered]
• OUTPUT :
Displays the transpose of each dataset, returned as a list.
lapply(weather, "[", 1) : Accesses the 1st column of every dataset in the list.
• Creating a Function :
missingvalue <- function(x){
  return(sum(is.na(x)))
}
sapply(weather, missingvalue)
• OUTPUT :
Displays the count of missing values in each dataset as a named vector.
• Alternative Code for the same task :
sapply(weather, function(x) sum(is.na(x)))
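The apply family above can be run without the Meteorology files. In this sketch, weather is a hypothetical stand-in list built from two built-in datasets (mtcars and airquality), not the data used in the slides.

```r
# Assumption: a stand-in for the slides' `weather` list, built from
# two built-in data frames so the example is self-contained.
weather <- list(mtcars = mtcars, airquality = airquality)

# apply(): row-wise mean over a single data frame (MARGIN = 1 means rows).
row_means <- apply(mtcars, 1, mean)

# lapply(): returns a list; here the 1st column of every dataset.
first_cols <- lapply(weather, "[", 1)

# sapply(): like lapply(), but simplifies the result to a named vector.
missingvalue <- function(x) {
  sum(is.na(x))
}
missing_counts <- sapply(weather, missingvalue)

print(head(row_means, 3))
print(missing_counts)
```

lapply() always returns a list, while sapply() simplifies to a vector when every result has length 1, which is why the missing-value counts come back as a named vector.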
Data Analysis
1. The first step in tackling a typical business problem is to form a hypothesis; Exploratory Data Analysis then runs through Step 6.
2. Receive the data - Extract all the variables/features from it.
3. Understand the data in detail - Find patterns, outliers, anomalies & missing values.
4. Perform Univariate Analysis - Single-variable analysis.
5. Conduct Multivariate Analysis - Categorical vs. categorical, numerical vs. categorical, numerical vs. numerical variables.
6. Impute missing values and treat outliers & anomalies (if any).
7. Apply Feature Engineering - Variable transformation.
8. Scale the data - Bring the variables onto a comparable scale (for example, mean 0 and standard deviation 1).
9. Apply an ML algorithm, make predictions & test the accuracy.
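Steps 4, 6 and 8 can be illustrated on a single numeric variable. This is a minimal base R sketch, assuming median imputation and standardisation (z-scores) as one common choice for each step; the Ozone column of the built-in airquality data is used only because it contains missing values.

```r
# A numeric variable with missing values (built-in airquality data).
x <- airquality$Ozone

# Step 4: univariate analysis of a single variable.
print(summary(x))

# Step 6: impute missing values with the median of the observed values.
x_imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Step 8: standardise to mean 0 and standard deviation 1.
# scale() returns a one-column matrix, so convert back to a plain vector.
x_scaled <- as.numeric(scale(x_imputed))

print(round(c(mean = mean(x_scaled), sd = sd(x_scaled)), 4))
```

Standardising changes the location and spread of a variable but not the shape of its distribution, so a skewed variable stays skewed after this step.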
• Pipe Operator :
%>% : Passes the object on its left-hand side as the first argument of the function on its right-hand side.
E.g. iris %>% names() : Returns all the variable names of the iris dataset.
• Tibble :
iris <- as_tibble(iris) : Converts the data to the tibble class, which is easier to examine than a plain data frame (tbl_df() is the older, deprecated spelling).
• Select() :
iris %>% select(Sepal.Length, Petal.Length, Species) : Select columns by name
• Filter() :
filter(mtcars, cyl>4 & gear>4) : Extract rows that meet logical criteria.
• Rename() :
iris <- iris %>% rename(Sepal_Length = Sepal.Length) : Renaming a Variable.
• Select_If() :
iris.num <- iris %>% select_if(is.numeric) : Keeps only the columns that satisfy a condition (here, the numeric columns).
• Match Operator ( %in% ) :
8 %in% c(1,2,9,5,3,6,7,4,5) : OUTPUT : FALSE
iris %>% filter(Species %in% c("setosa", "virginica")) %>% summary() : Summarises the data where
the condition is TRUE.
• Helper Functions :
select(iris, contains(".")) : Select columns whose name contains a character string.
select(iris, starts_with("Sepal")) : Select columns whose name starts with a character string.
select(iris, ends_with("Length")) : Select columns whose name ends with a character string.
select(iris, everything()) : Select every column.
select(iris, Sepal.Length:Petal.Width) : Select all columns between Sepal.Length and Petal.Width
(inclusive).
select(iris, -Species) : Select all columns except Species.
iris <- iris %>% select(Sepal_Length = Sepal.Length) : Renames a variable while selecting it; unlike rename(), only the selected column is kept.
• Arrange() :
hflights %>% arrange(DepDelay) : Arranges the data in ascending order of DepDelay.
hflights %>% arrange(desc(DepDelay)) : Arranges the data in descending order of DepDelay.
• Group_By() :
iris %>% group_by(Species) : Groups rows that share the same value of Species.
• Summarise() :
iris %>% group_by(Species) %>% summarise(Count=n()) : Summarises the count of Species in
data.
hflights %>% group_by(Month) %>% summarise(dist = sum(Distance)) %>% summarise(
Minimum_Distance = min(dist), Maximum_Distance = max(dist), Mean_Distance = mean(dist),
Median_Distance = median(dist)) : Summarises the min., max., mean and median of total monthly
distances.
• Tally() :
hflights %>% group_by(Month) %>% tally(sort = T) : Counts the rows in each group and sorts the groups by descending frequency.
• Mutate() :
mutate(iris, Sepal = Sepal.Length + Sepal.Width) : Compute and append one or more new columns.
• Transmute() :
transmute(iris, sepal = Sepal.Length + Sepal.Width) : Computes one or more new columns and drops the original columns.
• Slice() :
slice(hflights, 100:106) : Slice rows by position.
• Distinct() :
iris %>% select(Species) %>% distinct() : Removes duplicates.
• If_Else() :
df <- data.frame(x=c(1,NA,6,5))
df <- df %>% mutate(New_Variable = if_else(x<5, x+1, x+2, 0)) : The last argument (missing) is the value used where x is NA.
• Union() :
union(y, z) : union(mtcars[1:16,], mtcars[17:32,]) : Rows that appear in either or both y and z.
• Intersect() :
intersect(y, z) : intersect(mtcars[1:16,], mtcars[16:32,]) : Rows that appear in both y and z.
• Between() :
hflights %>% filter(between(DepTime,600,605)) : Displays the rows whose DepTime lies between the specified values (both ends inclusive).
• Count() :
hflights %>% count(Month, sort = T) : A shortcut that does grouping as well as creating a frequency
table.
• Bind_Rows() :
bind_rows(y, z) : bind_rows(mtcars[1:16,], mtcars[17:32,]) : Append z to y as new rows.
• User Input :
number <- as.integer(readline(prompt = "Enter the number: ")) : prompt is an argument of readline(), not a function.
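Several of the dplyr verbs above are usually chained with the pipe. The sketch below assumes the dplyr package is installed; the derived column power_to_weight and the result name by_cyl are hypothetical names chosen for illustration.

```r
library(dplyr)

# Chain several verbs: each step receives the previous result
# as its first argument via %>%.
by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep rows meeting a condition
  mutate(power_to_weight = hp / wt) %>%      # append a new column
  group_by(cyl) %>%                          # one group per cylinder count
  summarise(Count = n(),                     # rows per group
            Mean_Mpg = mean(mpg),
            Mean_PW = mean(power_to_weight)) %>%
  arrange(desc(Mean_Mpg))                    # highest mean mpg first

print(by_cyl)
```

Reading the chain top to bottom gives the order of operations, which is the main readability argument for the pipe over nested function calls.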
• Scatter Plot (Along with Smooth Line):
ggplot(mtcars, aes(x=mpg, y=hp)) + geom_point(color='red', size=5, shape=7, alpha=0.5) +
labs(title="Scatter-Plot", subtitle="Mpg Vs. Hp", x="mpg", y="hp") +
geom_smooth(fill = NA, size = 1.5, method = lm, color= "blue") + theme_bw()
• Box Plot :
ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_boxplot(aes(fill=factor(cyl)), alpha = 0.75) +
scale_fill_discrete(name = "Cyl") + labs(title = "Box-Plot", x="Cyl", y="Mpg") + theme_classic()
• Bar Plot :
iris %>% group_by(Species) %>% summarise(Count=n()) %>% ggplot(aes(Species, Count)) +
geom_bar(stat="identity", fill = "green") + labs(title = "Bar-Plot", x="Species", y="Count") +
geom_label(aes(Species, Count, label = Count)) + theme(axis.text.x = element_text(angle=45, hjust = 1))
• Histogram :
ggplot(movies, aes(rating)) + geom_histogram(aes(fill= ..count..), binwidth = 0.1) + ggtitle("Histogram") +
xlab("Ratings") + ylab("Count") + theme_minimal()
• Density Plot :
• ggplot(movies, aes(rating)) + geom_density(color = "red") + labs(title="Density Plot",
subtitle="Movies Data", x="Ratings", y="Density", caption="Source : Movies Data") + theme_grey()
• Heat Map (Bin2d Map) :
• ggplot(movies, aes(x=year, y=rating)) + geom_bin2d() + labs(title = "Heat Map", subtitle = "Year
Vs. Rating", x="Year", y="Rating", caption="Source : Movies Data") + theme_classic()
• Violin Plot :
• iris %>% ggplot(aes(Species, Sepal.Length)) + geom_violin(fill = "red", alpha = 0.75) + labs(title =
"Violin-Plot", subtitle = "Species Vs. Sepal.Length", caption = "Source : Iris Data")
• Correlation Plot :
• corrplot(cor(mtcars),method = "circle") : Works with data containing all numeric variables only.
• Cowplot :
• plots <- plot_grid(A, B, nrow = 1) : Places the stored plots A and B side by side (A, B and C are 3 stored plots).
• plot_grid(plots, C, ncol = 1) : Stacks C below them, so all 3 plots are drawn together.
• GG Repel :
• ggplot(mtcars, aes(x=mpg, y=hp)) + geom_point(color='red', size=5) +
geom_text_repel(aes(label = rownames(mtcars)), color='blue') + theme_minimal()
• Faceting & Flipping :
• Facet divides a plot into sub-plots.
• ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))+ geom_point(aes(color=Species, size=Species))
+ facet_wrap(~Species)
• ggplot(mtcars, aes(x = mpg, y = hp)) + geom_point(aes(color=factor(cyl)), size=5) +
facet_grid(.~cyl) + coord_flip() : A constant size is set outside aes(); coord_flip() swaps the x and y axes.
• Tidyr :
• future500$Revenue <- gsub("$", "=", future500$Revenue, fixed = TRUE) : Replaces the $ symbol with the = symbol in all the Revenue entries (a base R function; fixed = TRUE is needed because $ is a regex metacharacter).
• future500 <- separate(future500, State, into=c("State", "City"), sep = ",") : Splits "State" into 2 separate columns.
• Leaflet :
• leaflet() %>% addTiles() %>% addMarkers(lat = 40.7223, lng = -73.9887)
• OUTPUT : Displays a map view of the location at the latitude and longitude entered.
• VIM :
• aggr(future500) : Calculate or plot the amount of missing/imputed values in each variable of the
dataset.
• Substr() Command :
• substr(iris$Species, 1, 3) : Extracts the first 3 characters of each value of the Species variable.
• Ifelse() Command :
• iris$Species_Num <- ifelse(iris$Species=="setosa", 1, ifelse(iris$Species=="versicolor", 2, 3))
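• Revalue() Command : [plyr Package]
• iris$Species <- revalue(iris$Species, c("setosa"=1,"versicolor"=2,"virginica"=3))
• Recode() Command : [car Package]
• iris$Species <- recode(iris$Species, "c('setosa')='Type-1'") : This is the car syntax; write car::recode() if dplyr is also loaded, since dplyr has its own recode().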
• Writing a CSV File :
• write.csv(iris, file = "Iris Dataset.csv", row.names = F)
• Strsplit() Command :
• strsplit(future500$State, ",") : Splits each value wherever the , separator occurs. The variable must be of character type.
• Regression :
• summary(lm(target_variable~., data=dataset)) : The target variable must be numeric. The higher the Adjusted R^2, the better the model.
• Impute() :
• imputeddata <- impute(future500, classes = list(factor = imputeMode(), numeric =
imputeMedian()))
• future500 <- imputeddata$data
• Na.Roughfix : [randomForest Package]
• future500 <- na.roughfix(future500) : Imputes missing values with the median (numeric variables) or the mode (factor variables).
• Skewness and Kurtosis : [e1071 Package]
• skewness(iris$Sepal.Length) : Gives the skewness of the numeric variable.
• kurtosis(iris$Sepal.Width) : Gives the kurtosis of the numeric variable.
• Splitting Iris Data : [caret Package]
index = createDataPartition(iris$Species, p=0.5, list = F)
train <- iris[index,]
test <- iris[-index,]
• Splitting Iris Data : [caTools Package]
split.data <- sample.split(iris$Species, SplitRatio = 0.75)
training_set <- subset(iris, split.data==T)
test_set <- subset(iris, split.data==F)
• Train Dataset - Data containing the predictor variables and the target variable; the model is fitted on it.
• Test Dataset - Data on which the fitted model is tested for accuracy.
• combined <- bind_rows(train, test) : Dataset on which Feature Engineering is performed.
• train_new and test_new are the 2 datasets extracted back from combined; they have the same number of rows as train and test, no missing values (except the target variable in test_new), and the additional features.
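For readers without caret or caTools, a stratified split can be approximated in base R. This is a sketch, not what createDataPartition() or sample.split() do internally; it assumes a 50% training share to mirror the caret call above.

```r
set.seed(123)  # any integer; fixes the random draw so the split is reproducible

# Sample 50% of the row numbers within each Species level, so the
# class proportions are preserved in both parts.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)
index <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = 0.5 * length(rows))
}))

train <- iris[index, ]    # rows used to fit the model
test  <- iris[-index, ]   # held-out rows used to check accuracy

print(table(train$Species))
print(table(test$Species))
```

Sampling within each class keeps rare classes from being left out of either part, which a plain sample() over all rows does not guarantee.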
Machine Learning Algorithms
• Setting Seed :
• set.seed(123) : Makes random sampling reproducible; any integer can be used as the seed.
This line is run every time before fitting a model or drawing a sample.
• Decision Tree :
• model_dtree <- rpart(target_variable~., data = train_new, method = "class", control =
rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4)) [method can be "class" or "anova"]
• Visualizing Decision Tree :
• plot(as.party(model_dtree))
• Predicting Decision Tree Outcomes :
• predict_dtree <- predict(model_dtree, test_new, type = "class")
• Creating Confusion Matrix :
• DtreeCM <- confusionMatrix(predict_dtree, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeCM$overall[1]))
• Altering the 3 parameters minsplit, minbucket & maxdepth of a Decision Tree is called Hyperparameter Tuning.
• Random Forest :
• model_rf <- randomForest(factor(target_variable) ~., data = train_new)
• Predicting Random Forest Outcomes :
• predict_rf <- predict(model_rf, test_new, type = "response")
• Creating Confusion Matrix :
• DtreeRF <- confusionMatrix(predict_rf, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeRF$overall[1]))
• C Forest :
• model_cf <- cforest(as.factor(target_variable)~., data = train_new)
• Predicting C Forest Outcomes :
• predict_cf <- predict(model_cf, test_new, type = "response", OOB = T)
• Creating Confusion Matrix :
• DtreeCF <- confusionMatrix(predict_cf, actual_data$target_variable)
• Checking Accuracy :
• percent(as.numeric(DtreeCF$overall[1]))
• Linear Regression :
• model_lm <- lm(target_variable~., data = train_new)
• Visualizing Linear Regression Model :
• par(mfrow=c(2,2)) ; plot(model_lm)
• Predicting Linear Regression Outcomes :
• predict_lm <- predict(model_lm, test_new, type = "response")
• Checking Root Mean Square Error (RMSE) :
• rmse(actual_data$target_variable, predict_lm) : The lower the RMSE, the better the model.
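• Logistic Regression : [Used for Classification Problem]
• model_glm <- glm(target_variable~., data = train_new, family = binomial) : family = binomial makes glm() fit a logistic model; the default family is gaussian.
• Visualizing Logistic Regression Model :
• par(mfrow=c(2,2)) ; plot(model_glm)
• Predicting Logistic Regression Outcomes :
• predict_glm <- predict(model_glm, test_new, type = "response") : Returns predicted probabilities.
• Checking Accuracy & Kappa Value:
• predict_class <- factor(ifelse(predict_glm > 0.5, 1, 0)) : Converts the probabilities to class labels (assumes a 0/1 target and a 0.5 cutoff).
• confusionMatrix(predict_class, factor(actual_data$target_variable)) : Predictions come first, actual values second. Higher the Kappa, better the model.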
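The logistic regression workflow can be sketched with base R alone. The example below is illustrative only: it uses the built-in mtcars data with am (0/1 transmission type) as a hypothetical target, evaluates on the training rows for brevity, and builds the confusion matrix with table() instead of caret's confusionMatrix(). The 0.5 cutoff is an assumption.

```r
# Fit a logistic model: family = binomial is what makes glm() logistic.
model_glm <- glm(am ~ wt, data = mtcars, family = binomial)

# type = "response" returns probabilities between 0 and 1, not classes.
prob <- predict(model_glm, newdata = mtcars, type = "response")

# Convert probabilities to class labels (assumed cutoff: 0.5).
pred_class <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(mtcars$am, levels = c(0, 1))

# Confusion matrix: predicted classes in rows, actual classes in columns.
cm <- table(Predicted = pred_class, Actual = actual)
accuracy <- sum(diag(cm)) / sum(cm)   # share of correct predictions

print(cm)
print(round(accuracy, 3))
```

In practice the predictions would be made on held-out data such as test_new, since accuracy measured on the training rows tends to be optimistic.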
• k-NN Algorithm :
• k-NN is distance-based and runs only on numerical variables, so all categorical columns are removed while building the model.
• model.knn <- knn(train = training_set[,-5], test = test_set[,-5], cl = training_set[,5], k = 10)
• The 5th column of the Iris dataset (Species, a factor) is excluded from the predictors and supplied as the class labels (cl).
• Tuning k-NN :
• summary(tune.knn(x=training_set[,-5], y=training_set[,5], k=1:20))
• k-Means Clustering :
• Forming Cluster :
• sepalclusters <- kmeans(iris[,1:2],3,nstart = 20)
• table(sepalclusters$cluster, iris$Species)
• Visualizing Clusters via Animation :
• kmeans.ani(iris[,1:2],centers = 3)
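The k-means commands above can be run with base R only, since kmeans() is part of the stats package. The sketch below adds a seed and a simple "elbow" check for the number of clusters; the elbow step and the names tab and wss are additions for illustration, not part of the slides.

```r
set.seed(123)  # k-means starts from random centres, so fix the seed

# Cluster on the two sepal measurements; nstart = 20 tries 20 random
# starts and keeps the best solution.
sepalclusters <- kmeans(iris[, 1:2], centers = 3, nstart = 20)

# Cross-tabulate cluster labels against the known species.
tab <- table(Cluster = sepalclusters$cluster, Species = iris$Species)
print(tab)
print(sepalclusters$centers)

# Elbow check: total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k) {
  kmeans(iris[, 1:2], centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 2))
```

Cluster numbers are arbitrary labels, so a cluster is matched to a species by looking at which column dominates its row in the table.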

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

R programming & Machine Learning

  • 1. R Programming Presentation By : Aman Bhalla (+91 8700246920) (amanbhalla017@gmail.com)
  • 2. AGENDA • Introduction to R • Packages Covered • Datasets Covered • Basics of R • Looping in R • Data Analysis in R • Machine Learning Algorithms
  • 3. Introduction to R • A programming language and free software environment for statistical computing. • Widely used among statisticians and data miners for developing statistical software and data analysis; RStudio is its most popular graphical user interface (GUI). • Highly extensible through user-submitted packages for specific functions.
  • 4. R Script R Console Global Environment R differs from RStudio. One can use R without using RStudio, but can't use RStudio without using R, so R comes first. Plots, Packages & Help Tab
  • 5. Shortcuts Used (RStudio) • Ctrl + L : Clears the R Console. • Alt + - : Inserts the assignment operator (<-). • Ctrl + Shift + M : Inserts the pipe operator (%>%). • Ctrl + Shift + N : Opens a new R Script. • Ctrl + O : Opens an existing R Script. • Ctrl + S : Saves the current R Script. • Ctrl + Q : Quits the current R session.
  • 7. • Amelia : A program for missing data. • animation : A Gallery of animations in Statistics & Utilities to create animations. • car : Companion to Applied Regression. • caret : Classification and Regression Training. • caTools : Tools : Moving window statistics, GIF, Base64, ROC AUC etc. • class : Functions for Classification. • corrplot : Visualization of a Correlation Matrix. • cowplot : Streamlined Plot Theme and Plot Annotations for ‘ggplot2’. • dplyr : A Grammar of Data Manipulation. • e1071 : Misc Functions of the Department of Statistics, Probability Theory. • ggplot2 : Create Elegant Data Visualisations using the Grammar of Graphics. • ggplot2movies : Movies Data • ggrepel : Automatically position non-overlapping text labels with ggplot2.
  • 8. • hflights : Flights that departed Houston in 2011. • leaflet : Create Interactive Web Maps with the JavaScript ‘Leaflet’ library. • magrittr : A Forward-Pipe Operator for R. • Metrics : Evaluation Metrics for Machine Learning. • mlr : Machine Learning in R. • partykit : A Toolkit for Recursive Partytioning. • plyr : Tools for Splitting, Applying and Combining Data. • randomForest : Breiman & Cutler’s Random Forests for classification & regression. • rpart : Recursive Partitioning and Regression Trees. • scales : Scale Functions for Visualization. • tibble : Simple Data Frames. • tidyr : Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions. • VIM : Visualization and Imputation of Missing values.
  • 10. • Built-in datasets like iris, mtcars, state.x77 are also used extensively apart from the following datasets : 1. Bike Data : A 121 x 9 dataset to understand basic indexing in R. 2. Future 500 : A 500 x 10 dataset to understand data visualization. 3. Movies Data : A 58788 x 24 dataset that comes under ggplot2movies package. 4. Weather (Australia) : A 144187 x 24 dataset to explore tools of dplyr package. 5. Flights Data : A 227496 x 21 dataset that comes under hflights package. 6. Meteorology Data : Weather datasets of 4 cities in US to understand looping in R. 7. Big Mart Sales Data : A dataset to understand Exploratory Data Analysis in R. 8. Basketball, Wine Quality Data : Datasets to understand Decision Tree mechanism. 9. Titanic, Home Credit Data : Datasets downloaded from Kaggle to understand ML Basics. 10. Income Data : A dataset to further practice Supervised Machine Learning Algorithms. 11. Social Network Ads Data : A 400 x 5 dataset to understand k-NN Algorithm. 12. Uber Data : A dataset showing month-wise details of customers boarding cabs & their location.
  • 12. • weekend <- c("Sat", "Sun") : Saves a variable named weekend in the environment containing 2 character values : Sat, Sun • data <- read.csv(file.choose(), header = T, sep = ",", na.strings = c(" ")) : Standard code for reading a csv file in R. • class(data) : States the class type of the dataset. • str(data) : Displays the structure/class of all the variables within the data. • summary(data) : Shows the minimum, maximum, mean, median, 1st & 3rd quartiles of each numeric variable. • df <- as.data.frame(data) : Stores data into a data frame. • matx <- as.matrix(data) : Stores data into a matrix. • getwd() : Displays the working directory. • rownames(data) : Shows all the row names of the data. • colnames(data) : Shows all the column names of the data. • nrow(data) : Counts the number of rows in the dataset. • ncol(data) : Counts the number of columns in the dataset.
  • 13. • length(data) : Shows the length (variables count) of the dataset. • install.packages("packagename") : Installs a package in R. • library(package) : Loads a package so that its functions can be used. • names(data) : Displays all the variables of the dataset. • dim(data) : Shows the dimensions of the data (No. of rows & columns). • sum(is.na(data)) : Shows the total NA values present in the dataset. • attach(df) : Attaching a dataset named df in R. • detach(df) : Detaching a dataset named df. • head(data) : Prints first 6 rows of the dataset. • tail(data) : Prints last 6 rows of the dataset. • print(tibble, n=20) : Prints first 20 rows of the dataset.
  • 14. • data$variable <- as.character(data$variable) : Saving the variable as character type. • data[1,] : Prints 1st row of the dataset. • data[,3] : Prints 3rd column of the dataset. • table(data$variable) : Tabular view of the variable (Frequency Distribution). • round(mean(data$variable, na.rm = T), 2) : Rounds off the mean of variable (removing NA values) up to 2 decimals. • sort(prop.table(table(data$variable)), decreasing = T) : Tabular view of the proportions of each value of the variable in descending order (prop.table() needs the frequency table, not the raw variable). • complete.cases(variable) : ! is.na(variable) : Boolean output (TRUE/FALSE) of non-NA values (shows TRUE for non-NA values). • list1 <- list(mtcars, iris) : Stores a variable named list1 containing a list of the 2 mentioned datasets in the global environment. • which(iris$Species == "setosa") : Displays all the row numbers where the given criterion is satisfied. • rm(list = ls()) : Clears all the stored datasets and variables from the global environment.
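The inspection commands from the Basics slides can be tried end to end on a built-in dataset. The sketch below is only an illustration: it assumes the built-in `airquality` data frame as a stand-in for a file read with `read.csv()`, and the variable names `total_na`, `na_by_col`, `mean_ozone`, `month_counts` and `month_props` are made up for this example.

```r
# Built-in data frame with some NA values, used as a stand-in for a CSV file.
data <- airquality

class(data)      # class of the object
dim(data)        # number of rows and columns
str(data)        # structure of every variable
summary(data)    # min, quartiles, median, mean, max per numeric column
head(data)       # first 6 rows

# Count missing values overall and per column.
total_na  <- sum(is.na(data))
na_by_col <- colSums(is.na(data))

# Mean of one variable, ignoring NA values, rounded to 2 decimals.
mean_ozone <- round(mean(data$Ozone, na.rm = TRUE), 2)

# Frequency table of a variable, then its proportions in descending order.
month_counts <- table(data$Month)
month_props  <- sort(prop.table(month_counts), decreasing = TRUE)
print(month_props)
```

Note that `prop.table()` is applied to the frequency table rather than to the raw column, which is what turns counts into shares that sum to 1.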
  • 16. A <- c("what","is","truth") • COMMAND : if("Truth" %in% A){ print("Truth is found") }else{ print("Not Found") } • OUTPUT : Not Found A <- c(1:10) • COMMAND : for(i in 1:3){ print(A) } • OUTPUT : 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 apply(mtcars, 1, mean) • OUTPUT : Displays mean of all variables row-wise. lapply(weather, t) [6th dataset (9th Slide)] • OUTPUT : Displays transpose of data in the form of a list. lapply(weather, "[",1) : Access 1st column of all datasets in the list. • Creating a Function : missingvalue <- function(x){ return(sum(is.na(x))) } sapply(weather, missingvalue) • OUTPUT : Displays count of missing values in the form of a table. • Alternative Code for the same task : sapply(weather, function(x) sum(is.na(x)))
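The looping constructs on this slide can be sketched on data that ships with R. This is a hedged example: the `weather` list from the slides is not available here, so a hypothetical list named `datasets` holding two built-in data frames takes its place.

```r
# Hypothetical stand-in for the `weather` list: two built-in data frames.
datasets <- list(airquality = airquality, mtcars = mtcars)

# %in% is case-sensitive, so "Truth" does not match "truth".
A <- c("what", "is", "truth")
found <- if ("Truth" %in% A) "Truth is found" else "Not Found"
print(found)

# A for loop that visits each data frame in the list once.
for (name in names(datasets)) {
  cat(name, "has", nrow(datasets[[name]]), "rows\n")
}

# A named helper function, applied to every data frame in the list.
missingvalue <- function(x) {
  sum(is.na(x))
}
na_counts <- sapply(datasets, missingvalue)   # returns a named vector
print(na_counts)

# lapply keeps the list structure: first column of each data frame.
first_cols <- lapply(datasets, "[", 1)
```

`sapply()` simplifies its result to a named vector when every call returns a single number, while `lapply()` always returns a list.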
  • 18. 1. The first step in dealing with a typical business problem is to form a hypothesis, then perform Exploratory Data Analysis up to Step 6. 2. Data will be received - Extract all the Variables/Features from the data. 3. Understanding the Data in Detail - Finding Patterns, Outliers, Anomalies & Missing Values. 4. Performing Univariate Analysis - Single Variable Analysis. 5. Conduct Multivariate Analysis - Categorical Vs. Categorical, Numerical Vs. Categorical, Numerical Vs. Numerical Variables. 6. Missing Values Imputation, Treatment of Outliers & Anomalies (if any). 7. Apply Feature Engineering - Variable Transformation. 8. Perform Scaling of the Data - Bringing the variables onto a comparable scale (e.g. zero mean and unit variance). 9. Apply ML Algorithm, make predictions & test the Accuracy.
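Steps 2 to 8 of this workflow can be walked through in a few lines of base R. The sketch below is a minimal illustration rather than the presenter's own code: it assumes the built-in `airquality` data as a stand-in for a business dataset, uses median imputation as one possible treatment of missing values, and the column `Log_Ozone` is a hypothetical engineered feature.

```r
df <- airquality   # built-in data, standing in for the received dataset

# Steps 2-3: list the variables and look for missing values.
names(df)
colSums(is.na(df))

# Step 4: univariate analysis of a single variable.
summary(df$Ozone)

# Step 5: relationship between two numeric variables.
cor(df$Ozone, df$Temp, use = "complete.obs")

# Step 6: impute missing values in numeric columns with the column median.
for (col in names(df)) {
  if (is.numeric(df[[col]])) {
    df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
  }
}

# Step 7: a simple variable transformation (log of a right-skewed variable).
df$Log_Ozone <- log(df$Ozone)

# Step 8: centre and scale the numeric predictors.
scaled <- scale(df[, c("Solar.R", "Wind", "Temp")])
```

After `scale()`, each selected column has mean 0 and standard deviation 1, so no single predictor dominates a distance-based or gradient-based model purely because of its units.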
  • 19. • Pipe Operator : %>% : Passes the object on the left hand side as the first argument of the function on the right hand side. E.G : iris %>% names() : States all the variable names of the iris dataset. • Tibble : iris <- as_tibble(iris) : Converts data to the tibble class, which is easier to examine than a plain data frame (the original tbl_df() is deprecated in current dplyr). • Select() : iris %>% select(Sepal.Length, Petal.Length, Species) : Select columns by name. • Filter() : filter(mtcars, cyl>4 & gear>4) : Extract rows that meet logical criteria. • Rename() : iris <- iris %>% rename(Sepal_Length = Sepal.Length) : Renaming a Variable.
  • 20. • Select_If() : iris.num <- iris %>% select_if(is.numeric) : Extracts the columns that satisfy a condition. • Match Operator ( %in% ) : 8 %in% c(1,2,9,5,3,6,7,4,5) : OUTPUT : FALSE iris %>% filter(Species %in% c("setosa", "virginica")) %>% summary() : Summarises the data where the condition is TRUE. • Helper Functions : select(iris, contains(".")) : Select columns whose name contains a character string. select(iris, starts_with("Sepal")) : Select columns whose name starts with a character string. select(iris, ends_with("Length")) : Select columns whose name ends with a character string. select(iris, everything()) : Select every column. select(iris, Sepal.Length:Petal.Width) : Select all columns between Sepal.Length and Petal.Width (inclusive). select(iris, -Species) : Select all columns except Species. iris <- iris %>% select(Sepal_Length = Sepal.Length) : Renames the variable while selecting it; only the selected column is kept (use rename() to keep the others).
  • 21. • Arrange() : hflights %>% arrange(DepDelay) : Arranges the data in ascending order basis DepDelay. hflights %>% arrange(desc(DepDelay)) : Arranges the data in descending order basis DepDelay. • Group_By() : iris %>% group_by(Species) : Group data into rows with the same value of Species. • Summarise() : iris %>% group_by(Species) %>% summarise(Count=n()) : Summarises the count of Species in data. hflights %>% group_by(Month) %>% summarise(dist = sum(Distance)) %>% summarise( Minimum_Distance = min(dist), Maximum_Distance = max(dist), Mean_Distance = mean(dist), Median_Distance = median(dist)) : Summarises the min., max., mean and median of total monthly distances. • Tally() : hflights %>% group_by(Month) %>% tally(sort = T) : Adds a frequency column to the table.
  • 22. • Mutate() : mutate(iris, Sepal = Sepal.Length + Sepal.Width) : Compute and append one or more new columns. • Transmute() : transmute(iris, sepal = Sepal.Length + Sepal.Width) : Compute one or more new columns. Drop original columns. • Slice() : slice(hflights, 100:106) : Slice rows by position. • Distinct() : iris %>% select(Species) %>% distinct() : Removes duplicates. • If_Else() : df <- data.frame(x=c(1,NA,6,5)) df <- df %>% mutate(New_Variable = if_else(x<5, x+1, x+2, 0)) : Last argument is to replace missing values (NA).
  • 23. • Union() : union(y, z) : union(mtcars[1:16,], mtcars[17:32,]) : Rows that appear in either or both y and z. • Intersect() : intersect(y, z) : intersect(mtcars[1:16,], mtcars[16:32,]) : Rows that appear in both y and z. • Between() : hflights %>% filter(between(DepTime,600,605)) : Displays rows with variable value(DepTime) lying between specified values. • Count() : hflights %>% count(Month, sort = T) : A shortcut that does grouping as well as creating a frequency table. • Bind_Rows() : bind_rows(y, z) : bind_rows(mtcars[1:16,], mtcars[17:32,]) : Append z to y as new rows.
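The dplyr verbs from the preceding slides are usually chained into a single pipeline. The following sketch assumes the dplyr package is installed and uses only the built-in `iris` data; the column name `Sepal.Area` and the result name `iris_summary` are hypothetical, chosen for this illustration.

```r
library(dplyr)

# Keep two species, add a derived column, then summarise per species.
iris_summary <- iris %>%
  filter(Species %in% c("setosa", "virginica")) %>%      # rows matching a condition
  mutate(Sepal.Area = Sepal.Length * Sepal.Width) %>%    # hypothetical derived column
  group_by(Species) %>%                                  # one group per species
  summarise(Count = n(),
            Mean_Sepal_Area = mean(Sepal.Area)) %>%      # one row per group
  arrange(desc(Mean_Sepal_Area))                         # largest mean first

print(iris_summary)
```

Each verb receives the result of the previous one as its first argument, so the pipeline reads top to bottom in the order the operations are applied.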
  • 24. • User Input : number <- as.integer(readline(prompt = "Enter the number: ")) • Scatter Plot (Along with Smooth Line): ggplot(mtcars, aes(x=mpg, y=hp)) + geom_point(color='red', size=5, shape=7, alpha=0.5) + labs(title="Scatter-Plot", subtitle="Mpg Vs. Hp", x="mpg", y="hp") + geom_smooth(fill = NA, size = 1.5, method = lm, color= "blue") + theme_bw() • Box Plot : ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_boxplot(aes(fill=factor(cyl)), alpha = 0.75) + scale_fill_discrete(name = "Cyl") + labs(title = "Box-Plot", x="Cyl", y="Mpg") + theme_classic() • Bar Plot : iris %>% group_by(Species) %>% summarise(Count=n()) %>% ggplot(aes(Species, Count)) + geom_bar(stat="identity", fill = "green") + labs(title = "Bar-Plot", x="Species", y="Count") + geom_label(aes(Species, Count, label = Count)) + theme(axis.text.x = element_text(angle=45, hjust = 1)) • Histogram : ggplot(movies, aes(rating)) + geom_histogram(aes(fill= ..count..), binwidth = 0.1) + ggtitle("Histogram") + xlab("Ratings") + ylab("Count") + theme_minimal()
  • 25. • Density Plot : • ggplot(movies, aes(rating)) + geom_density(color = "red") + labs(title="Density Plot", subtitle="Movies Data", x="Ratings", y="Density", caption="Source : Movies Data") + theme_grey() • Heat Map (Bin2d Map) : • ggplot(movies, aes(x=year, y=rating)) + geom_bin2d() + labs(title = "Heat Map", subtitle = "Year Vs. Rating", x="Year", y="Rating", caption="Source : Movies Data") + theme_classic() • Violin Plot : • iris %>% ggplot(aes(Species, Sepal.Length)) + geom_violin(fill = "red", alpha = 0.75) + labs(title = "Violin-Plot", subtitle = "Species Vs. Sepal.Length", caption = "Source : Iris Data") • Correlation Plot : • corrplot(cor(mtcars),method = "circle") : Works with data containing all numeric variables only. • Cowplot : • plots <- plot_grid(A, B, nrow = 1) : Here, A, B and C are 3 stored plots respectively. • plot_grid(plots, C, ncol = 1) : This command will plot all 3 plots together.
  • 26. • GG Repel : • ggplot(mtcars, aes(x=mpg, y=hp)) + geom_point(color='red', size=5) + geom_text_repel(aes(label = rownames(mtcars)), color='blue') + theme_minimal() • Faceting & Flipping : • Facet divides a plot into sub-plots. • ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width))+ geom_point(aes(color=Species, size=Species)) + facet_wrap(~Species) • ggplot(mtcars, aes(x = mpg, y = hp))+geom_point(aes(color=factor(cyl), size=5))+facet_grid(.~cyl)+coord_flip() • Tidyr : • future500$Revenue <- gsub("$", "=", future500$Revenue, fixed = TRUE) : Replacing $ symbol with = symbol in all the Revenue entries (fixed = TRUE treats $ as a literal character rather than a regex anchor). • future500 <- separate(future500, State, into=c("State", "City"), sep = ",") : Separating "State" into 2 separate columns. • Leaflet : • leaflet() %>% addTiles() %>% addMarkers(lat = 40.7223, lng = -73.9887) • OUTPUT : Displays a map view of the location based on the latitude and longitude entered.
  • 27. • VIM : • aggr(future500) : Calculate or plot the amount of missing/imputed values in each variable of the dataset. • Substr() Command : • substr(iris$Species, 1, 3) : Extracts the first 3 characters of each value in the Species variable. • Ifelse() Command : • iris$Species_Num <- ifelse(iris$Species=="setosa", 1, ifelse(iris$Species=="versicolor", 2, 3)) • Revalue() Command : [plyr Package] • iris$Species <- revalue(iris$Species, c("setosa"=1,"versicolor"=2,"virginica"=3)) • Recode() Command : [car Package] • iris$Species <- recode(iris$Species, "c('setosa')='Type-1'") • Writing a CSV File : • write.csv(iris, file = "Iris Dataset.csv", row.names = F)
  • 28. • Strsplit() Command : • strsplit(future500$State, ",") :Splits the data whenever it sees , separator. Variable must be character type only. • Regression : • summary(lm(target_variable~., data=dataset)) : Target Variable must be converted to numeric class. Higher the Adjusted R^2, better the model. • Impute() : • imputeddata <- impute(future500, classes = list(factor = imputeMode(), numeric = imputeMedian())) • future500 <- imputeddata$data • Na.Roughfix : • future500<- na.roughfix(future500) : Impute missing values by Median/Mode. • Skewness and Kurtosis : • skewness(iris$Sepal.Length) : Gives skewness of the numeric variable. • kurtosis(iris$Sepal.Width) : Gives kurtosis of the numeric variable.
  • 29. • Splitting Iris Data : [caret Package] index = createDataPartition(iris$Species, p=0.5, list = F) train <- iris[index,] test <- iris[-index,] • Splitting Iris Data : [caTools Package] split.data <- sample.split(iris$Species, SplitRatio = 0.75) training_set <- subset(iris, split.data==T) test_set <- subset(iris, split.data==F) • Train Dataset – Data having predictor variables and the target variable. • Test Dataset - Model is tested over this data for accuracy. • combined<- bind_rows(train, test) : Dataset on which Feature Engineering is performed. • train_new and test_new are 2 datasets extracted from combined having same dimensions as train and test data with no missing values(except Target Variable in test_new) & additional features.
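For readers without caret or caTools installed, a stratified split in the spirit of the two snippets above can be sketched in base R. This is an assumption-laden stand-in, not the packages' own algorithm: it samples a fixed 80% of the rows within each species, and the names `idx_by_species` and `train_idx` are hypothetical.

```r
set.seed(123)  # fix the random sample so the split is reproducible

# Group the row indices by species, then sample 80% within each group.
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)
train_idx <- unlist(lapply(idx_by_species, function(idx) {
  sample(idx, size = 0.8 * length(idx))
}))

train <- iris[train_idx, ]    # rows used to fit the model
test  <- iris[-train_idx, ]   # held-out rows used to check accuracy

table(train$Species)
table(test$Species)
```

Sampling within each class keeps the class proportions identical in both parts, which is the property `createDataPartition()` and `sample.split()` are designed to preserve.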
  • 31. • Setting Seed : • set.seed(123) : Makes the selected sample reproducible; any number can be used. This line is written every time before running a model or before sampling. • Decision Tree : • model_dtree <- rpart(target_variable~., data = train_new, method = "class", control = rpart.control(minsplit = 60, minbucket = 30, maxdepth = 4)) [Method can be class or anova] • Visualizing Decision Tree : • plot(as.party(model_dtree)) • Predicting Decision Tree Outcomes : • predict_dtree <- predict(model_dtree, test_new, type = "class") • Creating Confusion Matrix : • DtreeCM <- confusionMatrix(predict_dtree, actual_data$target_variable) • Checking Accuracy : • percent(as.numeric(DtreeCM$overall[1])) • Altering 3 parameters : minsplit, minbucket & maxdepth of a Decision tree : Hyperparameter Tuning
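The decision tree steps above can be run as one script on the built-in `iris` data. The sketch assumes the rpart package is available (it ships with standard R installations), replaces `confusionMatrix()` with a base R `table()` so that caret is not required, and uses illustrative control values that are not tuned.

```r
library(rpart)

set.seed(123)
train_idx <- sample(seq_len(nrow(iris)), size = 105)   # 70% of the 150 rows
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a classification tree; the control values are illustrative, not tuned.
model_dtree <- rpart(Species ~ ., data = train, method = "class",
                     control = rpart.control(minsplit = 20, minbucket = 7,
                                             maxdepth = 4))

# Predict classes for the held-out rows.
predict_dtree <- predict(model_dtree, test, type = "class")

# Confusion matrix and accuracy computed with base R.
cm <- table(Predicted = predict_dtree, Actual = test$Species)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```

Changing `minsplit`, `minbucket` or `maxdepth` and re-running the script is the simplest form of the tuning mentioned on the slide.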
  • 32. • Random Forest : • model_rf <- randomForest(factor(target_variable) ~.,data = train_new, method="rf") • Predicting Random Forest Outcomes : • predict_rf <- predict(model_rf, test_new, type = "response") • Creating Confusion Matrix : • DtreeRF <- confusionMatrix(predict_rf, actual_data$target_variable) • Checking Accuracy : • percent(as.numeric(DtreeRF$overall[1])) • C Forest : • model_cf <- cforest(as.factor(target_variable)~., data = train_new) • Predicting C Forest Outcomes : • predict_cf <- predict(model_cf, test_new, type = "response", OOB = T) • Creating Confusion Matrix : • DtreeCF <- confusionMatrix(predict_cf, actual_data$target_variable) • Checking Accuracy : • percent(as.numeric(DtreeCF$overall[1]))
  • 33. • Linear Regression : • model_lm <- lm(target_variable~., data = train_new) • Visualizing Linear Regression Model : • par(mfrow=c(2,2)) ; plot(model_lm) • Predicting Linear Regression Outcomes : • predict_lm <- predict(model_lm, test_new, type = "response") • Checking Root Mean Square Error (RMSE) : rmse(actual_data$target_variable, predict_lm) : Lower the RMSE, better the model. • Logistic Regression : [Used for Classification Problem] • model_glm <- glm(target_variable~., data = train_new, family = binomial) : Without family = binomial, glm() fits an ordinary linear model. • Visualizing Logistic Regression Model : • par(mfrow=c(2,2)) ; plot(model_glm) • Predicting Logistic Regression Outcomes : • predict_glm <- predict(model_glm, test_new, type = "response") : Returns predicted probabilities. • Checking Accuracy & Kappa Value: • Convert the predicted probabilities to class labels first (e.g. with a 0.5 cutoff), then confusionMatrix(predicted_classes, actual_data$target_variable) : Higher the Kappa, better the model.
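Both regression models can be sketched with base R on the built-in `mtcars` data. This is a hedged illustration: the predictors (`wt`, `hp`), the 0.5 probability cutoff and the manual RMSE formula (used here instead of `Metrics::rmse()`) are assumptions made for the example, and the logistic model is scored on the same rows it was fitted on.

```r
# Linear regression: predict mpg from weight and horsepower.
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

model_lm   <- lm(mpg ~ wt + hp, data = train)
predict_lm <- predict(model_lm, test)
rmse_lm    <- sqrt(mean((test$mpg - predict_lm)^2))   # root mean square error

# Logistic regression: family = binomial is what makes glm() logistic.
model_glm <- glm(am ~ wt, data = mtcars, family = binomial)
prob_glm  <- predict(model_glm, mtcars, type = "response")  # probabilities
class_glm <- ifelse(prob_glm > 0.5, 1, 0)                   # assumed 0.5 cutoff

# In-sample confusion matrix and accuracy.
cm <- table(Predicted = class_glm, Actual = mtcars$am)
accuracy <- mean(class_glm == mtcars$am)
print(rmse_lm)
print(cm)
print(accuracy)
```

`predict(..., type = "response")` returns probabilities for a binomial model, which is why they are thresholded into 0/1 labels before being compared with the actual classes.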
  • 34. • k-NN Algorithm : • k-NN model runs only for numerical variables, therefore we remove all categorical columns while building the model. • model.knn <- knn(train = training_set[,-5], test = test_set[,-5], cl = training_set[,5], k = 10) • 5th column of Iris dataset being a factor type is excluded while building model. • Tuning k-NN : • summary(tune.knn(x=training_set[,-5], y=training_set[,5], k=1:20)) • k-Means Clustering : • Forming Cluster : • sepalclusters <- kmeans(iris[,1:2],3,nstart = 20) • table(sepalclusters$cluster, iris$Species) • Visualizing Clusters via Animation : • kmeans.ani(iris[,1:2],centers = 3)
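The k-NN and k-means snippets above refer to `training_set` and `test_set` from an earlier slide; the sketch below rebuilds them so it runs on its own. It assumes the class package is available (it ships with standard R installations), uses a simple random 112/38 split as a stand-in for `sample.split()`, and the name `knn_accuracy` is hypothetical.

```r
library(class)   # provides knn()

set.seed(123)
train_idx    <- sample(seq_len(nrow(iris)), size = 112)   # roughly 75% of rows
training_set <- iris[train_idx, ]
test_set     <- iris[-train_idx, ]

# k-NN uses only the numeric columns; column 5 (Species) supplies the labels.
pred_knn <- knn(train = training_set[, -5], test = test_set[, -5],
                cl = training_set[, 5], k = 10)
knn_accuracy <- mean(pred_knn == test_set[, 5])
print(knn_accuracy)

# k-means on the two sepal measurements with 3 clusters.
sepalclusters <- kmeans(iris[, 1:2], centers = 3, nstart = 20)
table(sepalclusters$cluster, iris$Species)
```

k-NN is supervised and is scored against the known species, whereas k-means never sees the labels; the final table only compares the clusters it found with the species after the fact.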