3. # Lists current objects; check all the objects created in R using the syntax below
ls()
# Removes specified objects
rm("a")
# Removes all objects
rm(list=ls())
# Import inbuilt data set in R
data(mtcars)
# View object in new screen
View(mtcars)
# Starts empty GUI spreadsheet editor for manual data entry.
x = edit(data.frame())
# Returns an object's attribute list.
attributes(mtcars)
# Returns the dimensions of vectors, arrays or dataframes
dim(mtcars)
# Lists content of current working directory
dir()
Some Key….KEY
4. R has five basic or ‘atomic’ classes of objects. Wait, what is an object?
• Everything you see or create in R is an object.
• A vector, matrix, data frame, even a variable is an object.
• R has 5 basic classes of objects. These include:
1. Character
2. Numeric (Real Numbers)
3. Integer (Whole Numbers)
4. Complex
5. Logical (True / False)
Objects also have attributes. Think of attributes as their ‘identifier’, a name or number which identifies them. An object can have the following attributes:
1. names, dimension names
2. dimensions
3. class
4. length
• Attributes of an object can be accessed using the attributes() function.
• The most basic object in R is known as a vector.
• You can create an empty vector using vector().
Remember, a vector contains objects of the same class.
For example: Let’s create vectors of different classes. We can create a vector using the c() (concatenate) command.
> a <- c(1.8, 4.5) #numeric
> b <- c(1 + 2i, 3 - 6i) #complex
> c <- c(23L, 44L) #integer (the L suffix is required; c(23, 44) would be numeric)
> d <- c(T,F,TRUE,FALSE) #logical
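The class of each vector can be checked with class(). The following is a minimal sketch; the variable names (num_vec, cplx_vec, int_vec, lgl_vec, chr_vec) are illustrative and do not come from the slides. It also shows that whole numbers typed without the L suffix are stored as numeric, not integer.

```r
num_vec  <- c(1.8, 4.5)          # numeric
cplx_vec <- c(1 + 2i, 3 - 6i)    # complex
int_vec  <- c(23L, 44L)          # integer: the L suffix is required
lgl_vec  <- c(TRUE, FALSE)       # logical
chr_vec  <- c("a", "b")          # character

# Collect the class of every vector in one character vector
classes <- c(class(num_vec), class(cplx_vec), class(int_vec),
             class(lgl_vec), class(chr_vec))
print(classes)

# Without the L suffix, whole numbers are numeric
plain_class <- class(c(23, 44))
print(plain_class)
```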
Essential of R Programming
5. List:
A list is a special type of vector which can contain elements of different data types.
For example:
> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22
[[2]]
[1] "ab"
[[3]]
[1] TRUE
[[4]]
[1] 1+2i
As you can see, the output of a list is different from that of a vector, because the elements are of different types.
The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract an element of a list by its index. Like this:
> my_list[[3]]
[1] TRUE
You can use the single bracket [] too, but that returns a list of length one containing the element, rather than the element itself. Like this:
> my_list[3]
[[1]]
[1] TRUE
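The difference between the two bracket forms is easiest to see by checking the class of what each returns. A small sketch, reusing my_list from above; the names element and sublist are illustrative.

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element <- my_list[[3]]   # the element itself
sublist <- my_list[3]     # a list of length one that wraps the element

print(class(element))
print(class(sublist))

# Unwrapping the one-element list gives back the same value
same_value <- identical(sublist[[1]], element)
print(same_value)
```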
Data Type - R
6. Vector:
• A vector contains objects of the same class.
• If you mix objects of different classes in one vector, coercion occurs.
• Coercion means R converts all elements to one common class, in the order logical, integer, numeric, complex, character.
For example:
> qt <- c("Time", 24, "October", TRUE, 3.33) #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character
To check the class of any object, use the class(vector_name) function.
> class(qt)
[1] "character"
To convert the class of a vector, you can use the as. family of commands. Assign the result back, otherwise the original vector is unchanged.
> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"
• Similarly, you can change the class of any vector.
• If you convert a “character” vector to “numeric”, any element that cannot be read as a number becomes NA (with a warning).
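Both coercion rules can be checked in a few lines. This is a sketch with made-up values; suppressWarnings() is used only to keep the output quiet, since as.numeric() normally warns when it introduces NAs.

```r
# Implicit coercion: logical and integer are promoted to numeric
mixed <- c(TRUE, 2L, 3.5)
print(class(mixed))

# Explicit coercion: "abc" cannot be read as a number, so it becomes NA
values    <- c("10", "2.5", "abc")
converted <- suppressWarnings(as.numeric(values))
print(converted)
print(is.na(converted))
```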
R has various data types, which include vectors (numeric, integer etc.), matrices, data frames and lists. Let’s understand them one by one.
Data Type - R
7. Matrices:
When a vector is given a dimension attribute, i.e. a number of rows and columns, it becomes a matrix. A matrix is a 2-dimensional data structure represented by a set of rows and columns. It consists of elements of the same class. Let’s create a matrix of 3 rows and 2 columns:
> my_matrix <- matrix(1:6, nrow=3, ncol=2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
> dim(my_matrix)
[1] 3 2
> attributes(my_matrix)
$dim
[1] 3 2
As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes() command. To extract a particular element, row or column from a matrix, use the indices shown above. For example (try this at your end):
> my_matrix[,2] #extracts second column
> my_matrix[,1] #extracts first column
> my_matrix[2,] #extracts second row
> my_matrix[1,] #extracts first row
As an interesting fact, you can also create a matrix from a vector. All you need to do is assign its dimensions afterwards with dim().
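A minimal sketch of turning a vector into a matrix by assigning dim(). The vector name age and its values are hypothetical, chosen only for illustration.

```r
age <- c(23, 44, 15, 12, 31, 16)   # a plain numeric vector (made-up values)
dim(age) <- c(2, 3)                # 2 rows, 3 columns, filled column by column
print(age)
print(is.matrix(age))
```

The values fill the matrix column by column, so the first column holds 23 and 44.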
Data Type - R
8. Data Frame:
This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix. In a matrix, every element must have the same class. But a data frame is a list of equal-length vectors, so each column can have a different class. When you read tabular data into R, it is usually stored in the form of a data frame. Hence, it is important to understand the most commonly used commands on a data frame:
> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91), stringsAsFactors = TRUE) #stringsAsFactors = TRUE stores name as a factor (the default before R 4.0)
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91
> dim(df)
[1] 4 2
> str(df)
'data.frame': 4 obs. of 2 variables:
 $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
 $ score: num 67 56 87 91
> nrow(df)
[1] 4
> ncol(df)
[1] 2
Let’s understand the code above. df is the name of the data frame.
• dim() returns the dimensions of the data frame: 4 rows and 2 columns.
• str() returns the structure of a data frame, i.e. the list of variables stored in it and their types.
• nrow() and ncol() return the number of rows and the number of columns in a data set respectively.
Here you can see that “name” is a factor variable and “score” is numeric. In R, categorical variables are stored as factors.
In data science, a variable can be categorized into two types:
• Continuous and
• Categorical.
Data Type - R
9. Let’s Create a Vector, Matrix, Data Frame and List
• Create a list of anything you want
• Create a numeric vector from 1 to 5
• Create a character vector from a to d
• Create a matrix consisting of 10 rows and 10 columns
• Create a data frame of three columns and assign it to the variable df:
  • First column named ‘alpha’, with values from a to e
  • Second column named ‘numeric’, with values from 1 to 5
  • Third column named ‘missing’, with values (4,5,2,NA,NA)
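One possible solution to the exercise is sketched below; it is not the only valid answer. The names my_list, num_vec, char_vec and my_mat are illustrative, and the matrix contents (1 to 100) are an arbitrary choice because the exercise only fixes the dimensions.

```r
my_list  <- list(1, "two", TRUE)                   # a list of anything
num_vec  <- 1:5                                    # numeric vector 1 to 5
char_vec <- letters[1:4]                           # character vector a to d
my_mat   <- matrix(1:100, nrow = 10, ncol = 10)    # 10 x 10 matrix

# Data frame with the three columns required by the exercise
df <- data.frame(
  alpha   = letters[1:5],
  numeric = 1:5,
  missing = c(4, 5, 2, NA, NA)
)
print(df)
```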
Data Type – R : Exercise
10. A control structure ‘controls’ the flow of code / commands written inside a function.
A function is a set of multiple commands written to automate a repetitive coding task.
For example:
You have 10 data sets. You want to find the mean of the ‘Age’ column present in every data set. This can be done in 2 ways: either you write the code to compute the mean 10 times, or you create a function once and pass each data set to it.
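The second approach might look like the sketch below. The function name mean_age, the two small data sets and the use of na.rm = TRUE are assumptions made for illustration; only the ‘Age’ column comes from the example above.

```r
# Compute the mean of the Age column of one data set, ignoring missing values
mean_age <- function(data) {
  mean(data$Age, na.rm = TRUE)
}

# Two made-up data sets stand in for the ten mentioned above
datasets <- list(
  data.frame(Age = c(21, 35, 40)),
  data.frame(Age = c(18, NA, 22))
)

# Apply the same function to every data set
means <- sapply(datasets, mean_age)
print(means)
```

Writing the logic once means a later change (for example, how missing ages are handled) is made in a single place.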
Let’s understand the control structures in R with simple examples:
if, else – This structure is used to test a condition. Below is the syntax:
if (<condition>){
    ##do something
} else {
    ##do something else
}
Example
#initialize a variable
N <- 10
#check if this variable * 5 is > 40
if (N * 5 > 40){
    print("This is easy!")
} else {
    print("It's not easy!")
}
[1] "This is easy!"
Control Structure
11. for:
This structure is used when a loop is to be executed a fixed number of times. It is commonly used for iterating over the elements of an object (list, vector). Below is the syntax:
for (<variable> in <sequence>){
    #do something
}
Example
#initialize a vector
y <- c(99,45,34,65,76,23)
#print the first 4 numbers of this vector
for(i in 1:4){
    print(y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65
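When the loop should cover every element rather than a fixed count, seq_along() is a safer way to build the index than 1:length(y). The sketch below reuses the vector y from the example; the names total, empty and iterations are illustrative.

```r
y <- c(99, 45, 34, 65, 76, 23)

# Sum all elements by index; seq_along(y) is 1, 2, ..., length(y)
total <- 0
for (i in seq_along(y)) {
  total <- total + y[i]
}
print(total)

# With an empty vector the loop body never runs
empty <- numeric(0)
iterations <- 0
for (i in seq_along(empty)) {
  iterations <- iterations + 1
}
print(iterations)

# 1:length(empty) is 1:0, which counts down and would loop twice
bad_length <- length(1:length(empty))
print(bad_length)
```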
while:
It begins by testing a condition, and executes the loop body only if the condition is found to be true.
Once the body has been executed, the condition is tested again. Hence, it’s necessary to alter the condition inside the loop so that it doesn’t run forever. Below is the syntax:
while (<condition>){
    #do something
}
Example
#initialize a condition
Age <- 12
#check if age is less than 17
while(Age < 17){
    print(Age)
    Age <- Age + 1 #increment Age so that the condition eventually becomes false and the loop ends
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
Control Structure
12. Let’s Install All Required Packages Using a Loop and an If Condition
# Create a list of packages
packages <- list("XLConnect","RODBC","tm","rvest","data.table", "networkD3","webshot","ggplot2")
# Loop through the list
for (package in packages){
    # if condition, to check whether the package is already installed
    if (package %in% installed.packages()[,1]){
        print(paste(package,"available"))
    } else {
        install.packages(package, dependencies = TRUE, repos="http://cran.rstudio.com/")
    }
}
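An alternative to scanning installed.packages() is to ask R whether each package's namespace can be loaded, using requireNamespace(). This is a different technique from the loop above, shown only as a sketch: the helper name missing_packages is hypothetical, the last package name is deliberately made up, and note that requireNamespace() loads the namespace of any package it finds.

```r
# Return the subset of pkgs that cannot be loaded from the current library paths
missing_packages <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!available])
}

# "stats" and "utils" ship with R; the last name is deliberately made up
to_install <- missing_packages(c("stats", "utils", "notARealPackage12345"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```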
Control Structure : Exercise
13. Let’s now understand the concept of missing values in R. This is one of the most painful yet crucial parts of predictive modeling. You must be aware of the techniques to deal with them.
Missing values in R are represented by NA (not available); NaN (not a number) is also treated as missing. Now we’ll check whether a data set has missing values, using a data frame named df.
Create a data frame with NA:
> df = data.frame(name=c("a","b","c","d","e","f"),score=c(1,2,3,4,NA,NA))
> df
  name score
1    a     1
2    b     2
3    c     3
4    d     4
5    e    NA
6    f    NA
> is.na(df) #checks the entire data set for NAs and returns logical output
      name score
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE  TRUE
[6,] FALSE  TRUE
Missing Data
14. > table(is.na(df)) #returns a table of logical output
FALSE  TRUE
   10     2
> df[!complete.cases(df),] #returns the rows having missing values
  name score
5    e    NA
6    f    NA
Missing values hinder normal calculations on a data set. For example, let’s say we want to compute the mean of score. Since there are two missing values, it can’t be done directly. Let’s see:
> mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 2.5
The na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining values in the selected column (score). To remove rows with NA values from a data frame, you can use na.omit:
> new_df <- na.omit(df)
> new_df
  name score
1    a     1
2    b     2
3    c     3
4    d     4
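Two further checks are often useful alongside the commands above: counting NAs per column, and filling them in instead of dropping rows. The sketch below uses the same df; replacing NAs with the column mean is one simple technique among many and is not taken from the slides.

```r
df <- data.frame(name  = c("a", "b", "c", "d", "e", "f"),
                 score = c(1, 2, 3, 4, NA, NA))

# Number of missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Replace missing scores with the mean of the observed scores
imputed <- df
imputed$score[is.na(imputed$score)] <- mean(df$score, na.rm = TRUE)
print(imputed$score)
```

Unlike na.omit(), this keeps all six rows, at the cost of introducing estimated values.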
Missing Data
15. The data frame created in the Data Type exercise has ‘NA’ values in its ‘missing’ column.
First print the data frame: type df in the console.
Now use:
• is.na(dataframe) # to flag NAs in the data frame
• table(is.na(dataframe)) # to know if there is any NA in the data frame
• na.omit(dataframe) # to remove rows with NA from the data frame
Missing Data : Exercise
16. # Create a folder at some location
# Set working directory to the folder
setwd(enter_path_here)
From EXCEL:
#install.packages('XLConnect', dependencies=TRUE, repos='http://cran.rstudio.com/')
library(XLConnect)
wb = loadWorkbook("AAPL.xlsx")
ExcelData = readWorksheet(wb, sheet = "Sheet1", header = TRUE)
From CSV:
CSVData <- read.csv(file="AAPL.csv", header=TRUE, sep=",")
From Database (SQL Database):
#install.packages("RODBC", dependencies=TRUE, repos='http://cran.rstudio.com/')
library(RODBC)
conn <- odbcDriverConnect('driver={SQL Server};server=PRANAVSQLEXPRESS;database=CentralMI;trusted_connection=true')
SQLData <- sqlQuery(conn, "SELECT * FROM requestdetail;")
close(conn)
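To try read.csv() without a file on disk, the same function can parse CSV text passed through its text argument. The column names and values below are made up purely for illustration and are not the contents of AAPL.csv.

```r
# CSV content held in a string instead of a file (made-up values)
csv_text <- "id,value\n1,10.5\n2,20.0\n3,30.25"

sample_data <- read.csv(text = csv_text, header = TRUE, sep = ",")
str(sample_data)
```

After any import, str() is a quick way to confirm that each column was read with the expected type.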
Read Data
18. Steps to Manipulate:
1) Import data
2) Sub-setting data (removing unwanted data)
3) Selecting required columns
4) Selecting required rows
5) Merging other data for mapping
6) Grouping / aggregate
7) Exporting data to various outputs
(1) Import data
data(mtcars)
# Let’s look at the data
head(mtcars)
# The first column (the car names) has no header because it holds row names,
# so we load data.table and keep the row names as a column named ‘rn’
library(data.table)
table <- data.table(mtcars, keep.rownames = TRUE)
head(table)
(2) Sub-setting data (removing unwanted data)
# Now we require only the cars having 3 gears
table1 <- subset(table, gear == 3)
(3) Selecting required columns:
# We will select columns 1 to 3 and 10 to 12
table2 <- table1[, c(1:3, 10:12)]
(4) Selecting required rows:
# We will select rows 1 to 10
table3 <- table2[1:10, ]
(5) Merging other data for mapping
# Create mapping data frame
mapping <- data.frame(carb=c(1,2,3,5), name=c("a","a","b","c"))
Inner join:
merge(x = table3, y = mapping)
Outer join:
merge(x = table3, y = mapping, by='carb', all = TRUE)
Left outer:
merge(x = table3, y = mapping, by='carb', all.x = TRUE)
Right outer:
merge(x = table3, y = mapping, by='carb', all.y = TRUE)
Cross join:
# by = NULL gives a Cartesian product only for plain data frames, so convert first
merge(x = as.data.frame(table3), y = mapping, by = NULL)
# Let’s create an inner join and assign it to a variable named ‘table4’
table4 <- merge(x = table3, y = mapping)
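The join types differ mainly in which unmatched rows they keep, which shows up in the row counts. The sketch below uses two tiny made-up data frames (cars and mapping_small are illustrative names) and base R merge() on plain data frames.

```r
cars <- data.frame(model = c("m1", "m2", "m3"), carb = c(1, 2, 4))
mapping_small <- data.frame(carb = c(1, 2, 3), name = c("a", "a", "b"))

inner <- merge(cars, mapping_small, by = "carb")                # matches only
left  <- merge(cars, mapping_small, by = "carb", all.x = TRUE)  # all rows of cars
right <- merge(cars, mapping_small, by = "carb", all.y = TRUE)  # all rows of mapping_small
full  <- merge(cars, mapping_small, by = "carb", all = TRUE)    # all rows of both
cross <- merge(cars, mapping_small, by = NULL)                  # every combination

# Row counts for each join type
join_rows <- c(inner = nrow(inner), left = nrow(left), right = nrow(right),
               full = nrow(full), cross = nrow(cross))
print(join_rows)
```

Rows without a match are filled with NA, for example the name of the car with carb 4 in the left join.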
Manipulate Data
19. (6) Grouping / Aggregate
# Sum of ‘mpg’ column grouped by the newly merged column ‘name’
table5 <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=sum)
# Max of ‘mpg’ column grouped by the newly merged column ‘name’
Max <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=max)
# Min of ‘mpg’ column grouped by the newly merged column ‘name’
Min <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=min)
# Mean of ‘mpg’ column grouped by the newly merged column ‘name’
Mean <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=mean)
# Standard deviation of ‘mpg’ column grouped by the newly merged column ‘name’
Sd <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=sd)
# Median of ‘mpg’ column grouped by the newly merged column ‘name’
Median <- aggregate(table4$mpg, by=list(Category=table4$name), FUN=median)
# Summary statistics of every column in ‘table4’
Summary <- summary(table4)
(7) Exporting data to a CSV file:
write.csv(Summary, file = "F:/R Programming/test.csv", row.names = FALSE)
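aggregate() also accepts a formula, which some readers find easier to scan than the by=list(...) form. The sketch below is an alternative notation, not the approach used above; it groups the built-in mtcars data by cyl rather than by the merged ‘name’ column so that it runs on its own.

```r
data(mtcars)

# Mean mpg for each number of cylinders: "mpg by cyl"
mean_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mean_by_cyl)

# The same grouping with a different summary function
max_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = max)
print(max_by_cyl)
```

The result is a data frame with one row per group, with columns named after the variables in the formula.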
Manipulate Data
20. To CSV:
# x is the data frame to export
write.csv(x, file = "F:/R Programming/test.csv", row.names = FALSE)
To Excel:
#install.packages('XLConnect', dependencies=TRUE, repos='http://cran.rstudio.com/')
library(XLConnect)
# writeWorksheetToFile() takes the path of the output file as its first argument
writeWorksheetToFile("AAPL1.xlsx", data = x, sheet = "sheet1", startRow = 1, startCol = 1)
Write Data
21. The different parts of a function are −
Function Name − This is the actual name of the function. It is stored in R environment as an object with this name.
Arguments − An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are
optional; that is, a function may contain no arguments. Also arguments can have default values.
Function Body − The function body contains a collection of statements that defines what the function does.
Return Value − The return value of a function is the last expression in the function body to be evaluated.
R has many built-in functions which can be called directly in a program without defining them first. We can also create and use our own functions, referred to as user-defined functions.
Function
# Create a function with arguments.
myfunction <- function(a,b,c) {
    result <- a * b + c
    print(result)
}
# Call the function by position of arguments.
myfunction(10,11,22)
# function to take user input
readinteger <- function(){
    n <- readline(prompt="Enter an integer: ")
    return(as.integer(n))
}
print(readinteger()) # print input
# syntax of function
myfunction <- function(arg1, arg2, ... ){
    statements
    return(object)
}
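Two points from the description of function parts, default argument values and the last expression acting as the return value, can be seen in one short sketch. The function name scale_and_shift and its defaults are hypothetical.

```r
# b and c have default values, so only a must be supplied
scale_and_shift <- function(a, b = 2, c = 0) {
  a * b + c   # the last evaluated expression is the return value
}

r1 <- scale_and_shift(10)             # uses both defaults
r2 <- scale_and_shift(10, 11, 22)     # arguments matched by position
r3 <- scale_and_shift(c = 1, a = 3)   # arguments matched by name, b uses its default

print(c(r1, r2, r3))
```

Calling by name lets you skip arguments that have defaults and makes the call independent of argument order.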
22. function1 = function(path='F://R Programming', filename='AAPL.xlsx', outputfilename='AAPL1.xlsx', sheetname='Sheet1', skiprows=0, groupby=NULL, aggregate=sum)
{
    # skiprows, groupby and aggregate are accepted but not used yet
    # joining path and file name
    filedetail = paste(path, "/", filename, sep="")
    outputdetail = paste(path, "/", outputfilename, sep="")
    # install XLConnect if it is not already installed
    if (!('XLConnect' %in% installed.packages()[,1])) {
        install.packages('XLConnect', dependencies=TRUE, repos='http://cran.rstudio.com/')
    }
    # importing library
    library(XLConnect)
    # Loading workbook
    wb = loadWorkbook(filedetail)
    # Reading data from workbook
    df = readWorksheet(wb, sheet = sheetname, header = TRUE)
    # print the top few records for display
    print(head(df))
    # Summary of dataset
    df_summary = summary(df)
    # write to Excel file
    writeWorksheetToFile(outputdetail, data = df_summary, sheet = "sheet1", startRow = 3, startCol = 4)
}
#Calling function
function1()
Function – Build Own Function
23. # Set the working directory first; the chart file is saved there
setwd(enter_path_here)
# Let’s create chart with built in data set
# Step 1) Import data
data(mtcars)
# load library
library(data.table)
# use data.table library to get name of first column
table = data.table(mtcars, keep.rownames=TRUE)
# select 1st and 2nd column for charts
table1 = table[,1:2]
# Give the chart file a name
png(file = "barchart_Cars_mpg.png")
# data is ready for charts
h <- table1$mpg
y <- table1$rn
x <- table1$mpg
barplot(h, xlab="Cars", ylab="mpg", names.arg=y, col="red", main="Cars-mpg", border="red")
dev.off()
#syntax of barplot
barplot(H,xlab,ylab,main, names.arg,col)
• H is a vector or matrix containing numeric values used in the bar chart.
• xlab is the label for the x axis.
• ylab is the label for the y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
Charts
24. # Install networkD3 and webshot if they are not already installed
if (!('networkD3' %in% installed.packages()[,1])) install.packages('networkD3', dependencies=TRUE, repos='http://cran.rstudio.com/')
if (!('webshot' %in% installed.packages()[,1])) install.packages('webshot', dependencies=TRUE, repos='http://cran.rstudio.com/')
# Load package
library(networkD3)
library(webshot)
# create data:
set.seed(101)
links=data.frame(
source=c("A","A", "A", "A", "A","J", "B", "B", "C", "C", "D","I"),
target=c("B","B", "C", "D", "J","A","E", "F", "G", "H", "I","I")
)
# Plot
graph = simpleNetwork(links,
    Source = 1, # column number of source
    Target = 2, # column number of target
    height = 480, # height of frame area in pixels
    width = 480, # width of frame area in pixels
    linkDistance = 120, # distance between nodes. Increase this value to have more space between nodes
    charge = -480, # numeric value indicating either the strength of the node repulsion (negative value) or attraction (positive value)
    fontSize = 22, # size of the node names
    fontFamily = "serif", # font of node names
    linkColour = "#666", # colour of edges, MUST be a common colour for the whole graph
    nodeColour = "red", # colour of nodes, MUST be a common colour for the whole graph
    opacity = 0.8, # opacity of nodes. 0=transparent. 1=no transparency
    zoom = TRUE # allow zooming on the figure
)
saveNetwork(graph, file = '#252_interactive_network_chart1.html', selfcontained = TRUE)
Network chart with D3 using R – Sample 1
25. # libraries
library(networkD3)
# Load data
data(MisLinks)
data(MisNodes)
# Plot
forceNetwork(
    Links = MisLinks, Nodes = MisNodes,
    Source = "source", Target = "target",
    Value = "value", NodeID = "name",
    Group = "group", opacity = 0.8,
    linkDistance = JS('function(){d3.select("body").style("background-color", "#DAE3F9"); return 50;}')
)
Network chart with D3 using R – Sample 2