2. R
• What is R?
• Programming language meant for statistical analysis, data mining
• https://en.wikipedia.org/wiki/R_(programming_language)
• Why R?
• Effective data manipulation, storage, and graphical display
• Free of cost, open source
• Many packages contributed by experienced programmers and statisticians
• https://cran.r-project.org/web/packages/available_packages_by_name.html
• Simple and elegant code, easy to learn
• Microsoft is integrating R into SQL Server
• Problems:
• Memory management: data is held in RAM
• Speed
• Many developments are under way to address these problems.
Eswar Sai Santosh Bandaru
7. General Things:
• Case sensitive
• Shortcuts:
• CTRL+ENTER (important): send code from the editor to the console and execute it
• CTRL+2: move the cursor from the editor to the console
• CTRL+1: move the cursor from the console to the editor
• CTRL+UP in the console: retrieve previous commands
• # (hash) is used for commenting the code
• CTRL+SHIFT+C: comment/uncomment a block of code
9. Assignments and Expressions
• "<-" is the assignment operator in R
• a <- 3: the value 3 gets assigned to variable a
• Expressions
• Combination of numbers/variables/operators
• E.g., 2+3*a/14
• Order of evaluation: brackets -> exponentiation -> unary minus -> multiplication/division (equal precedence, left to right) -> addition/subtraction (equal precedence, left to right)
• E.g., 7*9/13 -- 4.846 (evaluated left to right as (7*9)/13)
• -2^0.5 -- -1.414 (evaluated as -(2^0.5), because exponentiation binds tighter than unary minus)
• (-2)^0.5 -- NaN
• Q1
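The precedence rules above can be checked directly in the console. This is a minimal sketch; the variable names r1 to r4 are only illustrative and do not come from the slides.

```r
a <- 3                   # assign 3 to a

r1 <- 2 + 3 * a / 14     # evaluated as 2 + ((3 * a) / 14)
r2 <- 7 * 9 / 13         # * and / have equal precedence: (7 * 9) / 13
r3 <- -2^0.5             # exponentiation first: -(2^0.5)
r4 <- (-2)^0.5           # square root of a negative number: NaN

print(c(r1, r2, r3, r4))
```

When in doubt about precedence, adding brackets makes the intended order explicit.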
10. Data Types
• Numeric: real numbers. E.g., 1.24, -3.12, 1
• Integer: integer values, written with the suffix L. E.g., 24L
• Character: E.g., 'a', "a", "Hello World!", "2"
• Logical: Boolean type. TRUE (1), FALSE (0); T and F are shorthand
• Complex: a+bi, where a and b are real numbers
• class(): function used to check the class of an object
• E.g., class(24) -- numeric
• E.g., class(24L) -- integer
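As a quick illustration of the five basic types, the sketch below calls class() on one value of each kind. The vector name types is an assumption made for this example.

```r
# One literal of each basic type, checked with class()
types <- c(
  class(1.24),       # "numeric"
  class(24L),        # "integer"
  class("Hello"),    # "character"
  class(TRUE),       # "logical"
  class(2 + 3i)      # "complex"
)
print(types)
```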
11. Data structures
• 4 main types:
• Vectors
• Matrices
• Lists
• Data frames
• We will discuss vectors and data frames in today's session
12. Vectors:
• One-dimensional collection of elements of the same kind (same data type)
• Vectors in R are similar to arrays in other programming languages
• Notation: (1,2,3,4,5), where 1,2,3,4,5 are called elements
• (1,2,3,4,5): numeric vector
• ('a','b','c','d'): character vector
• (T, F, T, T): logical vector
• (1L,2L,3L): integer vector
• (1,2,3,4,6) -- valid vector
• (1,'a',3,'t') -- mixed types, so not a valid vector as written (R does not throw an error; it coerces every element to a common type)
13. Creating vectors
• Basic ways:
• Using c()
• Using “:”
• Using seq()
• Using rep()
• Using vector()
14. c() combine function
• Syntax:
• X <- c(1,2,4,78,90) creates a numeric vector X with elements 1,2,4,78,90
• Y <- c('a','b','c','d') creates a character vector Y with elements 'a','b','c','d'
• Printing:
• X # auto printing
• print(X) # explicit printing
15. Using “:”
• x <- 20:50
• Creates an integer vector x with values from 20 to 50 in increments of 1
• Ending value > starting value: default increment is +1
• y <- 50:20
• Creates an integer vector y with values from 50 down to 20 in increments of -1
• Ending value < starting value: default increment is -1
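A short sketch of the ":" operator in both directions. It also checks the class of the result, which is integer when both endpoints are whole numbers.

```r
x <- 20:50      # 20, 21, ..., 50
y <- 50:20      # 50, 49, ..., 20

length(x)       # 31 elements
class(x)        # "integer"
head(y, 3)      # 50 49 48
```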
16. seq()
• X <- seq(2,50)
• Creates a vector from 2 to 50 with an increment of +1
• X <- seq(50,2)
• Creates a vector from 50 down to 2 with an increment of -1
• X <- seq(2,50,2)
• Creates a numeric vector from 2 to 50 with an increment of +2
• (2, 4, 6, 8, 10, ..., 50)
• The increment can also be negative if the starting element > ending element, e.g., seq(50,2,-2)
• X <- seq('a','b',2) throws an error
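The seq() variants can be compared side by side. In this sketch the names s_up, s_down and s_by are illustrative only; by is the documented name of the third argument of seq().

```r
s_up   <- seq(2, 50, 2)        # 2, 4, 6, ..., 50
s_down <- seq(50, 2, -2)       # 50, 48, 46, ..., 2
s_by   <- seq(2, 50, by = 2)   # same as s_up, with the increment named

length(s_up)                   # 25 elements
```

Naming the by argument makes the call easier to read than relying on argument position.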
17. rep()
• X <- rep(c(1,2,3), times = 2)
• Creates a numeric vector X: 1,2,3,1,2,3
• The vector gets repeated twice
• rep(1:3, each = 2)
• Output: 1,1,2,2,3,3
• Each element in the vector gets repeated twice
• rep(1:3, each = 2, times = 3)
• Output: 1,1,2,2,3,3, 1,1,2,2,3,3, 1,1,2,2,3,3
• 2 steps
• 1: each element gets repeated twice
• 2: the resulting vector gets repeated three times
• For other variations of rep, see ?rep
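The three rep() calls above, written out as a runnable sketch. The result names are assumptions made for this example.

```r
rep_times <- rep(c(1, 2, 3), times = 2)        # 1 2 3 1 2 3
rep_each  <- rep(1:3, each = 2)                # 1 1 2 2 3 3
rep_both  <- rep(1:3, each = 2, times = 3)     # 1 1 2 2 3 3, repeated three times

length(rep_both)                               # 18 elements
```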
18. Combining vectors
• X <- c(1,2,3,4,5)
• Y <- c(1,6,7,8)
• Z <- c(X,Y)
• Combines vectors X and Y and assigns the result to Z. Output: 1,2,3,4,5,1,6,7,8
• Q1 – Q8
23. Subsetting vectors
• X <- c('a', 'b', 'c', 'd', 'e', 'f')
• Index: 1 2 3 4 5 6
• X[1:3]: 'a' 'b' 'c' (prints the first three elements)
• Not the same as X[3:1], which returns 'c' 'b' 'a'
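A small sketch of index-based subsetting, showing that the order of the indices controls the order of the result. The names first_three and reversed are illustrative.

```r
X <- c("a", "b", "c", "d", "e", "f")

first_three <- X[1:3]    # "a" "b" "c"
reversed    <- X[3:1]    # "c" "b" "a"
picked      <- X[c(1, 5)]  # "a" "e": any vector of positions works

print(first_three)
print(reversed)
```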
28. Recycling
• 1:5 + 1
• Internally 1,2,3,4,5 + 1,1,1,1,1 (1 gets recycled 5 times to match the length of the longer vector, then the element-wise operation occurs)
• 1:6 + c(1,2)
• Internally 1,2,3,4,5,6 + 1,2,1,2,1,2 (c(1,2) gets recycled to match the length of the longer vector)
• c(1,2,3,4,5,6,7) + c(1,2,3,4) (gives a warning, because 7 is not a multiple of 4)
• Internally 1,2,3,4,5,6,7 + 1,2,3,4,1,2,3
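The recycling cases above, as a runnable sketch. The last line is expected to emit a warning that the longer length is not a multiple of the shorter one, but it still returns a result.

```r
rec1 <- 1:5 + 1                                   # 2 3 4 5 6
rec2 <- 1:6 + c(1, 2)                             # 2 4 4 6 6 8
rec3 <- c(1, 2, 3, 4, 5, 6, 7) + c(1, 2, 3, 4)    # warning; 2 4 6 8 6 8 10

print(rec3)
```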
29. Q12: Create vector q using element-wise operations
30. Subsetting a vector with logical vector
• Y <- c('a','b','c','d')
• Y[c(T,T,F,T)]
• Output: 'a' 'b' 'd' (an element is selected where the logical value is TRUE and skipped where it is FALSE)
• Recycling
• Y[c(T)]
• The vector T gets recycled until it matches the length of Y
• Every element gets selected
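A sketch of logical subsetting, including recycling of a shorter logical vector. The third call is an extra illustration that is not on the slide.

```r
Y <- c("a", "b", "c", "d")

sel_mask  <- Y[c(TRUE, TRUE, FALSE, TRUE)]   # "a" "b" "d"
sel_all   <- Y[TRUE]                         # TRUE is recycled: every element
sel_every <- Y[c(TRUE, FALSE)]               # mask recycled to T,F,T,F: "a" "c"

print(sel_mask)
```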
31. Comparison operators
• X <- c(1,2,3,4,5,6,7)
• X > 4 (X greater than 4)
• Outputs a logical vector with TRUE for values greater than 4 and FALSE for values less than or equal to 4
• Output: logical vector: F,F,F,F,T,T,T
• X[X > 4]
• Selects the elements of X that are greater than 4
• Output: 5,6,7
32. Conditional operators in R
• Conditional statements in R:
• x == y: checks for equality, outputs TRUE if equal, else FALSE
• x != y: checks for inequality
• x >= y: greater than or equal to
• x <= y: less than or equal to
• x < y: less than
• x > y: greater than
• Conditions can be combined using the & (and) and | (or) operators
• Q13-Q16
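Comparison operators return logical vectors, which can then be used for subsetting. The sketch below filters a vector with single and combined conditions; the result names are illustrative.

```r
X <- c(1, 2, 3, 4, 5, 6, 7)

mask    <- X > 4               # FALSE FALSE FALSE FALSE TRUE TRUE TRUE
big     <- X[X > 4]            # 5 6 7
middle  <- X[X > 2 & X < 6]    # both conditions hold: 3 4 5
extreme <- X[X == 1 | X >= 6]  # either condition holds: 1 6 7
not4    <- X[X != 4]           # everything except 4

print(middle)
```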
33. Coercion
• x <- c(1,2,'a',3) -- does not throw an error
• The other elements in the vector get coerced to character
• Output: '1','2','a','3'
• Priority for coercion: character > numeric > logical
• Logical values convert to 1 (TRUE) and 0 (FALSE)
• Explicit coercion:
• as.* functions
• as.character(1:20) # e.g., customer IDs
• X <- c('a','b','c','d')
• as.numeric(X) -- R produces NAs (with a warning)
• Output: NA, NA, NA, NA
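A sketch of implicit and explicit coercion. suppressWarnings() is used here only to keep the output quiet; without it, as.numeric() warns that NAs were introduced.

```r
mixed <- c(1, 2, "a", 3)          # numbers are coerced to character
class(mixed)                      # "character"

flags <- c(1, TRUE, FALSE)        # logicals are coerced to numeric: 1 1 0

ids <- as.character(1:20)         # explicit coercion, e.g. for ID columns

letters_num <- suppressWarnings(as.numeric(c("a", "b", "c", "d")))  # NA NA NA NA
print(letters_num)
```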
34. Some important functions
• which(): returns the indices of the vector at which the condition is satisfied
• X <- c(10,2,4,5,0)
• which(X > 2)
• Output: 1, 3, 4
• all(): returns a single logical value, TRUE if the condition is satisfied by all values in a vector
• all(X > 2): FALSE
• any(): returns a single logical value, TRUE if the condition is satisfied by any value in a vector
• any(X > 2): TRUE
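The three functions side by side, on the same vector as the slide. The result names are illustrative.

```r
X <- c(10, 2, 4, 5, 0)

idx      <- which(X > 2)   # positions where the condition holds: 1 3 4
all_over <- all(X > 2)     # FALSE, because 2 and 0 fail the condition
any_over <- any(X > 2)     # TRUE, because at least one value passes

print(idx)
```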
35. Attributes
• Attributes: give additional information about a vector and its elements
• E.g., names of elements, dimensions, levels
• attributes(x): shows all the available attributes of x
• If there are no attributes, R outputs NULL
• We can assign attributes to a created vector
• E.g., we can assign names to elements with the function names()
• names(x) <- student_names
• where student_names is a character vector containing the names of the students
36. Subsetting using names attribute
• X['Cory'] -- prints marks of Cory
• Conceptually, R finds the index whose name attribute is "Cory" (much like which() would)
• Then it subsets based on that index
• X[c('Cory','James')] -- prints marks of Cory and James
• Q16
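A sketch of assigning names and subsetting by name. The marks and the third student name are hypothetical values invented for this example.

```r
X <- c(78, 27, 91)                              # hypothetical marks
no_attrs <- attributes(X)                       # NULL: a plain vector has no attributes

student_names <- c("Cory", "James", "Maya")     # "Maya" is a made-up name
names(X) <- student_names

attributes(X)                                   # now shows $names
cory  <- X["Cory"]                              # marks of Cory
both  <- X[c("Cory", "James")]                  # marks of Cory and James
print(both)
```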
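37. Updating a vector: what if Cory's marks get updated?
• X[1] <- 35
• The element at index 1 gets updated to 35
• X[X < 30 & X > 25] <- 40
• All the values that are less than 30 and greater than 25 get updated to 40 (use the element-wise &, not &&)
• X["Cory"] <- 67
• The element named "Cory" gets updated to 67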
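The three kinds of update (by index, by condition, by name) can be sketched as follows. The starting marks and the name "Maya" are hypothetical.

```r
X <- c(Cory = 28, James = 45, Maya = 26)   # hypothetical named marks

X[1] <- 35                   # by index: Cory becomes 35
X[X < 30 & X > 25] <- 40     # by condition: only Maya (26) qualifies now
X["Cory"] <- 67              # by name: Cory becomes 67

print(X)
```

The element-wise & compares every element; && is meant for single TRUE/FALSE values and should not be used to build a subsetting mask.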
38. is.na() and mean imputation
• x <- c(1,2,4,NA,5,NA)
• is.na(x): produces a logical vector, TRUE if the element is NA, else FALSE
• Output: F F F T F T
• How do we replace each NA with the mean of the non-missing values?
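One possible way to do the mean imputation asked about above, as a minimal sketch: compute the mean while ignoring missing values (the na.rm argument of mean()), then assign it to the positions flagged by is.na().

```r
x <- c(1, 2, 4, NA, 5, NA)

missing <- is.na(x)                  # FALSE FALSE FALSE TRUE FALSE TRUE
x_mean  <- mean(x, na.rm = TRUE)     # mean of 1, 2, 4, 5 = 3

x[missing] <- x_mean                 # replace only the NA positions
print(x)                             # 1 2 4 3 5 3
```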
39. Factors
• Converts a numeric vector into categorical data
• X <- c(1,1,1,2,2,2,3,3,3)
• sum(X): 18
• X <- factor(X)
• sum(X): error
• levels(X): categories in X
• Output: "1" "2" "3"
• class(X)
• Output: factor
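A sketch of the factor conversion above. tryCatch() is used only so the example keeps running after sum() fails on the factor; the names Xf and sum_result are illustrative.

```r
X <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
total <- sum(X)                  # 18: X is still numeric

Xf <- factor(X)                  # now categorical
levels(Xf)                       # "1" "2" "3"
class(Xf)                        # "factor"

# sum() is not meaningful for factors, so it raises an error
sum_result <- tryCatch(sum(Xf), error = function(e) conditionMessage(e))
print(sum_result)
```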
40. Table function: frequency table
• Counts the number of times each element occurs in a vector
• X <- c('a','a','a','b','b','c','c')
• table(X):
• a - 3
• b - 2
• c - 2
• Useful when plotting a bar plot
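A sketch of building and querying a frequency table. The name counts is illustrative.

```r
X <- c("a", "a", "a", "b", "b", "c", "c")

counts <- table(X)      # frequency of each distinct value
print(counts)

counts[["a"]]           # 3: look up a single category by name
names(counts)           # "a" "b" "c"
```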
41. ls() and rm()
• ls(): lists all the objects in the current R session (environment)
• rm("d"): removes the object d
• rm(list = ls()): removes all objects from the environment
42. Data frames:
• Data frames are simply "tables" (rows and columns)
• All values within a column must be of the same data type (hence all the vector operations are valid for each column); different columns may have different types
• Creation
• X <- data.frame(data for column 1, data for column 2, ...)
• The columns get bound together side by side
• A data frame is 2-dimensional
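A minimal sketch of creating a data frame from two vectors. The column names and values are hypothetical, and stringsAsFactors = FALSE is set explicitly so the character column stays character regardless of R version.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya"),   # hypothetical students
  marks        = c(78, 27, 91),                # hypothetical marks
  stringsAsFactors = FALSE
)

dim(test)      # 3 rows, 2 columns
str(test)      # one type per column
```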
43. Subsetting data frames: why?
• Very useful for analyzing the data
• As a data frame is 2-dimensional, it has 2 indices: [row, column]
• test[3,2]: refers to the element in the 3rd row, 2nd column
• test[1:3,1:2]: first three rows, first two columns
• Using column names
• test$student_name: refers to the column student_name
• It is a vector, so we can perform all vector operations on it
• test["student_name"]: also refers to the column student_name, but returns a one-column data frame
• test["marks"]
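The difference between the subsetting forms can be seen in a small sketch. The data frame below is hypothetical and exists only to make the example self-contained.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya"),   # hypothetical data
  marks        = c(78, 27, 91),
  stringsAsFactors = FALSE
)

cell   <- test[3, 2]           # 3rd row, 2nd column: 91
block  <- test[1:2, 1:2]       # first two rows, first two columns
as_vec <- test$student_name    # a character vector
as_df  <- test["marks"]        # a one-column data frame

class(as_vec)                  # "character"
class(as_df)                   # "data.frame"
```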
44. Students with higher than average marks?
• above_average <- (test$marks > mean(test$marks))
• test$student_name[above_average]
• Two steps:
• above_average is a logical vector
• test$student_name[above_average] selects the students where the vector is TRUE
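The two steps above, run on a small hypothetical data frame. With these made-up marks the mean is about 65.3, so two of the three students are selected.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya"),   # hypothetical data
  marks        = c(78, 27, 91),
  stringsAsFactors = FALSE
)

# Step 1: logical vector, TRUE where marks exceed the mean
above_average <- test$marks > mean(test$marks)

# Step 2: use it to pick the matching names
top_students <- test$student_name[above_average]
print(top_students)            # "Cory" "Maya"
```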
45. Writing into csv
• write.csv(test, "test.csv")
• The file gets saved to the current working directory (folder) of the R session
• To find the working directory:
• Use getwd()
46. Reading a csv file
• setwd("directory path")
• read.csv("file name")
• Different functions read different file types
• dir(): lists all files in the current directory
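A round trip through a CSV file can be sketched as follows. To avoid touching the working directory, this example writes to a temporary folder via tempdir(); that choice, the data, and the argument row.names = FALSE (which leaves out the row-number column) are assumptions of the sketch, not part of the slides.

```r
test <- data.frame(
  student_name = c("Cory", "James", "Maya"),   # hypothetical data
  marks        = c(78, 27, 91),
  stringsAsFactors = FALSE
)

path <- file.path(tempdir(), "test.csv")       # temporary location for the demo

write.csv(test, path, row.names = FALSE)       # write the data frame
back <- read.csv(path, stringsAsFactors = FALSE)  # read it back

print(back)
```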
48. Dates and Times in R
• Dates are stored internally as the number of days since 1970-01-01, while times are stored internally as the number of seconds since 1970-01-01
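The internal representation can be inspected with as.numeric(). This sketch assumes the Date class for dates and the POSIXct class for times, which the slide does not name explicitly; the time zone is fixed to UTC so the second count is unambiguous.

```r
d <- as.Date("1970-01-10")
days_since_epoch <- as.numeric(d)        # 9 days after 1970-01-01

tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs_since_epoch <- as.numeric(tm)       # 60 seconds after the epoch

print(c(days_since_epoch, secs_since_epoch))
```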
49. Data Visualization in R: Using R base graphics
• 3 main graphics systems:
• base graphics
• ggplot2
• lattice
• Common plot types:
• Boxplots
• Barplots
• Histograms
• Scatter plots
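The four plot types can each be produced with a single base graphics call. The data below (marks, hours, grade) are invented purely for illustration.

```r
marks <- c(55, 62, 71, 48, 90, 66, 73, 58)          # hypothetical marks
hours <- c(4, 5, 7, 3, 10, 6, 8, 4)                 # hypothetical study hours
grade <- c("B", "B", "A", "C", "A", "B", "A", "C")  # hypothetical grades

boxplot(marks, main = "Boxplot of marks")           # spread and median
barplot(table(grade), main = "Students per grade")  # counts per category
hist(marks, main = "Histogram of marks")            # distribution
plot(hours, marks, main = "Marks vs. hours")        # scatter plot
```

Each call draws a new plot; in a non-interactive session the plots are written to a file rather than shown on screen.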