2. Background
• 1991: Created by Ross Ihaka and Robert Gentleman
• 2000: R version 1.0.0 is released
• Latest version is 2.15.2, released in Oct '12
• R version 2.15.3 is scheduled for release in Mar '13
and 3.0.0 is scheduled for release in Apr '13
• http://www.r-project.org (basic information about R)
• http://www.cran.r-project.org (base system and
additional packages)
• Getting help: help(topic) or ?topic; help.search("keyword") or ??keyword
3. Background
• R is a free software environment for
statistical computing and graphics
• Very active and vibrant user community
• Graphical capabilities
• Objects are held in physical memory
• Base R and around 4000 packages
1/21/2014
4. Introduction
memory.limit(): To find out the maximum amount of physical memory
available to R (Windows only)
memory.size(): To find out how much memory is in use (Windows only)
getwd(): Shows the path of your current working directory
setwd(path): Allows you to set a new path for your current
working directory
dir(): Lists all the files in your working directory
Program Editor (open, load, run, save)
• ls(): Lists all objects in your workspace
• rm(): Removes objects from your workspace
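The workspace functions above can be tried out as follows. This is only a minimal sketch; the object names demo_a and demo_b are hypothetical and chosen for illustration.

```r
# Inspect the current working directory and the first few files in it
wd <- getwd()
print(wd)
print(head(dir(wd)))

# Create two objects (hypothetical names), list the workspace, remove one
demo_a <- 1:3
demo_b <- "text"
print(ls())
rm(demo_b)
print(exists("demo_b"))   # demo_b is no longer in the workspace
```

setwd(path) is left out here on purpose, since it changes the state of the session; a typical pattern is old <- setwd(path) followed later by setwd(old).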
5. Introduction
Commands to R are expressions (4/3) or assignments (x <- 4/3)
R is case sensitive
Everything in R is an object
Normally R objects are accessed by their names, which are made up of letters,
digits (0 to 9) in non-initial positions, and periods (".").
Every object has a class
R has 5 basic classes of objects
character
numeric (real numbers)
integer
complex
logical (TRUE / FALSE)
The most basic object is a vector
A vector can only contain objects of the same class
6. Background
Ex.
x <- 1 # assignment
print(x) # explicit printing
x # auto printing
Ex.
x <- c(0.5, 0.6) # numeric
x <- c(TRUE, FALSE) # logical
x <- c(T, F) # logical
x <- c("a", "b", "c", "d") # character
x <- 1:20 # integer
x <- c(1+0i, 2+4i) # complex
seq(from=1, to=10, by=1)
rep(c(1,2,3,4,5), times=2, each=2)
Q. What is the class of x in each case?
x <- c(1.7, "a")
x <- c(TRUE, 2)
x <- c("a", TRUE)
When objects of different classes are mixed in a vector, coercion occurs so that every element in the
vector is of the same class
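The coercion rule can be checked directly. The sketch below uses hypothetical variable names (x1, x2, x3, bad) and also shows explicit coercion with the as.* functions, which is an addition not taken from the slide.

```r
# Implicit coercion: the result takes the most general class involved
x1 <- c(1.7, "a")    # numeric + character -> character
x2 <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
x3 <- c("a", TRUE)   # character + logical -> character
print(class(x1))
print(class(x2))
print(class(x3))

# Explicit coercion; values that cannot be converted become NA
print(as.character(0:2))
print(as.logical(0:2))
bad <- suppressWarnings(as.numeric(c("a", "b")))
print(bad)
```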
8. Introduction
• Ex.
x <- 0:6
x <- c("a", "b", "c")
x <- c(1, 2, 3)
Numbers in R are generally treated as numeric (i.e. double precision
real numbers)
If you explicitly want an integer, you need to specify the L suffix (e.g. 1L)
The special number Inf (1/0) represents infinity; it is actually a real number, and 1/Inf
gives 0
The undefined value NaN (0/0) means "Not a Number"; it can be thought of as
a missing value
• # indicates comments
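A short sketch of the number rules above; the variable names are hypothetical.

```r
a <- 1     # numeric (double precision) by default
b <- 1L    # the L suffix requests an integer
print(class(a))
print(class(b))

pos_inf <- 1 / 0     # Inf
zero    <- 1 / Inf   # 0
not_num <- 0 / 0     # NaN
print(is.infinite(pos_inf))
print(is.nan(not_num))
print(is.na(not_num))   # NaN also counts as missing
```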
9. Data Types
R objects can have attributes (attributes())
Class (class())
Length (length())
names (colnames for a matrix), dimnames (rownames, colnames for a matrix)
dimensions (dim())
other user defined attributes
Various data types in R
Vectors
vector(mode, length)
Lists: Special type of vector which can contain objects of different
classes.
x <- list(1, 2, 3, "a", "b", "c")
x <- list(a=c(1,2,3), b=1:4, c=c("a","b","c"))
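As a rough illustration of vectors, lists and attributes, the following sketch creates an empty vector and a named list and then inspects them; the names v and lst are hypothetical.

```r
# vector(mode, length) creates a vector filled with the mode's default value
v <- vector("numeric", length = 5)
print(v)

# A list can hold elements of different classes
lst <- list(a = c(1, 2, 3), b = 1:4, c = c("a", "b", "c"))
elem_classes <- sapply(lst, class)
print(elem_classes)

# Attributes of the list: here only the names
print(length(lst))
print(names(lst))
print(attributes(lst))
```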
10. Data Types
Matrix: vectors with dimension attribute. Dimension itself is an
integer vector of length 2 (nrow, ncol). Matrices are constructed
column wise.
m <- matrix(nrow=2, ncol=3)
m <- matrix(1:6, nrow=2, ncol=3)
x <- 1:3
y <- 10:12
cbind(x, y)
rbind(x,y)
Data frames (data.frame())
https://stat.ethz.ch/pipermail/rhelp/attachments/20101027/05a229bb/attachment.pl
Factors: Used for categorical data, e.g. Male & Female or analyst,
senior analyst, manager etc.
x <- factor(c("a", "b", "b", "c", "c", "c", "d"))
levels(x)
unclass(x)
levels(x[4:6])
levels(x[4:6, drop=TRUE])
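The matrix and factor snippets on this slide can be run as below. It is a sketch with hypothetical names (m, cb, rb, f); it reads the two broken levels(...) lines as referring to a subset of the factor, which is an assumption.

```r
# Matrices are filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))

# Binding vectors as columns or as rows
x <- 1:3
y <- 10:12
cb <- cbind(x, y)   # 3 rows, 2 columns
rb <- rbind(x, y)   # 2 rows, 3 columns

# Factors store integer codes plus a set of levels
f <- factor(c("a", "b", "b", "c", "c", "c", "d"))
print(levels(f))
print(unclass(f))
print(levels(f[4:6]))                # unused levels are kept
print(levels(f[4:6, drop = TRUE]))   # unused levels are dropped
```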
11. Date & Time
Converting a character variable to a date variable
as.Date(variable_name, input_format)
strptime(variable_name, input_format)
Output will be in the format %Y-%m-%d %H:%M:%S
%Y: Year with century
%m: Month as decimal number (01-12)
%d: Day of the month as decimal number (01-31)
%H: Hours as decimal number (00-23)
%M: Minutes as decimal number (00-59)
%S: Seconds as decimal number (00-59)
Converting a date variable to a character variable / formatting a date
variable
strftime(date_variable_name, output_format)
format(date_variable_name, output_format)
as.character(date_variable_name, output_format)
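A minimal sketch of the two directions of conversion. The input strings and the names d and ts_val are made up for illustration, and the time zone is fixed to UTC so the result does not depend on the local setting.

```r
# Character -> date: the format describes how the input is written
d <- as.Date("21/01/2014", format = "%d/%m/%Y")
print(class(d))
print(d)

# Character -> date-time
ts_val <- strptime("2014-01-21 14:30:00", "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(ts_val)

# Date -> character: the format describes the desired output
print(format(d, "%Y-%m-%d"))
print(strftime(d, "%d %m %Y"))
print(format(ts_val, "%H:%M"))
```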
12. Sub-setting
[ always returns an object of the same class as the original; can be
used to select more than one element
[[ is used to extract elements of a list or data frame; it can only be
used to extract a single element, and the class of the returned object
will not necessarily be a list or data frame
$ is used to extract elements of a list or data frame by name;
semantics are similar to [[
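The difference between the three operators shows up clearly on a data frame. The sketch below uses the built-in airquality data set; the names df, by_single, by_double and by_dollar are hypothetical.

```r
df <- head(airquality)

by_single <- df["Ozone"]     # [ keeps the class: a one-column data frame
by_double <- df[["Ozone"]]   # [[ extracts the column itself: a plain vector
by_dollar <- df$Ozone        # $ extracts by name, like [[

print(class(by_single))
print(class(by_double))
print(identical(by_double, by_dollar))

# [ can select more than one element at a time
print(names(df[c("Ozone", "Wind")]))
```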
13. Operators
<: Less than
<=: Less than or equal to
>: Greater than
>=: Greater than or equal to
==: Exactly equal to
!=: Not equal to
| or ||: OR
& or &&: AND
!: NOT
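A short sketch of the operators on a hypothetical vector x. The single forms & and | work element by element, while && and || are meant for single values, as in an if() condition.

```r
x <- c(1, 5, 10)

print(x > 3)             # FALSE TRUE TRUE
print(x != 5)            # TRUE FALSE TRUE
both   <- x > 3 & x < 8  # element-wise AND
either <- x < 3 | x > 8  # element-wise OR
print(both)
print(either)
print(!both)             # NOT

# && and || compare single values
single <- length(x) > 0 && x[1] == 1
print(single)
```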
14. Some Examples
x <- c("a", "b", "c", "c", "d", "a")
x[1], x[1:4], x[x > "a"], u <- x > "a"
x <- matrix(1:6,2,3)
x[1,2], x[1,], x[,1], x[1,2, drop=FALSE]
x <- list(var_1=c(1:10), var_2=c("a", "b", "c"), var_3=0.6)
x[1], x[[1]], x$var_1
name <- "var_1", x[name], x[[name]], x$name
x[c(1,3)], x[[c(1,3)]], x[[1]][[3]]
Produce a character vector containing var_1, var_2, var_3… var_999
Remove missing values from x <- c(1, 2, 3, NA, 4, 5, NA, 6)
Given y <- c("a", "b", NA, NA, "c", "d", "e", "f"), prepare a matrix containing
two columns x & y that does not have any missing values
What are the sum & mean of Wind for the observations that have
temperature greater than 60 & month equal to 5
How do you create a new directory with a given name
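One possible way to approach the exercises above is sketched here. It is not the only solution: it assumes the Wind question refers to the built-in airquality data set, and the directory name my_new_dir is hypothetical and created under tempdir() so nothing else is touched.

```r
# Character vector var_1 ... var_999
var_names <- paste0("var_", 1:999)

# Remove missing values from x
x <- c(1, 2, 3, NA, 4, 5, NA, 6)
x_clean <- x[!is.na(x)]

# Two-column matrix of x and y without any missing value
y <- c("a", "b", NA, NA, "c", "d", "e", "f")
keep <- !is.na(x) & !is.na(y)
xy_clean <- cbind(x, y)[keep, ]

# Sum and mean of Wind where Temp > 60 and Month == 5
sel <- airquality$Temp > 60 & airquality$Month == 5
wind_sum  <- sum(airquality$Wind[sel])
wind_mean <- mean(airquality$Wind[sel])

# Create a new directory with a given (hypothetical) name
new_dir <- file.path(tempdir(), "my_new_dir")
dir.create(new_dir)
```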
15. Reading / Writing Data Set
Principal functions for reading data into R
read.table(), read.csv(): Used for reading tabular data
readLines(): For reading lines of a text file
source(): For reading in R code files
dget(): For reading in R objects deparsed into text files (written by dput())
load(): For reading in saved workspaces
unserialize(): For reading single R objects in binary form
Principal functions for writing data to files
write.table()
writeLines()
dump()
dput()
save()
serialize()
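The reading and writing functions come in pairs. The sketch below round-trips a small hypothetical object through three of those pairs, using temporary files so that no real paths are assumed.

```r
obj <- list(id = 1:3, label = c("a", "b", "c"))

# dput() writes a text representation, dget() reads it back
txt_file <- tempfile(fileext = ".R")
dput(obj, file = txt_file)
obj_from_text <- dget(txt_file)

# save() writes named objects in binary form, load() restores them by name
rda_file <- tempfile(fileext = ".RData")
obj_copy <- obj
save(obj_copy, file = rda_file)
rm(obj_copy)
load(rda_file)   # obj_copy exists again

# writeLines() and readLines() work on plain text lines
lines_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), lines_file)
lines_in <- readLines(lines_file)
```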
16. Importing / Exporting Data
read.table() is one of the most commonly used functions for reading data.
A few important arguments:
file, name of the file to be read
header, logical indicating whether the file has a header line
sep, a string indicating how the columns are separated
colClasses, a character vector indicating the class of each column in the dataset
nrows, the maximum number of rows to be read from the dataset
na.strings, a character vector of strings which are to be interpreted as NA values
comment.char, a character string indicating the comment character
skip, number of lines to skip from the beginning
stringsAsFactors, logical indicating whether character variables should be coded as factors
write.table()
x, the object to be written, preferably a matrix or a data frame
file, path and name of the file to be created
sep, a string indicating how the columns are separated
row.names, col.names, logical indicating whether the row names or column names are to be
written along with x
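To see several of these arguments working together, the sketch below writes a small made-up data frame to a temporary file and reads it back. The column names and values are invented for illustration.

```r
df_out <- data.frame(id = 1:3,
                     name = c("a", "b", "c"),
                     score = c(0.5, NA, 0.7),
                     stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")
write.table(df_out, file = out_file, sep = ",",
            row.names = FALSE, col.names = TRUE)

df_in <- read.table(out_file, header = TRUE, sep = ",",
                    colClasses = c("integer", "character", "numeric"),
                    na.strings = "NA", stringsAsFactors = FALSE)
str(df_in)
```

Supplying colClasses spares read.table() from guessing the column types, which tends to help on larger files.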
17. Data Summary / Manipulation
• attach(x): For attaching a data frame (or list) to the search path, so that its variables can be accessed by name
detach(x): For detaching it again
• summary(x): For displaying summary statistics of a data set
• str(x): For compactly displaying the structure of a data set (variable types and
first few values) rather than the statistics shown by summary()
• sort(): For sorting a vector or factor
• order(): For ordering along more than one variable
• merge(): Merge two data frames by common columns or row names, or
do other versions of database join operations
• cut(x, breaks, labels): Divides the range of x into intervals and codes
the values in x according to which interval they fall into. The leftmost
interval corresponds to level one, the next leftmost to level two and
so on.
cut(x, 10, 1:10)
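A brief sketch of these functions on the built-in airquality data set. The two small data frames left and right are made up to show merge().

```r
str(airquality)
print(summary(airquality$Wind))

# sort() returns the sorted values; order() returns the row positions
v <- c(3, 1, 2)
v_sorted <- sort(v)
aq_sorted <- airquality[order(airquality$Month, -airquality$Temp), ]
print(head(aq_sorted))

# merge() joins on the common column; by default only matching rows are kept
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))
both  <- merge(left, right, by = "id")
print(both)

# cut() turns a numeric variable into a factor with 10 interval levels
bins <- cut(airquality$Temp, breaks = 10, labels = 1:10)
print(table(bins))
```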
18. Data Summary / Manipulation
• pretty(x, n): Computes a sequence of about n+1 equally spaced 'round'
values which cover the range of the values in x.
pretty(x, 100)
• substr(x, start, stop) <- value: Extracts or replaces substrings in a
character vector.
• strsplit(): Splits the elements of a character vector x into substrings
according to the matches to substring split within them.
• rank(): Returns the sample ranks of the values in a vector. Ties (i.e.,
equal values) and missing values can be handled in several ways
• aggregate(): Splits the data into subsets, computes summary statistics
for each, and returns the result in a convenient form.
• ddply() (plyr package): For each subset of a data frame, applies a function and then combines
the results into a data frame.
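The base R functions on this slide can be sketched as follows, with made-up inputs. ddply() is left out because it needs the plyr package to be installed; aggregate() covers a similar split-and-summarise task.

```r
# About n + 1 round values covering the range of the data
p <- pretty(c(0, 97), n = 5)
print(p)

# Extract and replace substrings
s <- "statistics"
first_four <- substr(s, 1, 4)
substr(s, 1, 1) <- "S"
print(s)

# Split on a separator; the result is a list with one element per input string
parts <- strsplit("a,b,c", split = ",")[[1]]

# Ties receive the average rank by default
r <- rank(c(10, 30, 20, 20))
print(r)

# Mean Wind per Month
agg <- aggregate(Wind ~ Month, data = airquality, FUN = mean)
print(agg)
```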
19. Control Structures
Allows you to control the flow of execution of the program
if, else (testing a condition)
if (condition) {do something} else if (another condition) {do something different} else {do something
different}
for (executing a loop fixed number of times)
for (i in 1:10) { do something}
while (executing a loop while a condition is true)
while (condition) { do something}
repeat (execute an infinite loop)
break (break the execution of a loop)
next (skip an iteration of a loop)
return (exit a function)
Create a vector with all integers from 1 to 1000 and replace all even
numbers by their inverse
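The control structures above, written out as a runnable sketch with hypothetical variable names. The last part is one possible answer to the exercise, using the vectorised ifelse() rather than a loop.

```r
x <- 7
if (x > 10) {
  size <- "large"
} else if (x > 5) {
  size <- "medium"
} else {
  size <- "small"
}

# for with next: add up the odd numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next
  total <- total + i
}

# while: runs as long as the condition holds
n <- 0
while (n < 3) n <- n + 1

# repeat needs an explicit break
k <- 0
repeat {
  k <- k + 1
  if (k >= 5) break
}

# Exercise sketch: replace even numbers in 1..1000 by their inverse
v <- 1:1000
v <- ifelse(v %% 2 == 0, 1 / v, v)
print(head(v))
```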
20. Loop Functions
lapply: Returns a list of the same length as X, each element of which
is the result of applying FUN to the corresponding element of X
lapply(airquality, mean)
Calculate sum of all the variables of the airquality dataset excluding NAs
sapply: sapply is a user-friendly version of lapply, by default returning
a vector or matrix if appropriate
sapply(airquality, mean)
Repeat the problem posed for lapply using sapply and see the difference
apply: Returns a vector or array or list of values obtained by applying
a function to margins of an array or matrix
apply(airquality, 1, sum)
Calculate deciles including min and max of all the variables of the dataset
airquality excluding NAs
Calculate square of each element of a matrix with dimensions 10 & 2 and
entries 1 to 20
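A sketch of how the exercises on this slide might be approached; other solutions are possible. Extra arguments such as na.rm = TRUE are passed through to the applied function.

```r
# Column sums excluding NAs: lapply returns a list, sapply a named vector
sums_list <- lapply(airquality, sum, na.rm = TRUE)
sums_vec  <- sapply(airquality, sum, na.rm = TRUE)
print(class(sums_list))
print(sums_vec)

# Deciles (including min and max) of every variable, excluding NAs
deciles <- sapply(airquality, quantile,
                  probs = seq(0, 1, by = 0.1), na.rm = TRUE)
print(dim(deciles))

# Square each element of a 10 x 2 matrix with entries 1 to 20
m <- matrix(1:20, nrow = 10, ncol = 2)
squares <- apply(m, c(1, 2), function(v) v^2)
print(head(squares))
```

For this last task the vectorised m^2 gives the same result; apply() is used only to illustrate margins.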
21. Loop Functions
tapply: Apply a function to each cell of a ragged array, that is to
each (non-empty) group of values given by a unique combination of
the levels of certain factors
tapply(airquality$Ozone, airquality$Month, sum)
Calculate sum of Ozone variable for observations having month equals
to 5
mapply: mapply is a multivariate version of sapply. mapply applies
FUN to the first elements of each argument, the second elements,
the third elements, and so on
mapply(rep, 1:4, 4:1)
Calculate sum of two lists with dimensions 10 & 2 and having entries 1
to 20, 101 to 120, 201 to 220 & 301 to 320
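A sketch of tapply and mapply on the examples above. The last part is only one possible reading of the final exercise: it assumes two lists that each hold two 10 x 2 matrices; the names list_a and list_b are hypothetical.

```r
# Sum of Ozone per Month, ignoring NAs
ozone_by_month <- tapply(airquality$Ozone, airquality$Month, sum, na.rm = TRUE)
print(ozone_by_month)

# The same figure for month 5, computed directly
may_sum <- sum(airquality$Ozone[airquality$Month == 5], na.rm = TRUE)

# mapply pairs up the arguments: rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
reps <- mapply(rep, 1:4, 4:1)
print(reps)

# Element-wise sum of two lists of 10 x 2 matrices (assumed layout)
list_a <- list(matrix(1:20, 10, 2), matrix(101:120, 10, 2))
list_b <- list(matrix(201:220, 10, 2), matrix(301:320, 10, 2))
sums <- mapply(function(a, b) a + b, list_a, list_b, SIMPLIFY = FALSE)
print(sums[[1]][1, ])
```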
22. Plotting Functions
plot(x,y)
hist(x)
par()
pch: plotting symbol
lty: line type
lwd: line width
col: plotting color
las: axis label orientation
bg: background color
mar: margin size
oma: outer margin size
mfrow: number of plots per row, column (plots are filled row-wise)
mfcol: number of plots per row, column (plots are filled column-wise)
23. Plotting Functions
lines: add lines to the plot
points: add points to the plot
text: add text labels to the plot
title: add annotations to x, y axis labels, title, subtitle, outer
margin
mtext: add text to the margins of the plot
axis: adding axis ticks/labels
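The plotting functions and par() settings from the last two slides can be combined as in the sketch below. It draws into a temporary PDF file (a hypothetical choice, so that it also runs without a display) and uses the built-in airquality data set.

```r
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

# Two plots side by side, with custom margins and horizontal axis labels
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

# Keep only rows where both variables are present
ok <- !is.na(airquality$Wind) & !is.na(airquality$Ozone)
wind  <- airquality$Wind[ok]
ozone <- airquality$Ozone[ok]

plot(wind, ozone, pch = 19, col = "blue", xlab = "Wind", ylab = "Ozone")
lines(lowess(wind, ozone), lty = 2, lwd = 2, col = "red")   # smoothed trend
points(mean(wind), mean(ozone), pch = 4)                    # mark the mean
text(mean(wind), mean(ozone), labels = "mean", pos = 4)
title(main = "Ozone vs Wind")

hist(airquality$Temp, main = "", xlab = "Temp")
mtext("airquality", side = 3, line = 0.5)

par(old_par)
dev.off()
```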
24. Functions
function(arguments) { body }
Argument matching order: Exact match -> Partial match -> Positional match
Return value of a function is the last expression in the function body
to be evaluated
Functions can be nested, so that a function can be defined inside
another function
Functions can be passed as arguments to other functions
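The points above in a short sketch; the function names power and make_multiplier are hypothetical.

```r
# The last evaluated expression is the return value
power <- function(base, exponent = 2) {
  base^exponent
}

by_position <- power(3)                       # positional match, default exponent
by_name     <- power(exponent = 3, base = 2)  # exact name match, any order
by_partial  <- power(2, exp = 3)              # exp partially matches exponent

# A function defined inside another function
make_multiplier <- function(k) {
  function(v) v * k
}
triple <- make_multiplier(3)

# A function passed as an argument to another function
tripled <- sapply(1:3, triple)
print(tripled)
```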
25. Debugging
• Primary tools for debugging functions in R
traceback: prints out the function call stack after an error occurs; does
nothing if there is no error
debug: flags a function for debug mode which allows you to step through
execution of a function one line at a time
browser: suspends the execution of a function wherever it is called and
puts the function in debug mode
trace: allows you to insert debugging code into a function at specific
places
recover: allows you to modify the error behavior so that you can browse
the function call stack
26. Debugging
Indications that something is not right
message: a generic notification/diagnostic message produced by the
message function; execution of the function continues
warning: an indication that something is wrong but not necessarily
fatal, produced by the warning function; execution of the function
continues
error: an indication that a fatal problem has occurred, produced by the
stop function; execution stops
condition: a generic concept for indicating that something
unexpected can occur; programmers can create their own conditions
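The four kinds of signal can be produced and caught as in this sketch. The function check_value and the condition class my_condition are hypothetical names used only for illustration.

```r
check_value <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")   # error: execution stops
  if (x < 0) warning("x is negative")             # warning: execution continues
  message("checking ", x)                         # message: execution continues
  sqrt(abs(x))
}

res_ok <- check_value(4)

# tryCatch() hands the signalled condition to a matching handler
res_warn <- tryCatch(check_value(-4),
                     warning = function(w) conditionMessage(w))
res_err  <- tryCatch(check_value("a"),
                     error = function(e) conditionMessage(e))

# A user-defined condition with its own class
cond <- simpleCondition("custom signal")
class(cond) <- c("my_condition", "condition")
res_custom <- tryCatch(signalCondition(cond),
                       my_condition = function(c) "handled")
```

Note that catching a warning with tryCatch() abandons the rest of the call; withCallingHandlers() is the usual alternative when execution should carry on.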