R Programming for Data Science
Sovello Hildebrand Mgani
sovellohpmgani@gmail.com
2
Outline
● History of R
● Installation (Windows and Linux)
● Data Types
● Reading Data:
– Tabular
– Large datasets
● Textual Data Formats
● Subsetting:
– Lists, Matrices, Partial matching
– Removing missing values
3
Outline
● Vectorized operations
● Control Structures
– if-else
– for, while, repeat, next, break
● Functions
– Scoping
● Dates and Times
● Loop functions
– lapply, tapply, apply, mapply, split
● Simulation and profiling
– Generating random numbers, simulating a linear model, random sampling
● Visualizations
4
History of R
● Originates from S language. S was initiated in
1976 as an internal statistical analysis
environment—originally implemented as
Fortran libraries
– History of S:
http://www.stat.bell-labs.com/S/history.html
● R development history:
– https://en.wikipedia.org/wiki/R_(programming_language)
5
R and Statistics
● R developed from S, which is a statistical analysis tool, and so is R
● Its functionality is divided into modules (packages)
– You need to load the relevant package to use its functionality
● Has more sophisticated graphics capabilities than most other statistical packages
● Useful for interactive work: run from terminal
● Contains a powerful programming language for developing new tools
– Tools: for visualizations and analysis
6
Design of the R System
● The “base” system, downloaded from CRAN
● “All other stuff”
● Packages in R
– The “base” system has the base package, which is required to run R and has the most fundamental functions
– Other packages contained in the “base” system: utils, stats, datasets, graphics, grDevices, tools, etc. Most of these are attached automatically when R starts; the rest (e.g. tools) must be loaded before use
– Recommended packages: boot, class, cluster, codetools, foreign, lattice, etc.
– Load packages with library() or require()
7
R Resources
●
CRAN:
– http://cran.r-project.org
●
Quick-R: a book
– http://www.statmethods.net/
●
R bloggers (platform): not a social network
– R-Bloggers is about empowering bloggers to empower
other R users
– R-Bloggers.com is a blog aggregator of content
contributed by bloggers who write about R (in English)
– https://www.r-bloggers.com/
8
Installation of R: Ubuntu
●
Run from terminal:
– sudo apt-get install r-base r-base-dev
●
If this doesn’t work, then you need
– To add the repositories:
 sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a
/etc/apt/sources.list
– Add the keyring:
 gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
 gpg -a --export E084DAB9 | sudo apt-key add -
– Install R-Base
 sudo apt-get update; sudo apt-get install r-base r-base-dev
●
You can install from a PPA which has the most recent versions
– Add the PPA
 sudo add-apt-repository ppa:marutter/rrutter
– Install R-Base
 sudo apt-get update; sudo apt-get install r-base r-base-dev
9
Installation of R: Windows
● Visit CRAN
– https://cran.r-project.org/
● CRAN: Comprehensive R Archive Network
10
Installation of R: Windows
Click/Select Download R for Windows
11
Installation of R: Windows
Then click/select base or install R for the first time
12
Installation of R: Windows
● Then click/select Download R X.X.X for Windows
● After the download has finished, locate the
downloaded file and install.
13
RStudio: www.rstudio.com
14
RStudio: Introduction
●
RStudio is a set of integrated tools designed to
help you be more productive with R.
●
How?
– It includes a console,
– syntax-highlighting editor that supports direct
code execution,
– a variety of robust tools for

plotting,

viewing history,

debugging and

managing your workspace.
15
RStudio: Installation
● From the RStudio home page, go to Products
then select RStudio
– Then scroll down and click
Download RStudio Desktop
– Then click Download under RStudio Desktop
Personal License.
– Select RStudio for your platform. Clicking on the
link will download the file directly.
– Locate the file in your system Downloads folder
and start the installation.
16
RStudio: Parts
The Console is where you
write and run code
interactively
The Files tab shows all the files and folders in
your default workspace as if you were on a
PC/Mac window.
The Plots tab will show all your graphs.
The Packages tab will list a series of packages or
add-ons needed to run certain processes.
For additional info see the Help tab
The Environment tab shows all
the active objects
The History tab shows a list of
commands used so far
17
RStudio: Working Directory
● It is important to organize all files for a
particular project under one main/parent
directory
● A working directory in RStudio is where all
the files for a particular project are stored
● All paths used in the console to load data files
and scripts are relative to the working
directory.
18
RStudio: Working Directory
● To set the working directory:
– Start RStudio the same way you start other programs on your computer
– From the File menu select New Project, then New Directory, then Empty Project; type the directory name (rprogramming); then, under “Create project as subdirectory of”, click Browse and select Desktop
19
R: Getting Started
● A few basic commands to try in the console
– getwd(): get the current working directory
– setwd("/path/to/directory"): set the working directory to the specified path
– install.packages("package_name"): install a package. Requires an internet connection
– library(package_name), require(package_name): load and attach add-on packages
– ?object: show documentation/help for an object, e.g. ?mtcars
– summary(object): summarize an object such as a dataset, e.g. summary(mtcars)
● Every time you run library(package_name) and get the error “there is no package called ‘package_name’”, you need to install the package first and then call library() on it.
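The commands above can be tried together in one short session. The sketch below is only an illustration and is not from the original slides: it assumes nothing beyond base R and the built-in mtcars dataset, and the helper name is_installed is hypothetical.

```r
# getwd() returns the current working directory as a single string.
wd <- getwd()
print(wd)

# Hypothetical helper: TRUE if a package is installed (it is not attached).
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# Install only when the package is missing (this needs an internet
# connection), then attach it. "stats" ships with R, so the install
# branch is skipped here; substitute the package you actually need.
if (!is_installed("stats")) {
  install.packages("stats")
}
library(stats)

# summary() on a data frame gives a per-column summary.
cars_summary <- summary(mtcars)
print(cars_summary)
```

Checking before installing avoids downloading a package again every time the script runs.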
20
Data Visualizations in R: Introduction
● R has different systems (packages) for making graphs (visualizations)
● Here we are going to use ggplot2, which is more elegant and versatile than many of the others (ggvis, rgl, htmlwidgets, googleVis, etc.)
● ggplot2 is built upon “The Layered Grammar of Graphics”
21
Data Visualizations in R: Tidyverse
● Tidyverse is a set of packages
– The packages work in harmony
 Reason: they share common data representations and API design.
● The tidyverse package makes it easy to install and load the core packages in a single command
● To install it, run: install.packages("tidyverse")
● To use it, run: library(tidyverse), which loads the tidyverse core packages, including ggplot2, tibble, tidyr, readr, purrr, and dplyr.
– Look up each of these packages to learn what it does
22
Data Visualizations: First Steps
● library(tidyverse) loads all the core packages from tidyverse
● It also reports any conflicts with base R or other packages that arise from loading these packages.
● e.g. in this case filter() and lag() from dplyr mask the functions of the same name in the stats package
● In such a case you may need to call a function explicitly from its package, in the form package::function()
● e.g. ggplot2::ggplot() calls the ggplot() function from the ggplot2 package.
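To make the package::function() form concrete, here is a minimal sketch that uses only packages shipped with R. It calls the stats version of filter() by its full name, which keeps working even when dplyr is attached and masks it. The input values are made up for illustration.

```r
# stats::filter() applies a linear filter; with three equal weights it is a
# centred 3-point moving average. The stats:: prefix picks this version even
# if dplyr::filter() is masking the bare name filter().
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
print(moving_avg)

# The prefix also works for any exported function, attached or not.
middle <- stats::median(c(3, 1, 2))
print(middle)
```

The first and last values of the moving average are NA because a centred window does not fit at the ends of the series.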
23
Data Visualizations: First Steps
● Which is more fuel efficient: cars with big engines or cars with small engines?
● The mpg data frame:
– A data frame is a rectangular collection of variables (in columns) and observations (in rows)
 The mpg data frame in ggplot2 contains observations collected by the US Environmental Protection Agency on 38 models of cars.
● Run ?mpg from the console to learn more about the data set.
24
First Steps Creating a ggplot
● To answer the question about fuel efficiency, plot fuel efficiency (hwy, highway miles per gallon: y-axis) against engine size (displ, in litres: x-axis)
● See the magic of this command:
– ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
25
First Steps Creating a ggplot
> ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with bigger engines use more fuel.
26
Creating a ggplot
● In ggplot2,
– You begin with the function ggplot()
 ggplot() creates a coordinate system that you can add layers onto.
 The first argument is the data set that you are going to use for plotting
– To complete the graph, add more layers to the coordinate system created by ggplot()
 The geom_point() function adds a layer of points to the plot (which creates a scatter plot in this case)
 Each geom function in ggplot2 takes a mapping argument, which defines how variables are mapped to visual properties.
 The mapping argument is always paired with aes()
– The x and y arguments of aes() specify which variables to map to the x and y axes.
– ggplot2 looks for the mapped variables in the data argument, in this case mpg
27
Creating a ggplot: Template
● A graphing template for ggplot
● You can get a list of <GEOM_FUNCTION>s by
following this link (http://docs.ggplot2.org/current/)
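The template itself was an image and is missing from this transcript. A commonly used form of the ggplot2 template, assumed here to be what the slide showed, is ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)). The sketch below fills it in with the mpg example used on the earlier slides.

```r
library(ggplot2)

# Template:  ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
# Filled in: <DATA> = mpg, <GEOM_FUNCTION> = geom_point,
#            <MAPPINGS> = x = displ, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

print(p)  # draws the scatter plot on the active graphics device
```

Swapping a different geom function or a different set of mappings into the same three slots is all it takes to get a different plot.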
28
ggplot: Aesthetics Mappings
● Look at the graph and note the circled dots
● What is special with these big engine cars?
29
ggplot: Aesthetics
● Ggplot Aesthetic mappings can help answer the
question
● An aesthetic is a visual property of the objects in a
plot.
– These are things like size, shape or color of points.
●
You can therefore display a point in different ways by
changing the values of its aesthetic properties.
●
You can convey information about your data by
mapping the aesthetics in your plot to the variables in
your dataset.
– e.g. you can map the colors of your points to the class
variable to reveal the class of each car.
30
ggplot: Aesthetics
●
New plot with aesthetics for class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
●
Try for year and manufacturer and look at the trends
31
ggplot: Aesthetics
● Other aesthetics:
– size: the size of each point reflects the value of the mapped variable (best suited to ordered variables)
– alpha: controls the transparency of the points
– shape: controls the shape of the points
 Exercise: try plotting the same geom with these different aesthetics
● ggplot2 takes care of selecting a reasonable scale to use with the aesthetic and constructs a legend
32
ggplot: Aesthetics
● The aesthetic properties of a geom can be set
manually.
– For example:
 ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
– Will set all points to blue
– Note color is outside the aes()
33
ggplot: Facets
34
ggplot: Facets
● When the data has categorical variables, it is possible to split the plot into facets.
● Facets are subplots that each display a subset of the data.
● To facet a plot by a single variable, use the function facet_wrap(formula, …)
– The formula is created with ~ followed by a variable name
– Here “formula” is the name of a data structure in R, not a synonym for equation.
– The variable should be discrete.
35
ggplot: Facets
● For example:
– ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "red") + facet_wrap(~ class, nrow = 3)
● This will produce one subplot for each value of mpg$class, arranged in three rows.
36
ggplot: Facets
● Can we facet the plot using two discrete variables?
● Do this:
– ?facet_grid
– ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)
 In the plot, why do we have empty sub-plots?
37
ggplot: Facets
● Hack:
– With facet_grid(), what happens when you use a . in place of one of the variables?
– Is there an advantage of faceting over the color aesthetic? Any disadvantages? What if the dataset is very large?
– In facet_wrap(), what do nrow and ncol do?
– When using facet_grid(), put the variable with more unique levels in the columns (RHS of the formula). Why?
 Why doesn’t facet_grid() have nrow and ncol arguments?
38
ggplot2::Geometric objects (geoms)
● These are the geometric objects used to represent the
data.
– e.g. bar geoms, point geoms, line geoms, smooth geoms,
etc.
● To change the geom in your plot, change the geom
function (geom_xxx())
●
For example:
– ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
– ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))
● Not every aesthetic works with every geom
– e.g. you can set the shape of a point, but not the shape of a line
– Read: ?geom_point, ?geom_smooth
39
ggplot2: geoms
● ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
●
Try:
– ggplot(data = mpg) +
geom_line(mapping = aes(x = displ, y = hwy, linetype = drv))
40
ggplot2: geoms
● Plot:
– ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
– ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))
 What is the difference? Which is better? Why?
41
Ggplot2: combined geoms
● Can we use more than one geom on the same plot?
●
Try:
– ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
● When using multiple geoms on the same plot you can use global mappings:
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
 This makes the code easier to read and modify.
42
ggplot2: combined geoms
●
When you use global mappings and set some mappings in a geom function,
these mappings will be treated as local to this layer only.
●
For example:
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
43
ggplot2: combined geoms
●
In the same way, you can specify different data
for each layer.
– Say you only want to fit a smooth line for one class of
cars
– ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
– Hack:

can we plot more than one of the same
geom?
– Try a smooth geom with different car class
44
Ggplot2: combined geoms
45
Combined Geoms: exercise
46
Ggplot2: geoms
● How many geoms does ggplot2 have?
– Visit this page:
https://www.rstudio.com/resources/cheatsheets/

Look for Data Visualization Cheat Sheet
● ggplot2 extensions provide more geoms to use.
Take a look at available extensions from
this gallery (http://www.ggplot2-exts.org/gallery/)
●
47
ggplot2: statistical transformations
● Read: ?diamonds
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
– Where does count come from?
48
Statistical Transformations
● Some plots plot raw values
– e.g. scatterplots,
● Some plots use calculated values
– bar charts, histograms, and frequency polygons bin
your data and then plot bin counts, the number of
points that fall in each bin.
– smoothers fit a model to your data and then plot
predictions from the model. (Remember regression lines)
– boxplots compute a robust summary of the
distribution and then display a specially formatted
box.
–
49
Statistical Transformation
● The algorithm used to calculate new values for a graph is called a stat (short for statistical transformation)
● You can check which stat a geom uses by looking at the default value of its stat argument.
– geom_bar() uses stat_count() by default. Thus you can recreate the bar chart by running
 ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
● Every geom has a default stat, and vice versa. This means that you can typically use geoms without worrying about the underlying statistical transformation.
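One way to see where count comes from is to compare what the stat computed with a count done by hand. The sketch below uses ggplot2's layer_data() to pull out the data after the stat has run; the object names bars, computed and by_hand are hypothetical.

```r
library(ggplot2)

bars <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() returns the layer's data after the stat has been applied;
# its count column holds the bar heights that geom_bar() draws.
computed <- layer_data(bars)

# The same counts computed by hand with base R.
by_hand <- as.vector(table(diamonds$cut))

print(data.frame(cut = levels(diamonds$cut),
                 stat_count = computed$count,
                 by_hand = by_hand))
```

The two count columns agree, which shows that the bar heights are computed from the data rather than read from a column in it.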
50
Statistical Transformation
● You can explicitly specify a stat:
– When you want to override the default stat
 e.g. run
demo <- tribble(
~a, ~b,
"bar_1", 20,
"bar_2", 30,
"bar_3", 40
)
 Then run
ggplot(data = demo) +
geom_bar(mapping = aes(x = a, y = b), stat = "identity")
51
Statistical Transformation
● Reasons to explicitly specify a stat (continued):
– You want to override the default mapping from transformed variables to aesthetics.
 ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
– This will draw a bar chart of proportion instead of count (from ggplot2 3.4.0 on, after_stat(prop) is the preferred spelling of ..prop..)
52
Position Adjustments
● A bar chart can be colored in either of two
ways: color and fill.
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
53
Position Adjustments
● Check what the following plots look like
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
– ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
– ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
– ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position =
"dodge")
54
Position Adjustments
● Learn more about position adjustments
– ?position_dodge,
– ?position_fill,
– ?position_identity,
– ?position_jitter
– ?position_stack
55
Position Adjustments:overplotting.
● Recall: ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
– It displays fewer points than the 234 observations in the dataset (can you count them?)
– The values of displ and hwy are rounded, so the points fall on a grid and many of them overlap. This problem is called overplotting.
● You can avoid this gridding by setting the position adjustment to "jitter"
– position = "jitter" adds a small amount of random noise to each point
– Since two points are unlikely to receive the same amount of noise, the points get spread out.
● Jittering makes the graph less accurate at small scales, but more revealing at large scales.
● In ggplot2 the shorthand for geom_point(position = "jitter") is geom_jitter()
56
Position Adjustments: jitter
● ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
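The amount of noise can also be controlled explicitly. The sketch below uses position_jitter() with illustrative width, height and seed values (these numbers are assumptions chosen for the example, not recommendations from the slides).

```r
library(ggplot2)

# width and height bound the noise added in each direction (in data units);
# seed makes the jitter reproducible from one run to the next.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 42))

# The geom_jitter() shorthand takes the same width and height arguments.
q <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.1, height = 0.5)

# With a fixed seed, building the plot twice gives the same positions.
first_build <- layer_data(p)$x
second_build <- layer_data(p)$x
```

Fixing the seed is useful when a plot goes into a report, since otherwise the points move slightly every time the plot is redrawn.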
57
Thank You! Asanteni!
58
Working with Data
● In this part we are going to learn how to work with your data.
– Getting data
 Importing your own data
 Tidying data
– How to work with different data types:
 Relational data,
 Strings,
 Factors,
 Dates and Times
59
Importing Data
● For importing files, we will use the readr package, which is one of the tidyverse core packages.
● Most readr functions turn flat files into data frames. A data frame is a tabular data format with rows and columns. It is a list of vectors of equal length.
– read_csv(): reads comma-separated files
– read_csv2(): reads semicolon-separated files
– read_tsv(): reads tab-delimited files
– read_delim(): reads files with any delimiter
● Activity:
– Check what read_table(), read_fwf() and read_log() do.
60
Importing Data: read_csv()
● The first argument is the path to the file to read
– read_csv("data/students.csv")
● read_csv() prints out a column specification
● read_csv() by default uses the first row as the column names
– Use skip = n to skip the first n lines if they contain data you don’t need (most likely metadata)
– Use comment = "#" to drop all lines that start with #, for example
– Use col_names = FALSE so that read_csv() doesn’t treat the first row as the column names
● Missing values in R are represented by NA. By default read_csv() treats empty fields and the string NA as missing. When loading files where missing values are coded differently, use the na argument, e.g. na = "." if missing values are coded as a period.
– What will this line do?
read_csv("students.csv", skip = 2, comment = "//", col_names = FALSE, na = "-")
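To check an answer to the question above, the call can be run against a small file. The file contents below are invented purely for illustration (they are not the students.csv from the slides), and show_col_types = FALSE assumes readr 2.0 or later.

```r
library(readr)

# Write a small, made-up file to a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Exported from the registry",  # metadata line 1, removed by skip = 2
  "Term 1",                      # metadata line 2, removed by skip = 2
  "// name,score",               # comment line, dropped by comment = "//"
  "Asha,81",
  "Juma,-"                       # "-" marks a missing score
), path)

students <- read_csv(path, skip = 2, comment = "//",
                     col_names = FALSE, na = "-",
                     show_col_types = FALSE)
print(students)
```

Because col_names = FALSE, readr names the columns X1 and X2, and the "-" in the last row is read as NA.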
61
Importing Data: Parsing
● The parse_*() functions:
– ?parse_logical, ?parse_integer, ?parse_date
● The parse functions take a character vector and return a more specialized vector.
– A character vector can hold any text: letters, digits, or both, e.g. "dLab", "2013", "xyz3", "12.09"
– A specialized vector holds values of one specific type, say only integers, only decimal numbers, or only dates; the parse functions convert the strings into that type
● A vector in R is an ordered collection of values of the same type; c() combines values into a vector
– For example
names <- c("John", "Jean", "Giovanni", "Joni")
dates_of_birth <- c("2012-12-31", "1988-05-02", "1990-01-06")
62
Importing Data: Parsing
● What happens to the following?
parse_integer(c("1", "231", ".", "456"), na = ".")
x <- parse_integer(c("123", "345", "abc", "123.45"))
● parse_logical() and parse_integer() parse logicals and integers respectively. There is little that can go wrong with these parsers, so they are not described further here.
● parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.
● parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.
● parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.
● parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
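A few of the parsers described above can be tried directly. This is a minimal sketch with made-up input strings; the object names are hypothetical.

```r
library(readr)

# "." is declared as the missing-value marker, so it becomes NA.
ints <- parse_integer(c("1", "231", ".", "456"), na = ".")

# parse_double() is strict, but a locale can change the decimal mark.
dbl <- parse_double("1,23", locale = locale(decimal_mark = ","))

# parse_number() is flexible: it ignores non-numeric text around the number.
nums <- parse_number(c("$100", "20%"))

# parse_factor() builds a factor from a known set of levels.
fruit <- parse_factor(c("apple", "banana"), levels = c("apple", "banana"))

print(ints)
print(dbl)
print(nums)
print(fruit)
```

When a value cannot be parsed, readr returns NA for it and records the failure, which problems() will list.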
63
Importing Data: parsing
● One important thing to watch when parsing characters is the encoding. UTF-8 is the most common, and stating it explicitly may save you hours of fixing problems. Specify it when parsing characters, like
x <- "El Niño was particularly bad this year"
parse_character(x, locale = locale(encoding = "utf-8"))
● ?parse_datetime, ?parse_date, ?parse_time
●
Generate correct format strings to parse each of the following
dates and times
– d1 <- "January 1, 2010"
– d2 <- "2015-Mar-07"
– d3 <- "06-Jun-2017"
– d4 <- c("August 19 (2015)", "July 1 (2015)")
– d5 <- "12/30/14" # Dec 30, 2014
– t1 <- "1705"
– t2 <- "11:15:10.12 PM"
64
Importing Data: parsing files
● example_file <- read_csv(readr_example("challenge.csv"))
●
Use the problems() function to look at any issues with the
import
– problems(example_file)
● Specify the column types explicitly when reading the file
example_file <- read_csv(readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
● Use tail(dataframe, n = X) and head(dataframe, n = X) to look at the last and first X rows of the data frame.
65
Parsing files
● One more strategy to get the column types is
to use the guess_max option when reading in a
file.
example_file2 <- read_csv(readr_example("challenge.csv"),
guess_max = 1001)
66
Writing to a file
● If you want to save the data to a file you can use either write_csv() or write_tsv(), where you need to specify
 The data frame you are saving
 The file path (location) where to save it
 Optionally:
– you can set how missing values are written with na
– you can also append to an existing file with append = TRUE
67
Parsing Files
● Group Activity
– Download the dataset “Number of Trainees with Special Needs enrolled in Vocational Training Centres” from http://opendata.go.tz
 Read it into a data frame and do some manipulations, including making some plots
– Inspect
 read_rds() and write_rds() and see where you can use these functions
– Explore these packages:
 haven, readxl, DBI
68
Tidy Data
● A tidy dataset has these features
– Each variable is in its own column
– Each observation is in its own row
– Each value is in its own cell
● ?gather, ?spread
● Missing values:
– Can be explicit: stated with NA
– Can be implicit: simply not present in the data
● With gather(…, na.rm = TRUE) you can drop explicit missing values while gathering
● You can use the complete() function to make implicit missing values explicit in tidy data.
– ?complete
69
Case Study
● Optionally download the data from http://www.who.int/tb/country/data/download/en/
● Load the data from the file or from the package: tidyr::who
● Looking at the data:
– country, iso2, and iso3 are redundant: all three identify the country
– year is clearly a variable
– The other columns have unclear names; look them up in the data dictionary
70
Case Study (continued)
● Gather all the other columns, removing all missing values
– who1 <- who %>%
gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
● Look at the structure of the values in the new key column by counting
– who1 %>%
count(key)
– Use the data dictionary for the definition of the keys
● Make the key names consistent by replacing newrel with new_rel
– who2 <- who1 %>%
mutate(key = stringr::str_replace(key, "newrel", "new_rel"))
● Separate the key variable into different columns
– who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
● Look at the new column
– who3 %>%
count(new)
● Drop the new column because it is constant
– who4 <- who3 %>%
select(-new)
● Separate sexage into sex and age
– who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
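The whole sequence can be rehearsed on a tiny table before running it on the full WHO data. The sketch below is an illustration under stated assumptions: the two-row table toy is invented, and it uses pivot_longer(), which current tidyr documents as the successor to gather(), in place of the gather() call on the slide.

```r
library(tidyr)
library(dplyr)

# A made-up table with the same shape as tidyr::who, cut down to two
# case-count columns.
toy <- tibble::tibble(
  country = c("A", "B"),
  year = c(2010, 2010),
  new_sp_m014 = c(5, NA),
  newrel_f65 = c(NA, 7)
)

tidy_toy <- toy %>%
  # Step 1: gather the case columns, dropping missing values.
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "key", values_to = "cases",
               values_drop_na = TRUE) %>%
  # Step 2: make the key names consistent.
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  # Step 3: split the key at each underscore.
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  # Step 4: drop the constant column.
  select(-new) %>%
  # Step 5: split sexage after its first character.
  separate(sexage, c("sex", "age"), sep = 1)

print(tidy_toy)
```

Chaining the steps with %>% gives the same result as creating who1 to who5 one at a time, without leaving intermediate objects behind.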
71
72
Writing Code in R
● Create new objects with <-, in the format object_name <- object_value
● The <- symbol is the assignment operator
● Examples:
– first_name <- "Sovello"
– date.of.birth <- "12/31/1980"
– PlaceOfBirth <- "Njombe"
– AGE <- 37
– x = 200 * 5
● Object names must start with a letter.
● Object names can only contain letters, numbers, underscore (_), and period (.)
– Look at the examples above
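The naming rules can be checked from the console. The sketch below uses base R's make.names(), which rewrites an invalid name into a valid one, to show which of three made-up names break the rules.

```r
# Both <- and = assign at the top level; <- is the conventional choice.
first_name <- "Sovello"
x = 200 * 5

# make.names() repairs invalid names: a name may not start with a digit
# and may not contain spaces.
repaired <- make.names(c("first_name", "2nd_try", "place of birth"))
print(repaired)
```

A name that comes back unchanged was already valid; the other two were altered because they broke a rule.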
73
Writing code in R
● You can look at what is in an object by typing its name
● You can also print an object explicitly
– print(first_name)
[1] "Sovello"
 The [1] shown in the output indicates that first_name is a vector and "Sovello" is its first element.
74
Writing code in R
● All text values (strings) must be enclosed in double or single quotes ("value" or 'value')
– Look at the definition of place.of.birth in the screenshot
● Typos matter when using object names. Case matters too: surname and Surname are not the same.
● The # character indicates a comment. Anything to the right of # is ignored by R
● R has no multi-line comment syntax; start each comment line with #
75
Group Exercise (5min)
● What is wrong with this code snippet?
Surname <- "Mkulima"
surname
● If you start typing a value for an object and press Enter before the closing quote or parenthesis, the console will look like
college <- "College of informatics
+
– A + means R is waiting for you to continue typing. What would you do to fix, stop or escape from the problem?
●
Fix errors in this piece of code until it works
library(tidyverse)
ggplot(dota = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
fliter(mpg, cyl = 8)
76
R Objects
● R has five basic (atomic) classes of objects
– Character
– Numeric (real numbers)
– Integer
– Complex
– Logical (TRUE/FALSE)
● The most basic object in R is a vector. An empty vector can be created with vector()
● A vector can only contain objects of the same class.
● Numbers are generally treated as numeric objects
– If you want an integer, you have to specify the L suffix explicitly.
 1L is an integer
 1 is a real number
77
R Objects
● Inf is a special number which represents
infinity.
– You can use Inf in calculations like 1/Inf
● Creating vectors
● Use the c() function to create vectors
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
78
Coercion of R objects
● You can explicitly coerce objects using the as.* functions: ?as.integer, ?as.character, ?as.logical, ?as.numeric
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
●
If R fails to coerce an object, it produces NAs.
> x <- c("a", "b", "c")
> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA
79
R Objects: Matrices
● Matrices are vectors with a dimension attribute.
● The dimension attribute is an integer vector of length 2 (number of rows, number of columns)
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
80
Matrices
● Matrices are constructed column-wise, so entries start at the “upper left” corner and run down the columns
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
● You can create matrices from vectors by adding a dimension attribute
> m <- 1:10
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
● Every element of a matrix must be of the same class (e.g. all integers or all numeric).
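Column-wise filling is only the default. As a small base R sketch, the byrow argument of matrix() fills across the rows instead, and single elements are addressed as [row, column].

```r
# Default: fill down the columns.
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: fill across the rows instead.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Elements are addressed as [row, column].
corner <- by_col[2, 3]

print(by_col)
print(by_row)
print(corner)
```

Both matrices hold the same six values and have the same dimensions; only the order in which the cells were filled differs.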
81
Group work
● What do cbind() and rbind() do?
● Create 3 vectors and 3 matrices.
● Create 3 matrices from vectors
● Create 2 matrices using cbind() and
rbind()
● Read about R lists: how to create using
list()
82
R Objects: Factors
● Factors represent categorical data
● Factors can be ordered or unordered
● Factor objects can be created with the
factor() function
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
no yes
2 3
83
Factors
● Say you want to sort a vector
> x1 <- c("Dec", "Apr", "Jan", "Mar")
> sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"
● The goal was to see the months in calendar order: Jan, Mar, Apr, Dec
● To solve this problem we can make use of factors
– Create a vector of month levels
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
● Then create a factor that uses the month levels.
> y1 <- factor(x1, levels = month_levels)
● Sorting the new variable produces the values in calendar order
> sort(y1)
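Put together, the steps above run as follows. The last two lines are an addition to the slide's example: they show what happens to a value that is not among the levels, using a deliberately misspelled month.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

alphabetical <- sort(x1)                  # character sort: alphabetical
y1 <- factor(x1, levels = month_levels)
calendar <- sort(y1)                      # factor sort: follows the levels

print(alphabetical)
print(calendar)

# A value that is not one of the levels silently becomes NA.
typo <- factor(c("Jam", "Mar"), levels = month_levels)
print(typo)
```

Because unknown values turn into NA without a warning, it is worth checking a new factor with is.na() after creating it.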
84
R Objects: missing values
● Missing values are denoted by NA; NaN (“not a number”) denotes the result of undefined mathematical operations such as 0/0
– is.na() is used to test objects if they are NA
– is.nan() is used to test for NaN
● NA values have a class also, so there are integer NA, character NA, etc.
● A NaN value is also NA but the converse is not true
> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
● What is the difference between a missing value (NA) and zero?
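A short base R sketch makes both points concrete: that NaN counts as NA but not the other way round, and that NA behaves very differently from zero in a calculation. The numbers are made up for illustration.

```r
x <- c(1, NaN, NA, 0)

na_flags <- is.na(x)     # NaN counts as NA
nan_flags <- is.nan(x)   # NA does not count as NaN

# NA means "unknown", so it propagates through calculations;
# zero is an ordinary, known value.
mean_with_na <- mean(c(1, NA, 3))                 # NA: result is unknown
mean_dropped <- mean(c(1, NA, 3), na.rm = TRUE)   # 2: NA is left out
mean_with_zero <- mean(c(1, 0, 3))                # 4/3: zero is counted

print(na_flags)
print(nan_flags)
print(c(mean_with_na, mean_dropped, mean_with_zero))
```

Replacing NA with 0 therefore changes the result, which is why the two should not be treated as interchangeable.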
85
R Objects:Data Frames
● Data frames store tabular data in R
● Data frames are represented as a special type
of list where every element of the list has to
have the same length.
● Each element of the list can be thought of as a
column and the length of each element of the
list is the number of rows.
● Unlike matrices, data frames can store
different classes of objects in each column.
86
Data Frames
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
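The description of a data frame as a list of equal-length columns can be verified directly. This is a minimal base R sketch using the same x as above.

```r
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

# Unlike a matrix, each column keeps its own class.
col_classes <- sapply(x, class)

# A data frame is a list whose elements (columns) all have the same length.
is_a_list <- is.list(x)
col_lengths <- lengths(x)

# Columns can be pulled out like list elements.
foo_values <- x$foo
bar_values <- x[["bar"]]

print(col_classes)
print(col_lengths)
```

Each entry of col_lengths equals nrow(x), which is the equal-length requirement in practice.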
87
Writing Code in R
● Scripts:
– Turning interactive code into scripts
88
Data Transformation
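● Filter rows with filter()
– Comparisons: >, >=, <, <=, !=, ==
sqrt(2) ^ 2 == 2 is FALSE because of floating-point rounding; use near(sqrt(2) ^ 2, 2) instead
– Logical operators
And &
Or | (x %in% y is shorthand for several == comparisons joined by |, e.g. 2 %in% c(1, 2, 3, 4))
Not !
– To detect missing values use is.na(x)
● Ordering: use arrange()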
Reading Data: large datasets
● With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.
– Read the help page for read.table, which contains many hints
– Make a rough estimate of the memory needed to hold the dataset; if it is more than the RAM on your computer, stop there
– Set comment.char = "" if there are no commented lines in your file.
– Use the colClasses argument. Specifying this option instead of using the default can make read.table() run MUCH faster, often twice as fast. You have to know the class of each column
– Set nrows. This doesn’t make R run faster but it helps with memory usage.
90
Reading large datasets
● A quick way to figure out the classes of each
column is the following:
> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)
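The hints above can be tried end to end on a throwaway file. Everything in this sketch is assumed for illustration: the file is generated on the spot, and the 1,500,000 by 120 table in the memory estimate is a made-up size, counted at 8 bytes per numeric value.

```r
# Rough memory estimate for an all-numeric table: rows x columns x 8 bytes.
est_bytes <- 1500000 * 120 * 8
est_gb <- est_bytes / 2^30
print(est_gb)

# Create a small stand-in for a large file.
path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = 1:500, score = runif(500),
             label = sample(letters, 500, replace = TRUE)),
  path, row.names = FALSE
)

# Read a few rows to discover the column classes, then reuse them.
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

full <- read.table(path, header = TRUE, colClasses = classes,
                   nrows = 500, comment.char = "")
print(classes)
```

Passing colClasses spares read.table() from guessing the class of every column, which is where much of the time goes on large files.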
91
Control Structures
● Control structures allow you to control the flow of execution of a series of R expressions.
● Control structures allow you to put some
“logic” into R code, rather than just always
executing the same R code every time.
● Control structures allow you to respond to
inputs or to features of the data and execute
different R expressions accordingly.
92
Control Structures: if-else
●
This if-else structure allows you to test a condition and act on it depending on
whether it’s true or false
– You can use the if statement on its own
if(<condition>) {
## do something
}
## Continue with rest of code
●
Or use the complete if-else
if(<condition>) {
## do something
} else {
## do something else
}
●
You can have a series of tests by following the initial if with any number of else ifs.
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
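As a runnable version of the if / else if / else template, here is a minimal sketch. The function name `describe_value` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: classify a number with if / else if / else.
describe_value <- function(x) {
  if (x < 0) {
    "negative"
  } else if (x == 0) {
    "zero"
  } else {
    "positive"
  }
}

print(describe_value(-2))
print(describe_value(0))
print(describe_value(5))
```

Note that each else sits on the same line as the closing brace before it. At the top level of a script or the console, an else that starts a new line is a syntax error, because R treats the if as already complete.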
93
Example: if-else
● ## Generate a uniform random number
x <- runif(1, 0, 10)
if(x > 3) {
y <- 10
} else {
y <- 0
}
●
This is the same as executing
y <- if(x > 3) {
10
} else {
0
}
94
Control Structures: for
● For loops are the most commonly used looping construct in R (while and repeat are also available)
for( x in sequence ){
##Execute code
}
● For one line loops, the curly braces are not
strictly necessary.
– > x <- c("a", "b", "c", "d")
> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
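A for loop can walk over indices or over the elements themselves. The sketch below shows both, plus one reason to prefer seq_along() over 1:length(x); the variable names are illustrative.

```r
x <- c("a", "b", "c", "d")

# Loop over indices; seq_along(x) is safe even when x is empty.
by_index <- character(0)
for (i in seq_along(x)) {
  by_index <- c(by_index, x[i])
}

# Loop over the elements directly.
by_element <- character(0)
for (letter in x) {
  by_element <- c(by_element, letter)
}

# With an empty vector, 1:length(x) would be c(1, 0) and run twice;
# seq_along() gives an empty sequence, so the body never runs.
empty_iterations <- 0
for (i in seq_along(character(0))) {
  empty_iterations <- empty_iterations + 1
}

print(by_index)
print(by_element)
print(empty_iterations)
```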
95
Control Structures: while
● While loops begin by testing a condition
● If it is true, the loop body is executed and
the condition is tested again; this repeats
until the condition is false
> count <- 0
> while(count < 10) {
print(count)
count <- count + 1
}
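A while loop runs forever if its condition never becomes false, so it can help to combine the real condition with a cap on the number of iterations. A minimal sketch, with `max_steps` as a hypothetical safety limit:

```r
# Halve a value until it drops below 1, with a cap on iterations as a safety net.
value <- 100
steps <- 0
max_steps <- 1000  # hypothetical guard against an accidental infinite loop
while (value >= 1 && steps < max_steps) {
  value <- value / 2
  steps <- steps + 1
}
print(steps)
print(value)
```

Conditions joined with && are evaluated left to right, so the second test is only checked when the first one is true.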
96
Control Structures: next
● next skips the rest of the current iteration and moves on to the next one
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
97
Control Structures: break
● break exits the loop immediately,
regardless of which iteration the loop is on.
for(i in 1:100) {
print(i)
if(i > 20) {
## Stop once i exceeds 20 (so 1 to 21 are printed)
break
}
}
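next and break are often used together in one loop. The following sketch, with illustrative variable names, skips even numbers and stops as soon as a running total passes a threshold.

```r
# Sum the odd numbers from 1 upward, stopping once the total passes 50.
total <- 0
visited <- integer(0)
for (i in 1:100) {
  if (i %% 2 == 0) {
    next            # skip even numbers and go to the next iteration
  }
  total <- total + i
  visited <- c(visited, i)
  if (total > 50) {
    break           # leave the loop entirely
  }
}
print(visited)
print(total)
```

Although the loop is written over 1:100, break ends it long before the sequence is exhausted.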
98
Functions
99
Functions: scoping
100
Dates and Times
101
Loop functions
102
Simulating and Profiling
103
Vectorized Operations

R programming for data science

  • 1. R Programming for Data Science Sovello Hildebrand Mgani sovellohpmgani@gmail.com
  • 2. 2 Outline ● History of R ● Installation (Windows and Linux) ● Data Types ● Reading Data: – Tabular – Large datasets ● Textual Data Formats ● Subsetting: – Lists, Matrices, Partial matching – Removing missing values
  • 3. 3 Outline ● Vectorized operations ● Control Structures – If-else – For, while, repeat, next break ● Functions – Scoping ● Dates and Times ● Loop functions – lapply, tapply, apply, mapply, split, ● Simulation and profiling – Generating random numbers, simulating a linear model, random sampling ● Visualizations
  • 4. 4 History of R ● Originates from S language. S was initiated in 1976 as an internal statistical analysis environment—originally implemented as Fortran libraries – History of S: http://www.stat.bell-labs.com/S/history.html ● R development history: – https://en.wikipedia.org/wiki/R_(programming_la nguage )
  • 5. 5 R and Statistics ● R developed from S which is a statistical analysis tool, and so is R ● Its functionality is divided into modules – Need to load a module for different functionalities ● Has very sophisticated graphics capabilities than most other statistical packages ● Useful for interactive work: run from terminal ● Contains a powerful programming language for developing new tools – Tools: for visualizations and analysis
  • 6. 6 Design of the R System ● The “base” system, downloaded from CRAN ● “All other stuff” ● Packages in R – The “base” has the base package required to run R and has the most fundamental functions – Other packages contained in the “base”. Need to load these to be able to use them: utils, stats, datasets, graphics, grDevices, tools, etc. – Recommended packages: boot, class, cluster, codetools, foreign, lattice, etc. – Load packages with library(), or require()
  • 7. 7 R Resources ● CRAN: – http://cran.r-project.org ● Quick-R: a book – http://www.statmethods.net/ ● R bloggers (platform): not a social network – R-Bloggers is about empowering bloggers to empower other R users – R-Bloggers.com is a blog aggregator of content contributed by bloggers who write about R (in English) – https://www.r-bloggers.com/
  • 8. 8 Installation of R: Ubuntu ● Run from terminal: – sudo apt-get install r-base r-base-dev ● If this doesn’t work, then you need – To add the repositories:  sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list – Add the keyring:  gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9  gpg -a --export E084DAB9 | sudo apt-key add - – Install R-Base  sudo apt-get update; sudo apt-get install r-base r-base-dev ● You can install from a PPA which has the most recent versions – Add the PPA  sudo add-apt-repository ppa:marutter/rrutter – Install R-Base  sudo apt-get update; sudo apt-get install r-base r-base-dev
  • 9. 9 Installation of R: Windows ● Visit CRAN – https://cran.r-project.org/ ● CRAN: Comprehensive R Archive Network
  • 10. 10 Installation of R: Windows Click/Select Download R for Windows
  • 11. 11 Installation of R: Windows Then click/select base or install R for the first time
  • 12. 12 Installation of R: Windows ● Then click/select Download R X.X.X for Windows ● After the download has finished, locate the downloaded file and install.
  • 14. 14 RStudio: Introduction ● RStudio is a set of integrated tools designed to help you be more productive with R. ● How? – It includes a console, – syntax-highlighting editor that supports direct code execution, – a variety of robust tools for  plotting,  viewing history,  debugging and  managing your workspace.
  • 15. 15 RStudio: Installation ● From the RStudio home page, go to Products then select RStudio – Then scroll down and click Download RStudio Desktop – Then click Download under RStudio Desktop Personal License. – Select RStudio for your platform. Clicking on the link will download the file directly. – Locate the file in your system Downloads folder and start the installation.
  • 16. 16 RStudio: Parts The Console is where you write and run code interactively The Files tab shows all the files and folders in your default workspace as if you were on a PC/Mac window. The Plots tab will show all your graphs. The Packages tab will list a series of packages or add-ons needed to run certain processes. For additional info see the Help tab The Environment tab shows all the active objects The History tab shows a list of commands used so far
  • 17. 17 RStudio: Working Directory ● It is important to organize all files for a particular project under one main/parent directory ● A working directory in RStudio is where all the files for a particular project are stored ● All paths used in the console to load data files and scripts are relative to the working directory.
  • 18. 18 ● To set the working directory: – Start RStudio the same way you start other programs in your computer – From the File menu options select New Project then select New Directory then Empty Project then type the directory name (rprogramming) then under create project as subdirectory of click Browse and select Desktop ● RStudio: Working Directory
  • 19. 19 R: Getting Started ● A few basic commands to test them on the console – getwd(): get current working directory – setwd(“/path/to/directory”): set a working directory to the specified path – install.packages(“package_name”): install a package. Requires internet connection – library(package_name), require(package_name): load and attach add-on packages – ?object: provide documentation/help for an object. e.g. ?mtcars – summary(object): provide a summary of an object like a dataset e.g. summary(mtcars) ● Everytime you run library(package_name) and get an error “there is no package called ‘package_name’”, you will need to install it first then call library on it.
  • 20. 20 Data Visualizations in R: Introduction ● R has different systems (packages) for making graphs (visualizations) ● For this case we are going to use ggplot2 which is more elegant and versatile compared to many others. (ggvis, rgl, htmlwidgets, googleVis, etc.) ● Ggplot2 is built upon the “ The Layered Grammar of Graphics”
  • 21. 21 Data Visualizations in R: Tidyverse ● Tidyverse is a set of packages – The packages work in harmony  Reason: they share common data representations and API design. ● The tidyverse package makes it easy to install and load core packages from it in a single command ● To install run: install.packages(“tidyverse”) ● To use it run: library(tidyverse)which loads tidyverse core packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr. – Google each one of these packages to learn what they do
  • 22. 22 Data Visualizations: First Steps ● library(tidyverse) loads all the core packages from tidyverse ● The library() function also tells any conflicts with base R or other packages that arise from loading the named package. ● e.g. for this case filter() and lag() are functions from tidyverse that conflict with similar functions from dplyr and stats packages ● In this case you may need to call a function explicitly from a package in the form. package::function() ● e.g. ggplot2::ggplot() calls the ggplot function from ggplot2 package.
  • 23. 23 ● Which is more fuel efficient: cars with big engines or cars with small engines? ● The mpg data frame: – Data Frame: is a rectangular collection of variables in columns and observations in rows  The mpg data frame in ggplot2 contains observations collected by the US Environment Protection Agency on 38 models of cars. ● Run (from console) ?mpg to learn more about the data set. Data Visualizations: First Steps
  • 24. 24 First Steps Creating a ggplot ● To answer the question about fuel efficiency plot fuel consumption (hwy: y-axis) against engine size (displ: x-axis) ● See the magic of this command: – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
  • 25. 25 First Steps Creating a ggplot > ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) A negative relationship between engine size (displ) and fuel efficiency (hwy) means Cars with bigger engines use more fuel.
  • 26. 26 Creating a ggplot ● In ggplot2, – You begin with the function ggplot()  ggplot() creates a coordinate system that you can add layers onto.  The first argument is the data set that you are going to use for plotting – To complete the graph add more layers to the coordinate system created by ggplot()  geom_point() function adds a layer of points to plot (which creates a scatter plot for this case)  Each function in ggplot2 takes a mapping argument which defines how variables are mapped to visual properties.  The mapping argument is always paired with aes() – The x and y arguments of aes() specify which variables to map to the x and y axes. – ggplot2 looks for the mapped variable in the data argument, in this case, mpg
  • 27. 27 Creating a ggplot: Template ● A graphing template for ggplot ● You can get a list of <GEOM_FUNCTION>s by following this link (http://docs.ggplot2.org/current/)
  • 28. 28 ggplot: Aesthetics Mappings ● Look at the graph and note the circled dots ● What is special with these big engine cars?
  • 29. 29 ggplot: Aesthetics ● Ggplot Aesthetic mappings can help answer the question ● An aesthetic is a visual property of the objects in a plot. – These are things like size, shape or color of points. ● You can therefore display a point in different ways by changing the values of its aesthetic properties. ● You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset. – e.g. you can map the colors of your points to the class variable to reveal the class of each car.
  • 30. 30 ggplot: Aesthetics ● New plot with aesthetics for class: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class)) ● Try for year and manufacturer and look at the trends
  • 31. 31 ggplot: Aesthetics ● Other aesthetics: – Size: for ordered variables, so each point reveals its attribute size – Alpha: controls the transparency of the points – Shape: points will be of different shapes  Exercise: try plotting the same geom with these different aesthetics ● ggplot2 takes care of selecting a reasonable scale to use with the aesthetic and constructs a legend
  • 32. 32 ggplot: Aesthetics ● The aesthetic properties of a geom can be set manually. – For example:  ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue") – Will set all points to blue – Note color is outside the aes()
  • 34. 34 ● When the data has categorical variables, it is possible to split the plot into facets. ● Facets are subplots that each displays a subset of data. ● To plot facets, with a single variable, use the function facet_wrap(formula, …) – formula is created with ~ variable-name – formula is the name of a data structure in R, not a synonym for equation. – The variable (variable-name) should be discrete. ggplot: Facets
  • 35. 35 ggplot: Facets ● For example: – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color=”red”) + facet_wrap(~ class, nrow = 3) ● This will produce a plot for each element in mpg.class, and the plot will display in three rows.
  • 36. 36 ggplot: Facets ● Can we facet the plot using two discrete variables: ● Do this: – ?facet_grid – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)  In the plot, why do we have empty sub-plots? ●
  • 37. 37 ggplot: Facets ● Hack: – With facet grid, what happens when you use a . at the place of one variable? – Is there an advantage of faceting over the color aesthetic? Any disadvantages? What is the dataset is very large? – In facet_wrap() what do nrow or ncol do? – When using facet_grid() put the variable with more unique levels in the columns (RHS of formula), why?  Why doesn’t facet_grid() have nrow, and ncolumn 
  • 38. 38 ggplot2::Geometric objects (geoms) ● These are the geometric objects used to represent the data. – e.g. bar geoms, point geoms, line geoms, smooth geoms, etc. ● To change the geom in your plot, change the geom function (geom_xxx()) ● For example: – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) – ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy)) ● Not every aesthetic works with every geom – e.g. you can’t set a shape of a line but of a point – Read: ?geom_point, ?geom_smooth
  • 39. 39 ggplot2: geoms ● ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) ● Try: – ggplot(data = mpg) + geom_line(mapping = aes(x = displ, y = hwy, linetype = drv))
  • 40. 40 ggplot2: geoms ● Plot: – ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv)) – ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y – hwy, group = drv))  What is the difference? Which is better? Why?
  • 41. 41 Ggplot2: combined geoms ● Can we use more than one geoms on the same plot? ● Try: – ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy)) ● When using multiple geoms on the same plot you can use global mappings: – ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()  Which makes the code easy to read and modify.
  • 42. 42 ggplot2: combined geoms ● When you use global mappings and set some mappings in a geom function, these mappings will be treated as local to this layer only. ● For example: – ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth()
  • 43. 43 ggplot2: combined geoms ● In the same way, you can specify different data for each layer. – Say you only want to fit a smooth line for one class of cars – ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE) – Hack:  can we plot more than one of the same geom? – Try a smooth geom with different car class
  • 46. 46 Ggplot2: geoms ● How many geoms does ggplot2 have? – Visit this page: https://www.rstudio.com/resources/cheatsheets/  Look for Data Visualization Cheat Sheet ● ggplot2 extensions provide more geoms to use. Take a look at available extensions from this gallery (http://www.ggplot2-exts.org/gallery/) ●
  • 47. 47 ggplot2: statistical transformations ● Read: ?diamonds – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) – Where does count come from?
  • 48. 48 Statistical Transformations ● Some plots plot raw values – e.g. scatterplots, ● Some plots use calculated values – bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin. – smoothers fit a model to your data and then plot predictions from the model. (Remember regression lines) – boxplots compute a robust summary of the distribution and then display a specially formatted box. –
  • 49. 49 Statistical Transformation ● The algorithm used to calculate new values for a graph is called a stat, (Statistical Transformation) ● You can check which stat is used by default by looking at the default value of stat. – geom_bar() uses count. Thus you can recreate the bar chart by running  ggplot(data = diamonds) + stat_count(mapping = aes(x = cut)) ● Every geom has a default stat; and vice-versa. This means that you can typically use geoms without worrying about the underlying statistical transformation.
  • 50. 50 Statistical Transformation ● You can explicitly specify a stat: ● When you want to override the default stat  e.g. Run demo <- tribble( ~a, ~b, "bar_1", 20, "bar_2", 30, "bar_3", 40 )  Then run ggplot(data = demo) + geom_bar(mapping = aes(x = a, y = b), stat = "identity")
  • 51. 51 Statistical Transformation ● Reasons to explicitly specify a stat: cntd – You want to override the default mapping from transformed variables to aesthetics.  ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1)) – This will draw a bar chart of proportion instead of count
  • 52. 52 Position Adjustments ● A bar chart can be colored in either of two ways: color and fill. – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut)) – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))
  • 53. 53 Position Adjustments ● Check how the following plots will look like – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity)) – ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + geom_bar(alpha = 1/5, position = "identity") – ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + geom_bar(fill = NA, position = "identity") – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") – ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
  • 54. 54 Position Adjustments ● Learn more about position adjustments – ?position_dodge, – ?position_fill, – ?position_identity, – ?position_jitter – ?position_stack
  • 55. 55 Position Adjustments: overplotting ● Recall: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) – The plot appears to show fewer points than the 234 observations in the data (can you count them?) – The values of displ and hwy are rounded, so the points fall on a grid and many of them overlap. This problem is called overplotting. ● You can avoid this gridding by setting the position adjustment to "jitter" – position = "jitter" adds a small amount of random noise to each point – Since two points are very unlikely to receive the same amount of noise, the points spread out. ● Jittering makes the graph less accurate at small scales, but more revealing at large scales. ● In ggplot2 the shorthand for geom_point(position = "jitter") is geom_jitter()
  • 56. 56 Position Adjustments: jitter ● ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
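The effect of jittering can be seen without drawing anything. The sketch below uses base R's jitter() on a few made-up rounded values (hypothetical numbers, not the mpg data) and counts how many distinct points would be visible before and after adding noise. It illustrates the idea only and is not a description of how ggplot2 implements position = "jitter".

```r
set.seed(42)  # make the random noise reproducible

# Rounded measurements: several observations share the same coordinates
displ <- c(1.8, 1.8, 2.0, 2.0, 2.0, 2.8)
hwy   <- c(29, 29, 31, 31, 31, 26)

# Without jitter only the unique (x, y) pairs are visible
n_visible <- nrow(unique(data.frame(displ, hwy)))

# jitter() adds a small amount of uniform random noise to each value
displ_j <- jitter(displ)
hwy_j   <- jitter(hwy)
n_visible_jittered <- nrow(unique(data.frame(displ_j, hwy_j)))

cat("distinct points before:", n_visible, "after:", n_visible_jittered, "\n")
```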
  • 58. 58 Working with Data ● In this part we are going to learn how to work with your data. – Getting data  Importing your own data  Tidying data – How to work with different data types:  Relational data,  Strings,  Factors,  Dates and Times
  • 59. 59 Importing Data ● For importing files, we will use the readr package, which is one of the core tidyverse packages. ● Most readr functions turn flat files into data frames. A data frame is a tabular data format with rows and columns. It is a list of vectors of equal length. – read_csv(): reads comma-separated files – read_csv2(): reads semicolon-separated files – read_tsv(): reads tab-delimited files – read_delim(): reads files with any delimiter ● Activity: – Check what read_table(), read_fwf() and read_log() do.
  • 60. 60 Importing Data: read_csv() ● The first argument is the path to the file to read – read_csv("data/students.csv") ● read_csv() prints out a column specification ● read_csv() by default uses the first row as the column names – You can use skip = n to skip the first n lines if they contain data you don't need (most likely metadata) – You can use comment = "#" to drop all lines that start with #, for example – Use col_names = FALSE so that read_csv() doesn't treat the first row as the column names ● Missing values in R are represented by NA. By default read_csv() treats empty fields and the string NA as missing. When loading files where missing values are written differently, use the na argument, e.g. na = "." if missing values are marked with a period. – What will this line do? read_csv("students.csv", skip = 2, comment = "//", col_names = FALSE, na = "-")
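The skip, col_names and na arguments described above can be tried on a throwaway file. This is a sketch under stated assumptions: the file contents, the student names and the "-" missing-value marker are all made up for illustration, and the file is written to a temporary path.

```r
library(readr)

# Write a small hypothetical file: two metadata lines, a header, then data
path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported by a hypothetical registry",
  "a dash means the value was not recorded",
  "name,age",
  "Asha,21",
  "Juma,-"
), path)

# Skip the two metadata lines and treat "-" as a missing value
students <- read_csv(path, skip = 2, na = "-")
print(students)

# Same file, but without treating the first remaining row as column names;
# readr then names the columns X1, X2 and the header becomes a data row
raw <- read_csv(path, skip = 2, col_names = FALSE, na = "-")
print(raw)
```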
  • 61. 61 Importing Data: Parsing ● The parse_*() functions: – ?parse_logical, ?parse_integer, ?parse_date ● The parse functions take in a character vector and return a more specialized vector. – Character strings can contain anything, all letters and numbers, e.g. "dLab", "2013", "xyz3", "12.09" – A specialized vector contains, say, only integers, only decimal numbers, or only dates, and this is what the parse functions do: return a vector of one specific type ● A vector in R is a sequence of values of the same type, created with c() – For example names <- c("John", "Jean", "Giovanni", "Joni") dates_of_birth <- c("2012-12-31", "1988-05-02", "1990-01-06")
  • 62. 62 Importing Data: Parsing ● What happens to the following? parse_integer(c("1", "231", ".", "456"), na = ".") x <- parse_integer(c("123", "345", "abc", "123.45")) ● parse_logical() and parse_integer() parse logicals and integers respectively. There is very little that can go wrong with these parsers, so they are not described further here. ● parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways. ● parse_character() seems so simple that it shouldn't be necessary. But one complication makes it quite important: character encodings. ● parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values. ● parse_datetime(), parse_date(), and parse_time() allow you to parse various date and time specifications. These are the most complicated because there are so many different ways of writing dates.
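A few of the parsers listed above, applied to made-up inputs. This is a small sketch; the strings and the "n/a" missing-value marker are illustrative assumptions, not values from the course data.

```r
library(readr)

# parse_integer(): strings listed in `na` become NA instead of a parsing failure
ages <- parse_integer(c("18", "n/a", "25"), na = "n/a")

# parse_number() ignores non-numeric characters around the number
price <- parse_number("$100")
share <- parse_number("20%")

# parse_double() is strict, but a locale can change the decimal mark
height <- parse_double("1,75", locale = locale(decimal_mark = ","))

# parse_factor() maps strings onto a fixed, known set of levels
fruit <- parse_factor(c("apple", "banana"),
                      levels = c("apple", "banana", "cherry"))

print(ages)
print(c(price, share, height))
print(fruit)
```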
  • 63. 63 Importing Data: parsing ● One important thing to note when parsing characters is the encoding. UTF-8 is the most common encoding, and specifying it explicitly may save you hours of fixing problems: x <- "El Niño was particularly bad this year" parse_character(x, locale = locale(encoding = "utf-8")) ● ?parse_datetime, ?parse_date, ?parse_time ● Generate correct format strings to parse each of the following dates and times – d1 <- "January 1, 2010" – d2 <- "2015-Mar-07" – d3 <- "06-Jun-2017" – d4 <- c("August 19 (2015)", "July 1 (2015)") – d5 <- "12/30/14" # Dec 30, 2014 – t1 <- "1705" – t2 <- "11:15:10.12 PM"
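Format strings are built from pieces such as %d (day), %m (month number), %B (full month name), %y (two-digit year) and %Y (four-digit year). The sketch below applies them to dates that are deliberately different from the exercise above, so the exercise is still left to solve.

```r
library(readr)

# ISO 8601 dates (year-month-day) need no format string
d_iso <- parse_date("2010-10-01")

# The same text means different dates depending on the format string
d_dmy <- parse_date("01/02/15", "%d/%m/%y")   # 1 February 2015
d_mdy <- parse_date("01/02/15", "%m/%d/%y")   # 2 January 2015

# %B matches a full month name, %Y a four-digit year
d_name <- parse_date("05 March 2016", "%d %B %Y")

# Times in hour:minute:second form are also recognized without a format
t_plain <- parse_time("20:10:01")

print(c(d_iso, d_dmy, d_mdy, d_name))
print(t_plain)
```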
  • 64. 64 Importing Data: parsing files ● example_file <- read_csv(readr_example("challenge.csv")) ● Use the problems() function to look at any issues with the import – problems(example_file) ● Specify the column types explicitly when reading the file example_file <- read_csv(readr_example("challenge.csv"), col_types = cols( x = col_double(), y = col_date() ) ) ● Use tail(dataframe, n = X) and head(dataframe, n = X) to look at the last and first X rows of the data frame.
  • 65. 65 Parsing files ● One more strategy to get the column types is to use the guess_max option when reading in a file. example_file2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
  • 66. 66 Writing to a file ● To save a data frame to a file you can use write_csv() (comma-separated) or write_tsv() (tab-separated), where you need to specify – The data frame you are saving – The file path (location) where to save it – Optionally: how missing values are written, with na; whether to append to an existing file, with append = TRUE
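A round trip through write_csv() and read_csv() shows the na and append options in action. This is a minimal sketch with a made-up data frame written to a temporary file; the names and scores are hypothetical.

```r
library(readr)

# Hypothetical data frame with one missing value
scores <- data.frame(name = c("Asha", "Juma"), score = c(71, NA))
path <- tempfile(fileext = ".csv")

# Write the data, spelling missing values as "-" in the file
write_csv(scores, path, na = "-")

# Append one more row to the existing file (the header is not repeated)
write_csv(data.frame(name = "Neema", score = 64), path, append = TRUE)

# Read it back, telling readr how missing values were written
scores_back <- read_csv(path, na = "-")
print(scores_back)
```

Note that a CSV file stores only text, so column types are guessed again when the file is read back.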
  • 67. 67 Parsing Files ● Group Activity – Download the dataset: Number of Trainees with Special Needs enrolled in Vocational Training Centres from http://opendata.go.tz – Read it into a data frame and do some manipulations, including making some plots – Inspect read_rds() and write_rds() and see where you can use these functions – Explore these packages: haven, readxl, DBI
  • 68. 68 Tidy Data ● A tidy dataset has these features – Each variable is in its own column – Each observation is in its own row – Each value is in its own cell ● ?gather, ?spread ● Missing Values: – Can be explicit: stated with NA – Can be implicit: simply not present in the data ● gather(…, na.rm = TRUE) drops explicit missing values while gathering, turning them into implicit ones ● You can use the complete() function to make implicit missing values explicit in tidy data. – ?complete
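The difference between explicit and implicit missing values is easiest to see on a tiny table. The sketch below assumes a made-up table of case counts with one column per year (hypothetical countries A and B); it uses gather() as in the slides.

```r
library(tidyr)
library(tibble)

# Hypothetical case counts: one column per year, so the table is not tidy
counts <- tibble(
  country = c("A", "B"),
  `2015`  = c(10, NA),
  `2016`  = c(12, 7)
)

# gather() moves the year columns into key/value pairs;
# na.rm = TRUE drops the row for the missing B/2015 value
long <- gather(counts, `2015`, `2016`,
               key = "year", value = "cases", na.rm = TRUE)

# complete() adds back every country/year combination, filling gaps with NA
full <- complete(long, country, year)

print(long)
print(full)
```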
  • 69. 69 Case Study ● Optionally download the data from http://www.who.int/tb/country/data/download/en/ ● Load the data from the file or from the package: tidyr::who ● Looking at the data: – country, iso2, and iso3 are redundant: all three identify the country – year is clearly a variable – The other columns have unclear names; look them up in the data dictionary
  • 70. 70 Case Study continued ● Gather all the other columns, removing all missing values – who1 <- who %>% gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE) ● Look at the structure of the values in the new key column by counting – who1 %>% count(key) – Use the data dictionary for the definition of the keys ● Make the key names consistent by replacing newrel with new_rel – who2 <- who1 %>% mutate(key = stringr::str_replace(key, "newrel", "new_rel")) ● Separate the key variable into different columns – who3 <- who2 %>% separate(key, c("new", "type", "sexage"), sep = "_") ● Look at the new column – who3 %>% count(new) ● Drop the new column because it is constant – who4 <- who3 %>% select(-new) ● Separate sexage into sex and age – who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
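The two separate() calls are easier to follow on a couple of key values than on the whole dataset. This sketch assumes two keys in the format used by the who columns (after the newrel fix) and made-up case counts.

```r
library(tidyr)
library(tibble)

# Two keys in the who format; the case counts are hypothetical
keys <- tibble(key = c("new_sp_m014", "new_rel_f65"), cases = c(3, 5))

# A character separator splits at every underscore: three pieces per key
step1 <- separate(keys, key, c("new", "type", "sexage"), sep = "_")

# A numeric separator is a position: sep = 1 splits after the first character
step2 <- separate(step1, sexage, c("sex", "age"), sep = 1)

print(step2)
```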
  • 72. 72 Writing Code in R ● Create new objects with <- using the format object_name <- object_value ● The <- symbol is the assignment operator ● Examples: – first_name <- "Sovello" – date.of.birth <- "12/31/1980" – PlaceOfBirth <- "Njombe" – AGE <- 37 – x = 200 * 5 ● Object names must start with a letter. ● Object names can only contain letters, numbers, underscore (_), and period (.) – Look at the examples above
  • 73. 73 Writing code in R ● You can look at what is in an object by typing its name ● You can also print an object explicitly – print(first_name) [1] "Sovello" – The [1] shown in the output indicates that first_name is a vector and "Sovello" is its first element.
  • 74. 74 Writing code in R ● All values that are not numbers must be enclosed in double or single quotes ("value" or 'value') – Look at the definition of place.of.birth in the screenshot ● Typos matter when using object names. R is case sensitive, so surname and Surname are not the same object. ● The # character indicates a comment. Anything to the right of # is ignored by R ● There is no multi-line comment syntax; start each comment line with #
  • 75. 75 Group Exercise (5min) ● What is wrong with this code snippet? Surname <- "Mkulima" surname ● If you start typing a value for an object and press enter before the closing quote or parenthesis, the console will look like college <- "College of informatics + – A + means R is waiting for you to continue typing. What would you do to fix, stop or escape from the problem? ● Fix the errors in this piece of code until it works library(tidyverse) ggplot(dota = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) fliter(mpg, cyl = 8)
  • 76. 76 R Objects ● R has five basic atomic classes of objects – Character – Numeric (real numbers) – Integer – Complex – Logical (TRUE/FALSE) ● The most basic object in R is a vector. An empty vector can be created with vector() ● A vector can only contain objects of the same type. ● Numbers are generally treated as numeric objects – If you want an integer, you have to explicitly add the L suffix. – 1L is an integer – 1 is a real number
  • 77. 77 R Objects ● Inf is a special number which represents infinity. – You can use Inf in calculations like 1/Inf ● Creating vectors ● Use the c() function to create vectors > x <- c(0.5, 0.6) ## numeric > x <- c(TRUE, FALSE) ## logical > x <- c(T, F) ## logical > x <- c("a", "b", "c") ## character > x <- 9:29 ## integer > x <- c(1+0i, 2+4i) ## complex
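The class of each kind of vector can be checked with class(). The sketch below also shows what happens when different types are mixed inside c(): because a vector can hold only one type, R silently converts the values to the most general one.

```r
num <- 1      # a real number
int <- 1L     # the L suffix requests an integer

# One class per atomic type
classes <- c(class(num), class(int), class("a"), class(TRUE), class(1+0i))
print(classes)

# Mixing types in c() coerces everything to the most general type
mixed_chr <- c(1.7, "a")    # numbers become character strings
mixed_num <- c(TRUE, 2)     # TRUE becomes 1
print(mixed_chr)
print(mixed_num)

# Inf behaves like a number in arithmetic
zero <- 1 / Inf
print(zero)
```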
  • 78. 78 Coercion of R objects ● You can explicitly coerce objects using the as.* functions. ? as.integer, ?as.character, ?as.logical, ?as.numeric > x <- 0:6 > class(x) [1] "integer" > as.numeric(x) [1] 0 1 2 3 4 5 6 > as.logical(x) [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE > as.character(x) [1] "0" "1" "2" "3" "4" "5" "6" ● If R fails to coerce an object, it produces NAs. > x <- c("a", "b", "c") > as.numeric(x) Warning: NAs introduced by coercion [1] NA NA NA > as.logical(x) [1] NA NA NA > as.complex(x) Warning: NAs introduced by coercion [1] NA NA NA
  • 79. 79 R Objects: Matrices ● Matrices are vectors with a dimension attribute. ● The dimension is an integer vector of length 2 (number of rows, number of columns) > m <- matrix(nrow = 2, ncol = 3) > m [,1] [,2] [,3] [1,] NA NA NA [2,] NA NA NA > dim(m) [1] 2 3 > attributes(m) $dim [1] 2 3
  • 80. 80 Matrices ● Matrices are constructed column-wise, so entries start at the "upper left" corner and run down the columns > m <- matrix(1:6, nrow = 2, ncol = 3) > m [,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6 ● You can create matrices from vectors by adding a dimension attribute > m <- 1:10 > m [1] 1 2 3 4 5 6 7 8 9 10 > dim(m) <- c(2, 5) > m [,1] [,2] [,3] [,4] [,5] [1,] 1 3 5 7 9 [2,] 2 4 6 8 10 ● Every element of a matrix must be of the same class (e.g. all integers or all numeric).
  • 81. 81 Group work ● What do cbind() and rbind() do? ● Create 3 vectors and 3 matrices. ● Create 3 matrices from vectors ● Create 2 matrices using cbind() and rbind() ● Read about R lists: how to create using list()
  • 82. 82 R Objects: Factors ● Factors represent categorical data ● Factors can be ordered or unordered ● Factor objects can be created with the factor() function > x <- factor(c("yes", "yes", "no", "yes", "no")) > x [1] yes yes no yes no Levels: no yes > table(x) x no yes 2 3
  • 83. 83 Factors ● Say you want to sort a vector > x1 <- c("Dec", "Apr", "Jan", "Mar") > sort(x1) [1] "Apr" "Dec" "Jan" "Mar" ● The sort is alphabetical, but the target was to see the months in calendar order: Jan, Mar, Apr, Dec ● To solve this problem we can make use of factors – Create a vector of months in calendar order month_levels <- c( "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" ) ● Then create a factor that uses the months as its levels. > y1 <- factor(x1, levels = month_levels) ● Applying sort to the new variable produces a result in order of months > sort(y1)
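Running the steps above end to end shows the difference between the two sort orders. The last two lines are an addition for illustration: a value that is not among the levels (here the deliberate typo "Jam") becomes NA.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Plain character vectors sort alphabetically
alphabetical <- sort(x1)

# Factors sort by the order of their levels
y1 <- factor(x1, levels = month_levels)
calendar <- as.character(sort(y1))

print(alphabetical)
print(calendar)

# A value that is not one of the levels is turned into NA
y2 <- factor(c("Dec", "Jam"), levels = month_levels)
print(y2)
```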
  • 84. 84 R Objects: missing values ● Missing values are denoted by NA; NaN ("not a number") denotes the result of an undefined mathematical operation such as 0/0 – is.na() tests whether the elements of an object are NA – is.nan() tests for NaN ● NA values have a class also, so there are integer NA, character NA, etc. ● A NaN value is also NA but the converse is not true > ## Create a vector with NAs in it > x <- c(1, 2, NA, 10, 3) > ## Return a logical vector indicating which elements are NA > is.na(x) [1] FALSE FALSE TRUE FALSE FALSE > ## Return a logical vector indicating which elements are NaN > is.nan(x) [1] FALSE FALSE FALSE FALSE FALSE ● What is the difference between a missing value (NA) and zero?
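A vector containing both NA and NaN makes the "NaN is also NA, but not the converse" rule concrete. The second half of this sketch uses made-up numbers to hint at the closing question: a missing value is unknown, while zero is a known value that takes part in calculations.

```r
x <- c(1, NaN, NA, 0)

na_flags  <- is.na(x)    # TRUE for both NaN and NA
nan_flags <- is.nan(x)   # TRUE for NaN only
print(na_flags)
print(nan_flags)

# NA is "unknown", so it propagates unless it is removed explicitly
mean_with_na    <- mean(c(4, NA, 8))                # NA
mean_without_na <- mean(c(4, NA, 8), na.rm = TRUE)  # average of 4 and 8

# Zero is a real value and is included in the average
mean_with_zero <- mean(c(4, 0, 8))

print(c(mean_with_na, mean_without_na, mean_with_zero))
```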
  • 85. 85 R Objects:Data Frames ● Data frames store tabular data in R ● Data frames are represented as a special type of list where every element of the list has to have the same length. ● Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. ● Unlike matrices, data frames can store different classes of objects in each column.
  • 86. 86 Data Frames
    > x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
    > x
      foo   bar
    1   1  TRUE
    2   2  TRUE
    3   3 FALSE
    4   4 FALSE
    > nrow(x)
    [1] 4
    > ncol(x)
    [1] 2
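That a data frame is a list of equal-length columns, each with its own class, can be checked directly. A minimal sketch with a made-up data frame (the names and ages are hypothetical); converting it to a matrix shows the contrast, since a matrix can hold only one type.

```r
people <- data.frame(name = c("Asha", "Juma"), age = c(21L, 34L))

# A data frame is a special kind of list ...
is_list <- is.list(people)

# ... whose elements are the columns, each with its own class
col_classes <- sapply(people, class)
print(col_classes)

# The length of each column equals the number of rows
col_lengths <- sapply(people, length)

# A matrix holds a single type, so everything is converted to character
m <- as.matrix(people)
print(m)
```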
  • 87. 87 Writing Code in R ● Scripts: – Turning interactive code into scripts
  • 88. 88 Data Transformation ● Filter rows with filter() – Comparisons: >, >=, <, <=, !=, ==. Beware of floating-point rounding: sqrt(2) ^ 2 == 2 is FALSE; use near() instead – Logical operators: and &, or |, not !. x %in% y is shorthand for a chain of | comparisons, e.g. 2 %in% c(1, 2, 3, 4) – To detect missing values use is.na(x) ● Ordering: use arrange()
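The operators above combine naturally inside filter(). The sketch below assumes dplyr is installed and uses a small made-up table of vehicles instead of a real dataset.

```r
library(dplyr)

# Hypothetical vehicles; one highway mileage value is missing
vehicles <- data.frame(
  model = c("a", "b", "c", "d"),
  cyl   = c(4, 8, 8, 6),
  hwy   = c(30, 17, NA, 24)
)

# Comparison: keep only the eight-cylinder rows
eight <- filter(vehicles, cyl == 8)

# %in% replaces cyl == 6 | cyl == 8; !is.na() drops the missing mileage
big <- filter(vehicles, cyl %in% c(6, 8), !is.na(hwy))

# arrange() orders rows; desc() reverses the order
sorted <- arrange(big, desc(hwy))
print(sorted)

# Floating-point rounding: exact equality fails, near() allows a tolerance
exact <- sqrt(2)^2 == 2
close <- near(sqrt(2)^2, 2)
print(c(exact, close))
```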
  • 89. 89 Reading Data: large datasets ● With much larger datasets, there are a few things you can do that will make your life easier and prevent R from choking. – Read the help page for read.table, which contains many hints – Estimate the memory the dataset will need; if it is more than the RAM on your computer, stop – Set comment.char = "" if there are no commented lines in your file. – Use the colClasses argument. Specifying this option instead of using the default can make read.table run much faster, often twice as fast. You have to know the class of each column – Set nrows. This doesn't make R run faster but it helps with memory usage.
  • 90. 90 Reading large datasets ● A quick way to figure out the classes of each column is the following: > initial <- read.table("datatable.txt", nrows = 100) > classes <- sapply(initial, class) > tabAll <- read.table("datatable.txt", colClasses = classes)
  • 91. 91 Control Structures ● Control structures allow you to control the flow of execution of a series of R expressions. ● Control structures allow you to put some "logic" into R code, rather than always executing the same R code every time. ● Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.
  • 92. 92 Control Structures: if-else ● The if-else structure allows you to test a condition and act on it depending on whether it's true or false – You can use the if statement on its own if(<condition>) { ## do something } ## Continue with rest of code ● Or use the complete if-else if(<condition>) { ## do something } else { ## do something else } ● You can have a series of tests by following the initial if with any number of else ifs. if(<condition1>) { ## do something } else if(<condition2>) { ## do something different } else { ## do something different }
  • 93. 93 Example: if-else ● ## Generate a uniform random number x <- runif(1, 0, 10) if(x > 3) { y <- 10 } else { y <- 0 } ● This is the same as executing y <- if(x > 3) { 10 } else { 0 }
  • 94. 94 Control Structures: for ● For loops are the most commonly used looping construct in R for(x in sequence) { ## Execute code } ● For one-line loops, the curly braces are not strictly necessary. > x <- c("a", "b", "c", "d") > for(i in 1:4) print(x[i]) [1] "a" [1] "b" [1] "c" [1] "d"
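Two common variations on the loop above are iterating with seq_along() instead of a hard-coded 1:4, and looping over the elements directly. The sketch below also shows, as an added illustration, why seq_along() is the safer choice: on an empty vector 1:length(x) counts 1, 0 and still runs the loop body twice.

```r
x <- c("a", "b", "c", "d")

# seq_along(x) generates 1, 2, ..., length(x)
for (i in seq_along(x)) print(x[i])

# Looping over the elements themselves avoids the index altogether
for (letter in x) print(letter)

# With an empty vector, seq_along() gives no iterations at all ...
empty <- character(0)
n_safe <- 0
for (i in seq_along(empty)) n_safe <- n_safe + 1

# ... while 1:length(empty) is 1:0, i.e. the two values 1 and 0
n_unsafe <- 0
for (i in 1:length(empty)) n_unsafe <- n_unsafe + 1

print(c(n_safe, n_unsafe))
```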
  • 95. 95 Control Structures: while ● While loops begin by testing a condition ● If it is true, the loop body is executed and the condition is tested again; this repeats until the condition is false > count <- 0 > while(count < 10) { print(count) count <- count + 1 }
  • 96. 96 Control Structures: next ● Next is used to skip an iteration of a loop for(i in 1:100) { if(i <= 20) { ## Skip the first 20 iterations next } ## Do something here }
  • 97. 97 Control Structures: break ● Break is used to exit the loop immediately, regardless of which iteration the loop may be on. for(i in 1:100) { print(i) if(i > 20) { ## Stop the loop once i exceeds 20 break } }
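next and break are often used together in one loop. A small sketch with made-up limits: it adds up the odd numbers below 10 by skipping even values with next and leaving the loop with break as soon as an odd number above 9 is reached.

```r
total <- 0

for (i in 1:100) {
  if (i %% 2 == 0) {
    next          # even number: skip the rest of this iteration
  }
  if (i > 9) {
    break         # first odd number above 9: leave the loop entirely
  }
  total <- total + i   # reached only for 1, 3, 5, 7, 9
}

print(total)
print(i)   # the value of i at the moment the loop stopped
```

The loop is written for 1:100 but stops at i = 11, so the remaining iterations never run.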