R is a fun and versatile language for statistical analysis, visualization, and data exploration. The target audience is software engineers and programmers who can code comfortably in another language. This lesson emphasizes data structures and is light on analysis examples (to be covered at a later date), but you will be exposed to the basic concepts and commands. Email me for the pptx file, which has notes.
2. 2
•What is R?
•Data Structures and Types
•Syntax
•Statistics
•Visualizations
•File I/O
•Packages
•Finding Help
•More Syntax & Common Functions
OVERVIEW
3. 3
• Presentation steps thru code and responses (Quick Starts)
user.input() Input at the prompt, typed by you
#> console output Results produced after hitting enter
• Followed by slides with functions and descriptions (Basics)
... More data was produced, but not displayed here for space
### Comment, note, tip
FORMAT
5. 5
• R, the [interpreted] language
• 800,000 lines of code
• 45% C
• 19% R
• 17% Fortran
• R, the implementation(s)
• GNU-R is the most popular implementation
• Open source version (GNU) of the S language and environment
• S was developed at Bell Labs by John Chambers, et al.
• Licensed under the GNU General Public License (GPL)
https://www.r-project.org/about.html
WHAT IS R?
12. 12
• R is easy to learn, intuitive
• R was made for statistics
• R makes great graphics, pub quality
• R is optimized to work with tabular data structures
• R – it’s fast
• R is versatile – thousands of packages on CRAN alone
• R is open source
BUT…
• Memory limitations***
• Some data wrangling problems are clumsy
WHY R?
…enough!
13. 13
• GNU-R 3.3.2
• https://cran.r-project.org
• Microsoft R Open / MRO
• https://mran.microsoft.com/download/
• RStudio dev environment
• https://www.rstudio.com/
…and many other implementations
HOW TO GET R?
15. 15
• Vector 1-dimensional array of elements of the same kind
• Scalar?
• Matrix 2D array of elements of the same kind
• Array multi-dimensional structure of one kind of value
• List Something that holds something else (e.g. another list)
• Data frame 2D structure of possibly different types of columns
DATA STRUCTURES
16. 16
• Logical TRUE/FALSE or T/F
• Integers …-1,0,1,2,3…
• Double or “numeric” 1.0, 3.14, 4.002e-6
• Character "Hello", '123abc'
Some query and coercion functions
Informative: typeof( )
T/F Testing: is.numeric( )
is.character( )
Coercion: as.numeric("02")
as.character(3.1415)
DATA VALUE TYPES
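A minimal sketch of the query and coercion functions listed above, using only base R. The variable names num and chr are illustrative, not from the slides.

```r
# Ask R what type a value has
typeof(TRUE)        # "logical"
typeof(2L)          # "integer" (the L suffix requests an integer)
typeof(2)           # "double"  (plain numbers are doubles)
typeof("Hello")     # "character"

# TRUE/FALSE tests
is.numeric(3.14)    # TRUE
is.character(3.14)  # FALSE

# Explicit coercion between types
num <- as.numeric("02")      # 2
chr <- as.character(3.1415)  # "3.1415"
```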
17. 17
1:5 ### integer sequence
#> [1] 1 2 3 4 5
x <- 1:5 ### assignment
x ### evaluation. Try: x+x
#> [1] 1 2 3 4 5
y = x^2
print(y)
#> [1] 1 4 9 16 25
y == x ### comparison. What results?
#> [1] TRUE FALSE FALSE FALSE FALSE
QUICK START: NUMERICS
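The "Try: x+x" prompt above can be worked through as follows. This is a small sketch showing that arithmetic and comparison both operate element by element.

```r
x <- 1:5
doubled <- x + x        # element-wise addition: 2 4 6 8 10
y <- x^2                # element-wise power:    1 4 9 16 25
same <- y == x          # element-wise comparison, one TRUE/FALSE per element
which(same)             # positions where y equals x: 1
```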
19. 19
• Command line: Evaluates the line entered, or if the line is incomplete,
it waits for the end of an expression
• Variable names: Consist of numbers, letters, underscores, and periods
• Must start with a letter or a period+letter …CASE sensitive!
• Assignment: <- , = right to left (<- preferred)
-> left to right
• Comparison: == != < > <= >=
SYNTAX BASICS
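A short sketch of the three assignment forms and the naming rules. All variable names here are made up for illustration.

```r
a <- 3            # arrow assignment, right to left
b = 4             # the equals sign also assigns at the top level
a + b -> total    # left-to-right assignment

my.var_2 <- "ok"  # letters, digits, periods, and underscores are allowed
Total <- 0        # a different object from 'total': names are case sensitive

total == 7        # TRUE
total != Total    # TRUE
```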
20. 20
; Ends a statement
x <- 8; print(x)
# Comments out anything following
#but only on that line
( ) Groups expressions, enclosing function arguments
print(2*(6+c((1:4)+x)))
{ } Encloses groups of expressions, loops, if/then
if (x) { print(c("x^2 =", x^2))
print("Done.") }
[ [[ $ Subsets elements of a data object
SYNTAX BASICS
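The punctuation above, put together in one runnable sketch. The named vector v is a hypothetical example used only to show the subsetting operators.

```r
x <- 8; print(x)                # ';' separates two statements on one line
res <- 2 * (6 + c((1:4) + x))   # parentheses group sub-expressions
res                             # 30 32 34 36

if (x > 0) {                    # braces group several statements
  msg <- paste("x^2 =", x^2)
  print(msg)                    # "x^2 = 64"
}

v <- c(a = 10, b = 20, c = 30)  # hypothetical named vector
v[2]                            # second element, name kept
v[["b"]]                        # the value 20, name dropped
```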
21. 21
? <fxn> Opens help page for the command/concept/constant
?? <term> Lists help pages with term in their content
typeof( ) Identifies the data type of the object
str( ) Shows structure of data object, types of cols in df
length( ), nchar( )
Counts elements in vectors, letters in string
dim( ), nrow( ), ncol( )
Returns dimensions of data object
names( ), colnames( ), rownames( )
Displays only list/column/row names
SELF-HELP BASICS
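A quick sketch of the inspection functions above, applied to the built-in mtcars data set. The variable names are illustrative.

```r
str(mtcars[1:3])                  # compact structure of the first three columns

n_elem  <- length(c(10, 20, 30))  # 3 elements in the vector
n_chars <- nchar("Hello")         # 5 characters in the string

dims <- dim(mtcars)               # rows, then columns: 32 11
nrow(mtcars); ncol(mtcars)        # the same numbers, one at a time

names(mtcars)[1:3]                # "mpg" "cyl" "disp"
```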
22. 22
NA missing values
NaN not a number
Inf infinity
NULL empty/nothing
• NA has a type, determined at time of assignment
• Mixed types are coerced into the most flexible type
x <- c(TRUE, 1, 4.4, NA) ; typeof(x[4])
#> [1] "double"
• Predefined constants in base R:
letters, LETTERS, month.abb, month.name, pi
SPECIAL VALUES
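A small sketch of how the special values behave in practice, using base R only.

```r
typeof(NA)              # "logical": the default type of a bare NA
typeof(NA_character_)   # "character": typed NA constants also exist

x <- c(TRUE, 1, 4.4, NA)
typeof(x)               # "double": everything was coerced to the most flexible type
missing <- is.na(x)     # FALSE FALSE FALSE TRUE

0 / 0                   # NaN
1 / 0                   # Inf
is.na(NaN)              # TRUE: NaN counts as missing
length(NULL)            # 0: NULL holds nothing

avg <- mean(x, na.rm = TRUE)   # drop the NA before averaging
```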
33. 33
tf <- c(T,F)
df <- data.frame(x = 1:5, y = letters[1:10], z = tf)
df
#> x y z
#> 1 1 a TRUE
#> 2 2 b FALSE
#> 3 3 c TRUE
#> 4 4 d FALSE
#> 5 5 e TRUE
#> 6 1 f FALSE
#> 7 2 g TRUE
#> 8 3 h FALSE
#> 9 4 i TRUE
#> 10 5 j FALSE
QUICK START: DATAFRAMES
Combining and amending:
cbind( ), rbind( )
merge( )
data.frame(df, tf)
df$newCol <- NA
Deleting column x:
df[-1] -> df
df[,-1] -> df
df$x <- NULL
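A hedged sketch of the combining and deleting commands, rebuilding the df from this slide. The names wider and taller and the id column are illustrative additions.

```r
df <- data.frame(x = 1:5, y = letters[1:10], z = c(TRUE, FALSE))

df$newCol <- NA                  # add a column filled with NA
wider  <- cbind(df, id = 1:10)   # add a column with cbind()
taller <- rbind(df, df)          # stack rows with rbind()

df$x <- NULL                     # delete column x in place
names(df)                        # "y" "z" "newCol"
```

Each of the three deletion forms on the slide removes the first column; run only one of them, since repeating the positional forms would drop a further column each time.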
34. 34
• With what you know about lists and dataframes…
• What happens when we execute these lines?
c(df, df)
typeof(c(df, df))
str(c(df, df))
c(df, mtcars)
DATAFRAMES QUIZ
36. 36
summary
round, ceiling, floor
sin, cos, …, exp, log, log10, log2
sum, diff, filter
union, intersect
mean, sd, var, weighted.mean
median, quantile, fivenum (base R has no statistical mode function; mode() returns the storage mode)
min, max, range
&, |, !, xor
all, any
SELECTED STATS FUNCTIONS
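A few of these functions applied to the horsepower column of the built-in mtcars data. The helper stat_mode is a hypothetical name for one common idiom, since base R does not ship a statistical mode function.

```r
hp <- mtcars$hp

summary(hp)                               # min, quartiles, mean, max
fivenum(hp)                               # Tukey's five-number summary
quantile(hp, probs = c(0.25, 0.5, 0.75))  # chosen quantiles
c(mean = mean(hp), sd = sd(hp))
range(hp)                                 # 52 335
any(hp > 300); all(hp > 50)               # TRUE; TRUE

# Hypothetical helper: most frequent value, returned as a character string
stat_mode <- function(v) {
  tab <- table(v)
  names(tab)[which.max(tab)]
}
stat_mode(mtcars$cyl)                     # "8"
```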
37. 37
mymodel <- lm(MT$mpg ~ MT$hp) ### y ~ x
mymodel
#>
#> Call:
#> lm(formula = MT$mpg ~ MT$hp)
#>
#> Coefficients:
#> (Intercept) MT$hp
#> 32.77745 -0.05827
LINEAR MODEL EXAMPLE
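A sketch of the same kind of model using the data argument, which keeps the formula readable. It assumes the full built-in mtcars data rather than the MT subset, so the coefficients will differ from those on the slide.

```r
# Fit mpg as a linear function of hp, looking both columns up in mtcars
fit <- lm(mpg ~ hp, data = mtcars)

coefs <- coef(fit)            # named vector: (Intercept) and hp
summary(fit)$r.squared        # proportion of variance explained

# Predict mpg for two hypothetical horsepower values
pred <- predict(fit, newdata = data.frame(hp = c(100, 200)))
```

Using data = also makes predict() work with new data, which the MT$mpg ~ MT$hp form does not support cleanly.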
43. 43
• We can place multiple plots on one window:
par(mfrow = c(1, 2)) ### request 1x2 layout
hist(mtcars$hp, xlab = "Horsepower",
main = "Histogram of HP")
hist(mtcars$mpg, xlab = "Miles per gallon",
breaks = 10, main = "Histogram of MPG")
BASIC PLOTS: HISTOGRAM
49. 49
•Package-related commands:
library() Lists installed packages
search() Lists loaded packages
install.packages("mylib") Installs package called “mylib”
library(mylib) Loads mylib into current env
require(mylib) …used inside other functions
https://cran.r-project.org/web/packages/
PACKAGE MANAGEMENT
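One common install-if-missing pattern, sketched under the assumption that the package name is held in a string. The example uses "stats", which ships with R, so nothing is actually downloaded.

```r
pkg <- "stats"   # assumption: swap in the package you need

# requireNamespace() returns FALSE instead of failing when a package is absent
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept the name as a string
library(pkg, character.only = TRUE)

is_attached <- paste0("package:", pkg) %in% search()
```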
50. 50
plyr Tools for splitting, applying and combining data
data.table Extension of data.frame, highly optimized
ggplot2 An Implementation of the Grammar of Graphics
colorspace Color Space Manipulation
shiny Web Application Framework for R
chron Chronological Objects which handle dates and times
RCurl General Network (HTTP/FTP/...) Client Interface for R
wordcloud Make Word Clouds
rjson, RJSONIO JSON tools for R
htmltools Tools for HTML
pdftools Extract Text and Data from PDF Documents
xlsx Read, write, format Excel 2007 and Excel 97/2000/XP/2003 files
XML Tools for Parsing and Generating XML Within R and S-Plus
xtable Export Tables to LaTeX or HTML
MY "MUST-HAVE" & OTHER POPULAR PACKAGES
52. 52
• Beginning R (Wiley) FREE Chapter 1 online
• Mark Gardner
• R Graphics Cookbook (O'Reilly) FREE full text online
• Winston Chang
• Main focus is on ggplot2 package. Problem-Solution format
• R for Data Science (O'Reilly) FREE full text online
• Hadley Wickham and Garrett Grolemund
• Advanced R (CRC Press) FREE full text online
• Hadley Wickham
RECOMMENDED BOOKS
53. 53
• StackExchange
• StackOverflow http://stackoverflow.com/tags/r
• CrossValidated http://stats.stackexchange.com
• Post questions with MWE = Minimum Working Example
• R-bloggers https://www.r-bloggers.com
• News, Tutorials, Jobs … common issues often documented clearly
• R help mailing list https://stat.ethz.ch/mailman/listinfo/r-help
• Quick-R (2014) http://www.statmethods.net/
• Robert I. Kabacoff, Ph.D.
RECOMMENDED ONLINE HELP & TUTORIALS
54. www.modusoperandi.com
709 South Harbor City Blvd., Suite 400
Melbourne, FL 32901-1936
321-473-1400
Stacy Irwin
sirwin@modusoperandi.com
sirwin@gmail.com
56. 56
ls(), rm( ) list, remove objects from memory
getwd(), setwd("…") get, set working directory
grep, grepl, gsub grep family
match, identical, setdiff, unique, %in%
matching functions
• String manipulation:
nchar, strsplit, unlist, paste0, pmatch
toupper, tolower, sub, strtrim, strtoi
help.search(keyword = "character")
COMMON FUNCTIONS
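A brief tour of several of these functions. The words vector is a made-up example.

```r
words <- c("apple", "Banana", "cherry")   # hypothetical input

hits    <- grep("an", words)      # indices of matches: 2
flags   <- grepl("an", words)     # FALSE TRUE FALSE
swapped <- gsub("a", "A", words)  # "Apple" "BAnAnA" "cherry"

"cherry" %in% words               # TRUE
setdiff(1:5, c(2, 4))             # 1 3 5
unique(c(1, 1, 2, 3, 3))          # 1 2 3

parts <- unlist(strsplit("a,b,c", ","))   # "a" "b" "c"
paste0("x", 1:3)                  # "x1" "x2" "x3"
nchar(words)                      # 5 6 6
toupper(words[1])                 # "APPLE"
```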
57. 57
for(i in 1:100) ...
for(myLetter in LETTERS) ...
while(i < 100) i <- i+5
if(this == that) <do_something>
if(this %in% that) {
<do_this1>
<do_this2>
} else {
<do_that1>
<do_that2>
}
LOOPS AND CONDITIONAL FUNCTIONS
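The templates above filled in with concrete bodies, as a minimal runnable sketch. The variable names total and first3 are illustrative.

```r
# for loop over an integer sequence
total <- 0
for (i in 1:100) total <- total + i   # total ends at 5050

# while loop: i starts at 0 and steps by 5 until it reaches 100
i <- 0
while (i < 100) i <- i + 5

# for loop over a character vector, with if/else and an early exit
first3 <- character(0)
for (myLetter in LETTERS) {
  if (myLetter %in% c("A", "B", "C")) {
    first3 <- c(first3, myLetter)
  } else {
    break                              # stop at the first non-match
  }
}
```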
Editor's notes
I learned about R while completing my PhD, from a friend of mine studying meteorology. She was using R to process huge amounts of worldwide temperature and moisture data, subset it, and analyze it to gauge the effectiveness of weather prediction models. Like her, I was dealing with a big bucket of data and comparing it against models, but mine had to do with stars, planets, and their measured properties. I borrowed her intro book on R, and the rest is history. R is a fun and versatile language, and I hope I can share some of this excitement with you through this presentation. This lesson emphasizes data structures and is light on analysis examples (to be covered at a later date), but you will be exposed to the basic concepts and commands.
Syntax will be addressed throughout
Code/typing/responses will be in Courier New font
Adopts “Advanced R” style:
Input lines are shown as you would type them
Output lines are commented with #>
Easy copy-paste with comparison results
In reality, the prompt is just “> ”
R is a language
Interpreted language – great for statistical analysis, visualization, pub-quality graphics, data exploration and manipulation, scripting
800,000 lines of code
45% C
19% R
17% Fortran
R is an implementation
Open source version (GNU) of the S language and environment, which was developed at Bell Labs by John Chambers, et al. R is a GNU Project and is licensed under the GNU General Public License (GPL). Current version 3.3.2
(Bell Labs (was: AT&T, now Lucent) in Aug 1993)
Much S code runs unaltered in R, and highly extensible (needs additional packages/libraries)
file manipulation
prototyping
biostatistics
astrophysics
image processing
text analysis
statistical analysis
fast visualization
pub-quality graphics
data exploration
ETL
scripting
NLP
Word Clouds
time series
networks
graph analysis
outliers
patterns
AB testing
machine learning
neural.nets
JSON
HTML
XML
Poorly written code – most R users are not programmers or software engineers and have no formal training. They are exploring data for a quick answer.
What R won’t do well: it has trouble handling certain kinds of problems, for example raw byte data
“R is slow”
Poorly written code
The implementation is at fault
5 different ways to access a value from a dataframe
Slowest takes 30X longer than the fastest
More advanced topics include utilizing the object-oriented systems of R, using virtual memory, parallel programming with multiple cores, and rewriting functions in C++ within R
GDELT data cleaning
If R is "slow", it is often due to poorly written code – most R users are not programmers or software engineers and have no formal training. They are exploring data for a quick answer.
Admittedly, it has trouble handling certain kinds of problems, for example raw byte data manipulation and conversion
More advanced topics include utilizing the object-oriented systems of R, using virtual memory, parallel programming with multiple cores, and rewriting functions in C++ within R (so they can run faster)
There are no single-value variables; instead these are vectors of length 1. I’ll note here that the indexing of vectors, lists, dataframes, etc. starts at 1, not 0. In this tutorial we will look mainly at vectors, lists, and data frames.
Other data types: complex and raw, not covered here.
Character strings can be enclosed in single or double quotes, just be consistent
":" produces a sequence of numbers at unit intervals, not necessarily integers!
1.1:5 produces 1.1 2.1 3.1 4.1
(5.1 is beyond the limit of the expression)
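A short sketch of the colon operator next to seq(), which gives control over the step size. The variable names are illustrative.

```r
typeof(1:5)            # "integer"

s <- 1.1:5             # 1.1 2.1 3.1 4.1 (stops before exceeding 5)
typeof(s)              # "double"

q <- seq(1, 2, by = 0.25)   # 1.00 1.25 1.50 1.75 2.00

1:0                    # counts down: 1 0
seq_len(0)             # empty vector, a safer loop range than 1:0
```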
c() produces a vector of n elements – “combine”
Overwriting variables with re-assignment allowed
Built-in constants: LETTERS letters month.abb month.name pi
Index/select elements with [ ]
Coercion occurs implicitly: 1 is a number (a double; write 1L for an integer), but "1" is a character
There is no mandatory character to end a line of code, but if a parenthetical (or bracketed) expression is incomplete when enter is pressed, at least for simple expressions, R will wait for you to complete and close the expression.
Considered good practice to use the arrow form of assignment.
Logical comparisons generate T/F and are similar to other languages’ syntax.
Most comparisons are performed element-wise; a shorter vector is recycled when the longer length is a multiple of the shorter one.
Tip: everything is divisible by 1
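A small sketch of this recycling behavior. The result names r1, r2, and r3 are illustrative.

```r
r1 <- 1:6 + c(10, 20)    # length 2 recycled three times: 11 22 13 24 15 26
r2 <- 1:6 == c(1, 2)     # TRUE TRUE FALSE FALSE FALSE FALSE
1:6 * 2                  # a length-1 vector always recycles cleanly

# When the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning; it is suppressed here to keep the output clean
r3 <- suppressWarnings(1:5 + 1:2)   # 2 4 4 6 6
```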
x <- 8; print(2*(6+c((1:4)+x)))
#> [1] 30 32 34 36
Join lists with c()
Until now, only worked with vectors, 1D
3 ways to subset lists (also works with vectors!)
Simplifying vs. Preserving
Try these simple subsetting examples, examine their structure with
str(X[[1]])
str(X[1])
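The notes do not define X, so this sketch assumes a small hypothetical list to show the difference between preserving and simplifying subsets.

```r
X <- list(a = 1:3, b = "text", c = TRUE)   # hypothetical list

str(X[1])      # preserving: still a list, now of length 1
str(X[[1]])    # simplifying: the integer vector itself
X$a            # same as X[["a"]]

both <- c(X, list(d = 4.4))   # join lists with c()
length(both)                  # 4
```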
mtcars : Motor Trend road test data from 1974
Dataframes are groups of lists
Columns may be different types, but each must be self-consistent
The first list element, [1], is referenced by the preserving subset type.
Subsetting and filtering
MT is a new subsetted data frame
Creating a dataframe from scratch.
A dataframe is a special kind of list: the elements (columns!) of the dataframe must all be the same length!
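A brief sketch that checks this claim directly. The names df and err are illustrative.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)     # "list": underneath, a data frame is a list of columns
is.list(df)    # TRUE
length(df)     # 2: length counts columns, not rows
sapply(df, class)   # one class per column

# Columns whose lengths cannot be recycled to a common length are rejected
err <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
err            # the error message, captured as a string
```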
rbind/cbind, for adding rows/columns to dataframes
Do you think any of them will produce an error? Try it!
### also short-cutting: &&, ||
Options (arguments) can specify whether header exists, col.names, col types, quote handling, missing values, etc.
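A minimal round-trip sketch with write.csv() and read.csv(), showing a few of those options. It writes to a temporary file, so the path is an assumption and nothing in your working directory is touched.

```r
path <- tempfile(fileext = ".csv")     # throwaway file for the demo

# Write the first five rows and three columns of mtcars, without row names
write.csv(mtcars[1:5, 1:3], path, row.names = FALSE)

# Read it back, stating the header, missing-value strings, and column types
cars <- read.csv(path,
                 header = TRUE,
                 na.strings = c("", "NA"),
                 colClasses = c("numeric", "numeric", "numeric"))

dim(cars)      # 5 3
unlink(path)   # clean up the temporary file
```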
Also: read.xls(), write.xls()
It's considered better practice to load packages with library(), instead of require()
Quick-R author: "I created Quick-R for one simple reason. I wanted to learn R and I am a teacher at heart. The easiest way for me to learn something is to teach it."
Textual Machine Learning for Topic Extraction and Document Similarity Matching