R for Pirates. ESCCONF October 27, 2011
1. R for Pirates
Mandi Walls
@lnxchk
EscConf, Boston, MA
October 27, 2011
2. whoami
• stats misfit
• R tinkerer
• large-farm runner
• not a professional statistician :D
3. What is R
• Scripting language for stats work
• Inspired by earlier S (for statistics)
developed at AT&T
• FOSS
• Syntax descends from the Algol family, so it
looks somewhat like C/C++
4. What Does R Do?
• Manipulate data
• Complex Modeling and
Computation
• Graphics and
Visualization
6. But Other Math Stuff!
• Mathematica
• MatLab
• Minitab
• MAPLE
• Excel (yes. shutup h8rs. ask your CFOs what they
use)
• R provides sophisticated statistical and modeling
capabilities, and is extensible through your own code
7. Get R
• Available for Linux, Mac, Windows
• http://www.r-project.org/
8. Fire!
• R console on Mac
• Interactive interpreter
for your R needs
• Can also run from the
command line: R
9. R Basics
• R considers all elements
to be vectors
• A single number is a
one-element vector
• Use <- for assignment
• Use c() to concatenate
values into a vector
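A minimal sketch of the basics above. The variable names are made up for illustration.

```r
# A single number is a vector of length one
x <- 5
length(x)        # 1
is.vector(x)     # TRUE

# c() concatenates values (or whole vectors) into a new vector
v <- c(1, 2, 3)
w <- c(v, 10, x)
w                # 1 2 3 10 5
```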
12. Functions
• Looks familiar!
• Let’s see one!
• “evencount” counts the number of even ints in a vector
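The slide with the function body did not survive as text. What follows is a hypothetical reconstruction of what an "evencount" could look like, not necessarily the code shown in the talk.

```r
# Hypothetical reconstruction: count the even integers in a vector x
evencount <- function(x) {
  k <- 0                            # running count of even values
  for (n in x) {
    if (n %% 2 == 0) k <- k + 1     # %% is the modulo operator
  }
  return(k)
}

evencount(c(1, 2, 3, 4, 6))         # 3
```

Because arithmetic in R works on whole vectors, sum(x %% 2 == 0) would give the same count without an explicit loop.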
14. Datatypes
• Vectors, the important ones
• Scalars are really single-element vectors
• Character strings
• Matrices, rectangular arrays of numbers
• Lists
• Tables, useful for data transitions and temp work
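Two of the types above, matrices and tables, get no example elsewhere in the deck. A small sketch, with made-up values:

```r
# A matrix is a vector with dimensions; it is filled column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)            # 2 3
m[2, 3]           # row 2, column 3 -> 6

# table() tallies how often each distinct value occurs
codes <- c(200, 200, 404, 200, 500)   # made-up status codes
counts <- table(codes)
counts["200"]     # 3 occurrences of 200
```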
15. Vectors
• R’s most-used data structure
• All elements in a vector must have the same mode
or data type
• To add values to a vector, you concatenate into it
with the c() function
• Many mathematical functions can be applied to a whole
vector at once; vectors can also be traversed like arrays
• Index starts at 1, not 0!
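The vector rules above, as a short sketch. The values are invented for illustration.

```r
v <- c(10, 20, 30)
v[1]                  # indexing starts at 1 -> 10
v <- c(v, 40)         # "append" by concatenating into a new vector
v * 2                 # arithmetic applies element by element
mean(v)               # 25

# Mixing types coerces every element to a single mode
mixed <- c(1, "two", 3)
mode(mixed)           # "character"

# Traversing a vector like an array
total <- 0
for (i in seq_along(v)) total <- total + v[i]
total                 # 100
```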
17. Character Strings
• Single-element vectors with mode character
> y <- "abc"
> length(y)
[1] 1
> mode(y)
[1] "character"
• Can do normal string things, like
> t <- paste("yo","dawg")
> t
[1] "yo dawg"
> u <- strsplit(t,"")
> u
[[1]]
[1] "y" "o" " " "d" "a" "w" "g"
19. Lists
• Contain elements of different types
• Have a particular syntax
> x <- list(u=2, v="abc")
> x
$u
[1] 2
$v
[1] "abc"
> x$u
[1] 2
20. Data Frames
• Matrices are limited to only a single type for all elements
• A data frame can contain different types of data, can be read
in from a file or created in realtime
> df <- data.frame(list(kids=c("Olivia","Madison"),ages=c(10,8)))
> df
     kids ages
1  Olivia   10
2 Madison    8
> df$ages
[1] 10 8
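A few more things a data frame allows, sketched on the same two-row example. The new column name is made up.

```r
df <- data.frame(kids = c("Olivia", "Madison"), ages = c(10, 8))

# Each column keeps its own type
sapply(df, class)

# Select rows with a logical condition on a column
older <- df[df$ages > 8, ]

# Add a derived column
df$ages_next_year <- df$ages + 1
```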
21. Putting R to Work
• Read in a log file:
> access <- read.table("access.log", header=FALSE)
> head(access)
V1 V2 V3 V4 V5 V6 V7 V8
1 192.168.1.10 - - [23/Oct/2011:07:03:33 -0500] GET /menu/menu.js HTTP/1.1 401 401
2 192.168.1.10 - - [23/Oct/2011:07:03:33 -0500] GET /menu/menu.js HTTP/1.1 200 1970
3 192.168.1.10 - - [23/Oct/2011:07:03:33 -0500] GET /menu/menu.css HTTP/1.1 200 2258
22. Fun with Plots
• This plot series is going to
make use of the “return
codes” from the access log
• We’ll do a series of plots
that gradually get more
sophisticated
• This is a basic histogram of
the data; it’s not much fun
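The plot slides themselves did not survive as text, so here is a rough sketch of a first plot. It uses a made-up vector of return codes rather than the real log; in the access data frame above the status appears to be column V7, but that is an assumption.

```r
# Made-up return codes standing in for the status column of the access log
codes <- c(200, 200, 401, 200, 304, 404, 200, 500, 304, 200)

# Basic histogram: treats the codes as continuous numbers
h <- hist(codes, main = "Return Codes", xlab = "HTTP status")

# Status codes are really categories, so a bar plot of counts is often clearer
counts <- table(codes)
barplot(counts, main = "Requests by Return Code",
        xlab = "HTTP status", ylab = "Requests")
```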
27. Writing Graphical
Output to Files
• Set up the output target by calling a graphics function:
• pdf(), png(), jpeg(), etc.
• jpeg("/var/www/images/returncodes-date.jpg")
• Call the plot function you have chosen, then call dev.off()
• Can be used in batch mode to create graphics from your data
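The open-device, plot, dev.off() sequence, as a minimal sketch. The output path and data are hypothetical; pdf() is used here only because it needs no display, and png() or jpeg() follow the same pattern.

```r
out <- file.path(tempdir(), "returncodes.pdf")   # hypothetical output path

pdf(out)                                         # open the file device
barplot(table(c(200, 200, 404, 500)),            # made-up data
        main = "Return Codes")
dev.off()                                        # close the device so the file is written

file.exists(out)
```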
28. Shopping is Hard, Let’s
Do Math
• Read in some load averages (one-min)
loadavg<-read.table("load_avg.txt")
head(loadavg)
V1
1 3.79
2 3.11
3 2.94
4 4.81
29. Summary Stats
• Summarize the data with one function call
• Gives the min, max, mean, median, and quartiles
summary(loadavg)
V1
Min. :0.760
1st Qu.:1.390
Median :1.970
Mean :2.302
3rd Qu.:3.080
Max. :5.070
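The same figures can be pulled out one at a time. This sketch uses only the four values shown by head() earlier, so its results will not match the full-file summary above.

```r
loads <- c(3.79, 3.11, 2.94, 4.81)   # the four values shown by head(loadavg)

mean(loads)                          # arithmetic mean
median(loads)                        # middle value
range(loads)                         # min and max
quantile(loads, c(0.25, 0.75))       # first and third quartiles
sd(loads)                            # standard deviation, not part of summary()
```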
31. Same Thing, 3
Datacenters
> cpu<-read.table("cpu")
> head(cpu)
V1 V2
1 3.78 smq
2 2.57 smq
3 3.69 smq
4 0.86 smq
• Looks like there are outliers. That could spell
trouble! You found them with R awesomeness.
Hooray!
boxplot(cpu[,1] ~ cpu[,2],
  xlab="Load Average at Time t, by Datacenter",
  ylab="One-Minute Load Average",
  main="Box Plot of One-Minute Load Average, FEs",
  col=topo.colors(3))
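To list the outliers instead of eyeballing them, one option is boxplot.stats(), which reports the points that fall beyond the whiskers. A sketch on made-up data; the "abc" datacenter and all the values are invented.

```r
# Made-up stand-in for the cpu data frame: load average (V1), datacenter (V2)
cpu <- data.frame(
  V1 = c(1.2, 1.4, 1.3, 1.5, 1.1, 9.8, 2.0, 2.2, 2.1, 1.9, 2.3, 2.0),
  V2 = rep(c("smq", "abc"), each = 6)
)

# Split the loads by datacenter, then collect the points past the whiskers
outliers <- lapply(split(cpu$V1, cpu$V2),
                   function(x) boxplot.stats(x)$out)

outliers$smq    # 9.8
outliers$abc    # numeric(0): nothing beyond the whiskers
```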
32. Running R in Your
Workflow
• The little bit of boxplotting we did earlier, in a script:
[mandi@mandi ~]$ cat sample.R
#!/usr/bin/env Rscript
cpu<-read.table("cpu")
jpeg("./sample.jpg")
boxplot(cpu[,1] ~ cpu[,2],
  xlab="Load Average at Time t, by Datacenter",
  ylab="One-Minute Load Average",
  main="Box Plot of One-Minute Load Average, FEs",
  col=heat.colors(3))
dev.off()
[mandi@mandi ~]$ Rscript sample.R > /dev/null
[mandi@mandi ~]$ ls -l sample.jpg
-rw-rw-r-- 1 mandi staff 20137 Oct 24 20:44 sample.jpg
34. What Else?
• R can read data input from a variety of files with regular
formats
• R can also fetch data from the internet using the url()
function
• R has a number of functions available for dealing with
reading data, creating data frames or other structures, and
converting string text into numerical data modes
• Extended packages provide support for structured data
formats like JSON.
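A small sketch of reading delimited text and converting strings to numbers. Inline text stands in for a file or URL, and the column names and values are made up.

```r
# Inline text stands in for a file; "n/a" is a deliberately messy entry
raw <- "host,load\nweb01,3.79\nweb02,n/a\nweb03,2.94"
d <- read.csv(text = raw, stringsAsFactors = FALSE)

mode(d$load)    # "character", because of the "n/a" entry

# Convert to numeric; values that do not parse become NA
d$load <- suppressWarnings(as.numeric(d$load))
mean(d$load, na.rm = TRUE)

# The same reader accepts a connection (not run, hypothetical URL):
# d <- read.csv(url("http://example.com/load.csv"))
```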
35. References
• http://www.slideshare.net/dataspora/an-interactive-introduction-to-r-programming-language-for-statistics
• http://www.harding.edu/fmccown/R/
• The Art of R Programming, Norman Matloff, Copyright 2011 No Starch Press
• Statistical Analysis with R, John M. Quick, Copyright 2011 Packt Publishing