Data analysis with R

SHARETHIS
DATA ANALYSIS with R
Hassan Namarvar

2
WHAT IS R?
• R is a free programming language and software environment
for statistical computing and graphics.
• It is similar to the S language, developed at AT&T Bell Labs by Rick
Becker, John Chambers and Allan Wilks.
• R was initially developed by Ross Ihaka and Robert Gentleman
(1996), from the University of Auckland, New Zealand.
• R source code is written in C, Fortran, and R.

3
R PARADIGMS
Multi-paradigm:
– Array
– Object-oriented
– Imperative
– Functional
– Procedural
– Reflective
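A minimal sketch of how a few of these paradigms look in practice. The function and class names (make_power, describe, circle) are hypothetical and chosen only for illustration:

```r
# Array style: arithmetic applies to whole vectors, no explicit loop.
v <- c(1, 2, 3, 4)
squares <- v^2

# Functional style: functions are values and can be returned or passed around.
make_power <- function(p) function(z) z^p   # closure that captures p
cube <- make_power(3)
cubes <- sapply(v, cube)

# Object-oriented style (S3): the method is chosen from the class attribute.
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "generic object"
describe.circle <- function(obj) paste("circle of radius", obj$radius)
shape <- structure(list(radius = 2), class = "circle")
label <- describe(shape)

# Reflective style: an expression can be stored as data and evaluated later.
expr <- quote(v * 10)
scaled <- eval(expr)
```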

4
STATISTICAL FEATURES
• Graphical Techniques
• Linear and nonlinear modeling
• Classical statistical tests
• Time-series analysis
• Classification
• Clustering
• Machine learning

5
PROGRAMMING FEATURES
• R is an interpreted language
• Access R through a command-line interpreter
• Like MATLAB, R supports matrix arithmetic
• Data structures:
– Vectors
– Matrices
– Arrays
– Data Frames
– Lists
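As a rough illustration, each of the data structures listed above can be created with a base R constructor. The variable names and values here are made up for the example:

```r
vec <- c(4, 7, 23.5)                                      # vector: elements share one type
mat <- matrix(1:6, nrow = 2, ncol = 3)                    # matrix: two dimensions
arr <- array(1:24, dim = c(2, 3, 4))                      # array: any number of dimensions
df  <- data.frame(id = 1:3, name = c("aa", "bb", "cc"))   # data frame: columns may differ in type
lst <- list(numbers = vec, table = df, flag = TRUE)       # list: arbitrary mixed contents

str(lst)   # compact overview of a nested structure
```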

6
ADVANTAGES OF R
• The most comprehensive statistical analysis package
available.
• Outstanding graphical capabilities
• Open source software – reviewed by experts
• R is free and licensed under the GNU General Public License (GPL).
• R has over 5,578 packages as of May 31, 2014!
• R is cross-platform: GNU/Linux, Mac, Windows.
• R plays well with CSV, SAS, SPSS, Excel, Access, Oracle, MySQL,
and SQLite.

7
HOW TO INSTALL R?
• Download and install the latest version from:
– http://cran.r-project.org
• Install packages from R Console:
– > install.packages('package_name')
• R has its own LaTeX-like documentation:
– > help()

8
STARTING WITH R
• In R console:
– > x <- 2
– > x
– > y <- x^2
– > y
– > ls()
– > rm(y)
• Vectors:
– > v <- c(4, 7, 23.5, 76.2, 80)
– > summary(v)
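A short sketch of what can be done with the vector v once it exists. R names are case-sensitive, so the summary function is spelled summary() in lower case. The variable names big and s are illustrative only:

```r
v <- c(4, 7, 23.5, 76.2, 80)

doubled <- v * 2        # arithmetic is applied element by element
big <- v[v > 20]        # logical indexing keeps the elements above 20

s <- summary(v)         # min, quartiles, median, mean, max
s[["Median"]]           # individual entries can be pulled out by name
mean(v)
```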

9
STARTING WITH R
• Histogram:
– > r <- rnorm(100)
– > summary(r)
– > plot(r)
– > hist(r)
• QQ-Plot (Quantile):
– > qqplot(r, rnorm(1000))
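To make the idea behind a Q-Q plot concrete, the sketch below compares sample quantiles with theoretical normal quantiles directly. It uses qqnorm(), the base R helper for comparing a sample against the normal distribution, instead of a second random sample; the seed is arbitrary:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
r <- rnorm(100)

probs <- c(0.25, 0.5, 0.75)
sample_q <- quantile(r, probs)   # quantiles observed in the sample
theory_q <- qnorm(probs)         # quantiles of the standard normal
cbind(sample_q, theory_q)

# qqnorm() pairs every sorted observation with its theoretical quantile.
# Drop plot.it = FALSE to draw the plot, then add qqline(r) as a reference.
qq <- qqnorm(r, plot.it = FALSE)
cor(qq$x, qq$y)                  # near 1 when the sample looks normal
```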

10
STARTING WITH R
• Factors:
– > g <- c('f', 'm', 'm', 'm', 'f', 'm', 'f', 'm')
– > h <- factor(g)
– > table(g)
• Matrices:
– > r <- rnorm(100)
– > dim(r) <- c(50,2)
– > r
– > summary(r)
– > M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow=TRUE)
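A brief sketch of common follow-up operations on the factor and matrix defined above. Nothing here goes beyond base R:

```r
g <- c("f", "m", "m", "m", "f", "m", "f", "m")
h <- factor(g)
levels(h)                 # the distinct categories
counts <- table(h)        # frequency of each category

M <- matrix(c(45, 23, 66, 77, 33, 44), 2, 3, byrow = TRUE)
dim(M)                    # 2 rows, 3 columns
M[2, 3]                   # element in row 2, column 3
col_means <- colMeans(M)  # mean of each column
G <- M %*% t(M)           # matrix product with the transpose (2 x 2)
```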

11
STARTING WITH R
• Data Frames:
– > n = c(2, 3, 5)
– > s = c("aa", "bb", "cc")
– > b = c(TRUE, FALSE, TRUE)
– > df = data.frame(n, s, b)
• Built-in Data Set:
– > state.x77
– > st = as.data.frame(state.x77)
– > st$Density = st$Population * 1000 / st$Area
– > summary(st)
– > cor(st)
– > pairs(st)
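Once the data frame exists, rows can be filtered and sorted with ordinary indexing. The sketch below rebuilds st as above; the threshold of 200 and the names dense and top are arbitrary choices for illustration:

```r
st <- as.data.frame(state.x77)
st$Density <- st$Population * 1000 / st$Area   # Population is stored in thousands

# Rows with more than 200 people per square mile, a few columns only.
dense <- st[st$Density > 200, c("Population", "Area", "Density")]

# The three most densely populated states.
top <- head(st[order(-st$Density), ], 3)
rownames(top)

# Correlation between two individual columns.
cor(st$Murder, st$Illiteracy)
```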

12
STARTING WITH R
[Figure: scatterplot matrix produced by pairs(st), with panels for Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, Area, and Density]

13
LINEAR REGRESSION MODEL IN R
• Linear Regression Model:
– > x <- 1:100
– > y <- x^3
– Model: y = a + b * x
– > lm(y ~ x)
– > model <- lm(y ~ x)
– > summary(model)
– > par(mfrow=c(2,2))
– > plot(model)

14
LM MODEL
– Call:
– lm(formula = y ~ x)
– Residuals:
– Min 1Q Median 3Q Max
– -129827 -103680 -29649 85058 292030
– Coefficients:
– Estimate Std. Error t value Pr(>|t|)
– (Intercept) -207070.2 23299.3 -8.887 3.14e-14 ***
– x 9150.4 400.6 22.844 < 2e-16 ***
– ---
– Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
– Residual standard error: 115600 on 98 degrees of freedom
– Multiple R-squared: 0.8419, Adjusted R-squared: 0.8403
– F-statistic: 521.9 on 1 and 98 DF, p-value: < 2.2e-16
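The numbers in this printout can also be read out of the fitted object programmatically, which is useful when the results feed into further code. A minimal sketch using the same x and y as the regression example:

```r
x <- 1:100
y <- x^3
model <- lm(y ~ x)

coefs <- coef(model)                  # named vector: (Intercept) and x
r2 <- summary(model)$r.squared        # multiple R-squared
sigma_hat <- summary(model)$sigma     # residual standard error

# A straight line cannot follow a cubic, so the residuals are positive
# at both ends of the range and negative in the middle.
res <- residuals(model)
c(first = res[[1]], middle = res[[50]], last = res[[100]])
```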

15
LM MODEL
[Figure: plot of y = x^3 against x, for x from 0 to 100]

16
DIAGNOSIS PLOT
[Figure: the four diagnostic plots from plot(model): Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance). Observations 98, 99, and 100 are labeled in each panel.]

17
• Model Built-in Data:
– > colnames(st)[4] = "Life.Exp"
– > colnames(st)[6] = "HS.Grad"
– > model1 = lm(Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + Frost + Area + Density, data=st)
– > summary(model1)
– > model2 <- step(model1)
– > model3 = update(model2, .~.-Population)
– > summary(model3)
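step() searches for a smaller model by comparing AIC values. The sketch below repeats the setup so that it runs on its own and then compares the full and reduced models. Passing trace = 0 only suppresses the step-by-step log:

```r
st <- as.data.frame(state.x77)
st$Density <- st$Population * 1000 / st$Area
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

model1 <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
               HS.Grad + Frost + Area + Density, data = st)
model2 <- step(model1, trace = 0)   # stepwise selection by AIC

formula(model2)                     # which predictors survived
aic_full <- AIC(model1)
aic_reduced <- AIC(model2)          # lower AIC is better
```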

18
• Confidence limits on Estimated Coefficients:
– > confint(model3)
– > predict(model3, list(Murder=10.5,
HS.Grad=48, Frost=100))
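predict() can also return an interval around the point estimate. The sketch below assumes that model3 ends up with exactly the three predictors passed to predict() above (Murder, HS.Grad, Frost) and fits that model directly so the example is self-contained; new_state is a hypothetical name:

```r
st <- as.data.frame(state.x77)
colnames(st)[4] <- "Life.Exp"
colnames(st)[6] <- "HS.Grad"

# Assumption: the final model keeps only these three predictors.
model3 <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = st)

ci <- confint(model3, level = 0.95)   # one row per coefficient: lower, upper

new_state <- data.frame(Murder = 10.5, HS.Grad = 48, Frost = 100)
conf_int <- predict(model3, new_state, interval = "confidence")  # mean response
pred_int <- predict(model3, new_state, interval = "prediction")  # a single new state
```

The prediction interval is wider than the confidence interval because it also accounts for the scatter of individual observations around the fitted mean.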

19
OUTLIERS
• Boxplot:
– > v <- rnorm(100)
– > v = c(v,10)
– > boxplot(v)
– > rug(jitter(v), side=2)
[Figure: boxplot of v with a rug of jittered points along the vertical axis]
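The points a boxplot draws as outliers can be listed without plotting. A minimal sketch using boxplot.stats(), followed by a manual version of the usual 1.5 * IQR rule; the seed is arbitrary, and the manual fences can differ slightly from the boxplot's because boxplot.stats() uses hinges instead of quantile():

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
v <- c(rnorm(100), 10)             # 100 normal values plus one planted outlier

out <- boxplot.stats(v)$out        # values beyond the whiskers

# Manual rule: anything more than 1.5 * IQR outside the quartiles.
q <- quantile(v, c(0.25, 0.75))
fence <- q + c(-1.5, 1.5) * IQR(v)
manual <- v[v < fence[1] | v > fence[2]]
```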

20
PROBABILITY DENSITY FUNCTION
• PDF:
– > r <- rnorm(1000)
– > hist(r, prob=TRUE)
– > lines(density(r), col="red")
[Figure: "Histogram of r" with the estimated density curve overlaid; x-axis r, y-axis Density]
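A quick numerical check of what the density scale means: on that scale the histogram bars have total area 1, and the kernel density estimate integrates to roughly 1. The sketch computes both without drawing anything; the seed and variable names are arbitrary:

```r
set.seed(7)                                   # arbitrary seed
r <- rnorm(1000)

h <- hist(r, plot = FALSE)                    # same bins as hist(r), no drawing
bar_area <- sum(h$density * diff(h$breaks))   # total area of the bars

d <- density(r)                               # kernel density estimate on a regular grid
dx <- diff(d$x)[1]                            # grid spacing
curve_area <- sum(d$y) * dx                   # Riemann-sum approximation of the integral

peak <- d$x[which.max(d$y)]                   # location of the estimated mode
```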

21
CASE STUDY: SHARETHIS EXAMPLE
• Relationship of clicks to winning price and impressions on
ADX:
• Data
– Analyzed ADX Hourly Impression Logs
• Method
– Detected outliers
– Predicted clicks using a regression tree model

22
• Outlier Detection:
[Figure: two panels, Clicks and Impressions]

23
• Regression Tree
– One of the most powerful classification/regression methods
– > library(rpart)
– > fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data=x)
– > plot(fit)
– > text(fit)
– > plot(predict(fit), log(x$CLK))
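The data frame x used above is not included in the slides. To make the rpart call runnable, the sketch below fabricates a stand-in data set with the same column names (CLK, IMP, AVG_PRICE, SD_PRICE). The value ranges and the click rate are pure assumptions and say nothing about the real ADX logs:

```r
library(rpart)

set.seed(123)                                     # arbitrary seed
n <- 500
# Hypothetical stand-in data; ranges and click rate are invented.
x <- data.frame(
  IMP       = round(exp(runif(n, 7, 13))),        # hourly impressions
  AVG_PRICE = runif(n, 0.5, 2.5),                 # average winning price
  SD_PRICE  = runif(n, 0.05, 0.6)                 # spread of winning price
)
x$CLK <- rpois(n, lambda = x$IMP * 0.0005) + 1    # +1 keeps log(CLK) finite

fit <- rpart(log(CLK) ~ log(IMP) + AVG_PRICE + SD_PRICE, data = x)
printcp(fit)                                      # complexity table of the fitted tree

pred <- predict(fit)                              # fitted values on the training data
rmse <- sqrt(mean((pred - log(x$CLK))^2))         # in-sample error on the log scale
```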

24
• Regression Tree
[Figure: plot of the fitted regression tree. The root split is log(IMP) < 9.33; further splits use log(IMP) (8.349, 10.04, 10.39, 11.28, 12.49), SD_PRICE (0.2604), and AVG_PRICE (0.8555, 1.247, 1.713). Leaf predictions range from 0.751 to 4.753.]

25
• Predict Log of Clicks
[Figure: scatter plot of predict(fit) and log(x$CLK)]

26
CASE STUDY: COLOR DETECTION
• Detect color from product image:
[Figure: three panels, each with both axes running from -1.0 to 1.0]

27
RESOURCES
• Books:
– An Introduction to Statistical Learning: with
Applications in R by G. James, D. Witten, T. Hastie,
R. Tibshirani, 2013
– The Art of R Programming: A Tour of Statistical
Software Design, N. Matloff, 2011
– R Cookbook (O'Reilly Cookbooks), P. Teetor, 2011
• R Blog:
– http://www.r-bloggers.com
