2. Session Road Map
First Steps
Importing Data into R
R Basics
Data Visualization
Correlation & Regression
t-Test
Chi-squared Test
ANOVA
PCA
Clustering
Time Series
Programming
Publication-Quality Output
3. First Steps (1)
R is one of the most popular platforms for data
analysis and visualization currently available. It is
free and open source software:
http://www.r-project.org
Take advantage of its broad coverage and the
availability of new, cutting-edge techniques.
R will enable us to develop and distribute solutions
to our NARS with no hidden license cost.
6. First Steps (4)
citation()
R Development Core Team (2009). R: A language and
environment for statistical computing. R Foundation
for Statistical Computing, Vienna, Austria. ISBN
3-900051-07-0, URL http://www.R-project.org.
7. First Steps (5)
If you know the name of the function you want help
with, you just type a question mark ? at the command
line prompt followed by the name of the function:
?read.table
8. First Steps (6)
Sometimes you cannot remember the precise name of
the function, but you know the subject on which you
want help. Use the help.search function with your
query in double quotes like this:
help.search("data input")
9. First Steps (7)
To see a worked example, pass the function name to
the example function:
example(mean)
mean> x <- c(0:10, 50)
mean> xm <- mean(x)
mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50
mean> mean(USArrests, trim = 0.2)
Murder Assault UrbanPop Rape
7.42 167.60 66.20 20.16
10. First Steps (8)
There are hundreds of contributed packages for
R, written by many different authors (to implement
specialized statistical methods). Most are available for
download from CRAN (http://CRAN.R-project.org)
List all available packages: library()
Load package “ggplot2”: library(ggplot2)
Documentation on a package: library(help=ggplot2)
11. Importing Data into R (1)
data <- read.table("D:/path/file.txt", header=TRUE)
data <- read.csv(file.choose(), header=TRUE, sep=";")
data <- edit(data)
fix(data)
head(data)
tail(data)
tail(data, 10)
12. Importing Data into R (2)
In order to refer to a vector by name within an R session,
you need to attach the dataframe containing the
vector. Alternatively, you can refer to the dataframe
name and the vector name within it, using the element
name operator $ like this: mtcars$mpg
?mtcars
attach(mtcars)
mpg
14. Importing Data into R (4)
# Read data left on the clipboard
data <- read.table("clipboard", header=T)
# ODBC
library(RODBC)
db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd")
raw <- sqlQuery(db1, "SELECT * FROM table1")
# XLSX
library(XLConnect)
xls <- loadWorkbook("my_file.xlsx", create=F)
raw <- as.data.frame(readWorksheet(xls,sheet='Sheet1'))
15. R Basics (1)
max(x) maximum value in x
min(x) minimum value in x
mean(x) arithmetic average of the values in x
median(x) median value in x
var(x) sample variance of x
sd(x) standard deviation of x
cor(x,y) correlation between vectors x and y
summary(x) generic function used to produce
result summaries of the results of various functions
16. R Basics (2)
abs(x) absolute value
floor(2.718) largest integer not greater than x (returns 2)
ceiling(3.142) smallest integer not less than x (returns 4)
asin(x) inverse sine of x in radians
round(2.718, digits=2) returns 2.72
x <- 1:12; sample(x) Simple randomization
RCBD randomization:
RCBD <- replicate(3, sample(x))
17. R Basics (3)
Common data transformations:
Measurements (lengths, weights, etc.):
  loge        log(x)
  log10       log(x, 10) or log10(x)
  log x+1     log(x + 1)
Counts (number of individuals, etc.):
  square root sqrt(x)
Percentages (must be proportions):
  arcsine     asin(sqrt(x))*180/pi
* where x is the name of the vector (variable) whose values are to be transformed.
18. R Basics (4)
Vectorized computations:
Any function call or operator applied to a vector will
automatically operate directly on all elements of the
vector.
nchar(month.name) # 7 8 5 5 3 4 4 6 9 7 8 8
The recycling rule:
The shorter vector is replicated enough times so that the
result has the length of the longer vector, then the
operator is applied.
1:10 + 1:3 # 2 4 6 5 7 9 8 10 12 11
19. R Basics (5)
mydata <- matrix(rnorm(30), nrow=6)
mydata
# calculate the 6 row means
apply(mydata, 1, mean)
# calculate the 5 column means
apply(mydata, 2, mean)
apply(mydata, 2, mean, trim=0.2)
21. R Basics (7)
Surprisingly, the base installation doesn’t provide
functions for skew and kurtosis, but you can add your
own:
m <- mean(x)
n <- length(x)
s <- sd(x)
skew <- sum((x-m)^3/s^3)/n
kurt <- sum((x-m)^4/s^4)/n - 3
22. Data Visualization (1)
Use pairs() for a matrix of scatter plots of
every variable against every other:
?mtcars
pairs(mtcars)
Voilà!
24. Data Visualization (3)
plot(x, y) gives a scatter plot if x is continuous, and a
box-and-whisker plot if x is a factor. Some people prefer
the alternative syntax plot(y~x):
attach(mtcars)
plot(wt, mpg)
plot(cyl, mpg)
cyl <- factor(cyl)
plot(cyl, mpg)
30. Correlation and Regression (1)
If you want to determine the significance of a
correlation (i.e. the p value associated with the
calculated value of r) then use cor.test rather than cor.
cor(wt, mpg)
[1] -0.8676594
The value will vary from -1 to +1. A -1 indicates perfect
negative correlation, and +1 indicates perfect positive
correlation. 0 means no correlation.
31. Correlation and Regression (2)
cor.test(wt, qsec)
Pearson's product-moment correlation
data: wt and qsec
t = -0.9719, df = 30, p-value = 0.3389
alternative hypothesis: true correlation is not
equal to 0
95 percent confidence interval:
-0.4933536 0.1852649
sample estimates:
cor
-0.1747159
32. Correlation and Regression (3)
cor.test(wt, mpg)
Pearson's product-moment correlation
data: wt and mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not
equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
34. Correlation and Regression (5)
Fits a linear model with normal errors and constant
variance; generally this is used for regression analysis
using continuous explanatory variables.
fit <- lm(y ~ x)
summary(fit)
plot(x, y)
# Sample of multiple linear regression
fit <- lm(y ~ x1 + x2 + x3)
35. Correlation and Regression (6)
Call:
lm(formula = mpg ~ wt)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
36. Correlation and Regression (7)
The great thing about graphics in R is that it is
extremely straightforward to add things to your plots.
In the present case, we might want to add a regression
line through the cloud of data points. The function for
this is abline which can take as its argument the linear
model object:
abline(fit)
Note: the abline(a, b) function adds a straight line
with an intercept of a and a slope of b
38. Correlation and Regression (9)
predict() is a generic built-in function for obtaining
predictions from the results of various model-fitting functions:
predict(fit, list(wt = 4.5))
[1] 13.23500
40. Correlation and Regression (11)
What do you do if you identify problems?
There are four approaches to dealing with violations of
regression assumptions:
Deleting observations
Transforming variables
Adding or deleting variables
Using another regression approach
41. Correlation and Regression (12)
You can compare the fit of two nested models using
the anova() function in the base installation. A nested
model is one whose terms are completely included in
the other model.
fit1 <- lm (y ~ A + B + C)
fit2 <- lm (y ~ A + C)
anova(fit1, fit2)
If the test is not significant (i.e. p > 0.05), we conclude
that B in this case doesn't add to the linear prediction
and we're justified in dropping it from our model.
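As a concrete sketch, assuming the built-in mtcars data (the choice of variables here is illustrative, not part of the slide):

```r
# Hypothetical example: does disp improve a model that
# already contains wt and hp?
fit1 <- lm(mpg ~ wt + hp + disp, data = mtcars)  # full model
fit2 <- lm(mpg ~ wt + hp,        data = mtcars)  # nested model (disp dropped)
anova(fit2, fit1)  # a large p-value means disp adds little
```

In this case the comparison p-value is well above 0.05, so dropping disp from the model is justified.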
42. Correlation and Regression (13)
# Bootstrap 95% CI for R-Squared
library(boot)
rsq <- function(formula, data, indices) {
fit <- lm(formula, data= data[indices,])
return(summary(fit)$r.squared)
}
rs <- boot(data=mtcars, statistic=rsq, R=1000,
formula=mpg~wt+disp)
boot.ci(rs, type="bca") # try print(rs) and plot(rs)
43. t-Test (1)
Comparing two sample means with normal errors
(Student’s t test, t.test)
t.test(a, b)
t.test(a, b, paired = TRUE)
# alternative argument options:
# "two.sided", "less", "greater"
a <- qsec[cyl == 4]
b <- qsec[cyl == 6]
c <- qsec[cyl == 8]
44. t-Test (2)
t.test(a, b)
Welch Two Sample t-test
data: a and b
t = 1.4136, df = 12.781, p-value = 0.1814
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
-0.6159443 2.9362040
sample estimates:
mean of x mean of y
19.13727 17.97714
45. t-Test (3)
t.test(a, c)
Welch Two Sample t-test
data: a and c
t = 3.9446, df = 17.407, p-value = 0.001005
alternative hypothesis: true difference in means is
not equal to 0
95 percent confidence interval:
1.102361 3.627899
sample estimates:
mean of x mean of y
19.13727 16.77214
46. t-Test (4)
(a) Test the equality of variances assumption:
ev <- var.test(a, c)$p.value
(b) Test the normality assumption:
an <- shapiro.test(a)$p.value
bn <- shapiro.test(c)$p.value
47. Chi-squared Test (1)
Construct hypotheses based on qualitative (categorical) data:
myTable <- table(am, cyl)
myTable
cyl
am 4 6 8
automatic 3 4 12
manual 8 3 2
48. Chi-squared Test (2)
chisq.test(myTable)
Pearson's Chi-squared test
data: myTable
X-squared = 8.7407, df = 2, p-value = 0.01265
The expected counts under the null hypothesis:
chisq.test(myTable)$expected
cyl
am 4 6 8
automatic 6.53125 4.15625 8.3125
manual 4.46875 2.84375 5.6875
50. ANOVA (1)
A method which partitions the total variation in the
response into the components (sources of variation) in
the above model is called the analysis of variance.
table(N, S, Rep)
N <- factor(N)
S <- factor(S)
Rep <- factor(Rep)
51. ANOVA (2)
The best way to
understand the two
significant interaction
terms is to plot them using
interaction.plot like this:
interaction.plot(S, N, Yield)
59. ANOVA (10)
summary(aov(Yield ~ N * S + Error(Rep))) #RCB
Error: Rep
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 2 0.30191 0.15095
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
N 2 4.5818 2.2909 51.2035 5.289e-09 ***
S 3 0.9798 0.3266 7.3001 0.001423 **
N:S 6 0.6517 0.1086 2.4277 0.059281 .
Residuals 22 0.9843 0.0447
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
60. ANOVA (11)
In a split-plot design, different treatments are applied
to plots of different sizes. Each different plot size is
associated with its own error variance.
The model formula is specified as a factorial, using the
asterisk notation. The error structure is defined in the
Error term, with the plot sizes listed from left to
right, from largest to smallest, with each variable
separated by the slash operator /.
model <- aov(Yield ~ N * S + Error(Rep/N))
61. ANOVA (12)
Error: Rep
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 2 0.30191 0.15095
Error: Rep:N
Df Sum Sq Mean Sq F value Pr(>F)
N 2 4.5818 2.29088 55.583 0.001206 **
Residuals 4 0.1649 0.04122
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
S 3 0.97983 0.32661 7.1744 0.002280 **
N:S 6 0.65171 0.10862 2.3860 0.071313 .
Residuals 18 0.81943 0.04552
62. ANOVA (13)
Analysis of Covariance:
# f is treatment factor
# x is variate acts as covariate
model <- aov(y ~ x * f)
Split both main effects into linear and quadratic parts.
contrasts <- list(N = list(lin=1, quad=2),
S = list(lin=1, quad=2))
summary(model, split=contrasts)
63. PCA (1)
The idea of principal components analysis (PCA) is to
find a small number of linear combinations of the
variables so as to capture most of the variation in the
dataframe as a whole.
d2 <- cbind(wt, disp/10, hp/10, mpg, qsec)
colnames(d2) <- c("wt", "disp", "hp", "mpg", "qsec")
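A minimal sketch of actually running the PCA on this matrix (prcomp is base R; the matrix is rebuilt from mtcars here so the snippet runs standalone, and the decision to scale the variables is ours):

```r
# Rebuild the matrix from mtcars, then run PCA
d2 <- with(mtcars, cbind(wt, disp/10, hp/10, mpg, qsec))
colnames(d2) <- c("wt", "disp", "hp", "mpg", "qsec")
pc <- prcomp(d2, scale. = TRUE)  # standardize so units don't dominate
summary(pc)       # proportion of variance captured by each component
pc$rotation[, 1]  # loadings of the first principal component
```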
67. Clustering (1)
We define similarity on the basis of the distance
between two samples in this n-dimensional space.
Several different distance measures could be used to
work out the distance from every sample to every other
sample. This quantitative dissimilarity structure of the
data is stored in a matrix produced by the dist function:
rownames(d2) <- rownames(mtcars)
my.dist <- dist(d2, method="euclidean")
68. Clustering (2)
Initially, each sample is assigned to its own cluster, and
then the hclust algorithm proceeds iteratively, at each
stage joining the two most similar clusters, continuing
until there is just a single cluster (see ?hclust for
details).
my.hc <- hclust(my.dist, "ward.D2") # called "ward" in R < 3.1.0
69. Clustering (3)
We can plot the object called my.hc, specifying that
the leaves of the hierarchy are labeled by their row
names (here, the car names):
plot(my.hc, hang=-1)
g <- rect.hclust(my.hc, k=4, border="red")
Note:
When the hang argument is set to '-1' then all leaves
end on one line and their labels hang down from 0.
71. Clustering (5)
Partitioning into a number of clusters specified by the user.
gr <- kmeans(cbind(disp, hp), 2)
plot(disp, hp, col = gr$cluster, pch=19)
points(gr$centers, col = 1:2, pch = 8, cex=2)
75. Time Series (1)
First, make the data variable into a time series object
# create time-series objects
beer <- ts(beer, start=1956, freq=12)
It is useful to be able to decompose a time series into
components. The function stl performs seasonal
decomposition of a time series into seasonal, trend
and irregular components using loess.
76. Time Series (2)
The remainder component is the residuals from the
seasonal plus trend fit. The bars at the right-hand side
are of equal heights (in user coordinates).
# Decompose a time series into seasonal,
# trend and irregular components using loess
ts.comp <- stl(beer, s.window="periodic")
plot(ts.comp)
78. Programming (1)
We can extend the functionality of R by writing a
function that estimates the standard error of the mean
SEM <- function(x, na.rm = FALSE) {
if (na.rm == TRUE) VAR <- x[!is.na(x)]
else VAR <- x
SD <- sd(VAR)
N <- length(VAR)
SE <- SD/sqrt(N) # standard error of the mean: sd/sqrt(n)
return(SE)
}
79. Programming (2)
You can define your own operator of the form %any%
using any text string in place of any. The function
should be a function of two arguments.
"%p%" <- function(x,y) paste(x,y,sep=" ")
"Hi" %p% "Khaled"
[1] "Hi Khaled"
80. Programming (3)
setwd("path/to/folder")
sink("output.txt")
cat("Intercept\tSlope\n")
a <- fit$coefficients[[1]]
b <- fit$coefficients[[2]]
cat(paste(a, b, sep="\t"))
sink()
jpeg(filename="graph.jpg", width=600, height=600)
plot(wt, mpg); abline(fit)
dev.off()
81. Programming (4)
The code for R functions can be viewed, and in most
cases modified if so desired, using the fix() function.
You can trigger garbage collection by calling the gc()
function, which also reports some memory usage statistics.
The basic tool for code timing is: system.time(commands)
tempfile() gives a unique file name in a temporary
writable directory that is deleted at the end of the session.
82. Programming (5)
Take control of your R code! RStudio is a free and open
source integrated development environment for R. You
can run it on your desktop (Windows, Mac, or Linux) :
Syntax highlighting, code completion, etc...
Execute R code directly from the source editor
Workspace browser and data viewer
Plot history, zooming, and flexible image & PDF export
Integrated R help and documentation
and more (http://www.rstudio.com/ide/)
84. Programming (7)
If we want to evaluate the quadratic x^2 - 2x + 4 many
times, we can write a function that evaluates it for a
specific value of x:
my.f <- function(x) { x^2 - 2*x + 4 }
my.f(3)
[1] 7
plot(my.f, -10, +10)
86. Programming (9)
We can find the minimum of the function using:
optimize(my.f, lower = -10, upper = 10)
$minimum
[1] 1
$objective
[1] 3
which says that the minimum occurs at x=1 and at that
point the quadratic has value 3.
87. Programming (10)
We can integrate the function over the interval -10 to
10 using:
integrate(my.f, lower = -10, upper = 10)
746.6667 with absolute error < 4.1e-12
which gives an answer together with an estimate of the
absolute error.
89. Publication-Quality Output (1)
Research doesn’t end when the last statistical analysis
is completed. We need to include the results in a
report. The xtable() function converts an R object to an
xtable object, which can then be printed as a LaTeX table.
LaTeX is a document preparation system for high-
quality typesetting (http://www.latex-project.org).
library(xtable)
print(xtable(model))
91. Publication-Quality Output (3)
The ggplot2 package is an elegant alternative to the base
graphics system; it has two complementary uses:
Producing publication-quality graphics using very
simple syntax that is similar to that of base graphics.
ggplot2 tends to make smart default choices for color,
scale, etc.
Making more sophisticated/customized plots that go
beyond the defaults.
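A minimal sketch of the simple syntax, assuming the ggplot2 package is installed (the particular plot is our illustration, not the slide's):

```r
library(ggplot2)
# Scatter plot of mtcars with a fitted regression line
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
print(p)  # ggplot objects are drawn when printed
```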
93. Final words!
How Large is Your Family?
How many brothers and sisters are there in your family
including yourself? The average number of children in
families was about 2. Can you explain the difference
between this value and the class average?
Birthday Problem!
The problem is to compute the approximate
probability that in a room of n people, at least two
have the same birthday.
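One way to sketch the exact calculation in R (assuming 365 equally likely birthdays; the function name is ours):

```r
# P(at least two of n people share a birthday):
# 1 minus the probability that all n birthdays differ
birthday <- function(n) 1 - prod((365 - seq_len(n) + 1) / 365)
birthday(23)  # about 0.507 -- already past 50% with only 23 people
```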
c() function is short for concatenate
b > 4
[1] FALSE FALSE FALSE TRUE TRUE
For complex conditions you can use logical operators, where:
! indicates logical negation (NOT)
& indicates logical AND
| indicates logical OR
The %in% operator searches through all of the entries in the object
help.start() Launch R HTML documentation
# Argument list of a function
args(read.csv)
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)
NULL
# User comments
R doesn't provide multiline or block comments; you must start each line of a multiline comment with #.
For debugging purposes, you can also surround code that you want the interpreter to ignore with the statement if (FALSE) { … }
Packages -> Install package(s)… Select a CRAN mirror, then browse all available packages on the CRAN repository.
installed.packages() # List of all currently installed packages
install.packages("ggplot2") # Install package ggplot2 from a CRAN mirror
Note: if you would like to be sure that you execute the function from a specific package, you can use the full name like this: package::function()
You can use the row.names optional parameter to specify one variable (i.e. a column name in your imported data set) to represent the row identifier (like plot #).
Set and get the working directory:
setwd("/path/to/your/directory")
getwd()
Note: setwd() won't create a directory that doesn't exist. If necessary, use the dir.create() function to create the new directory first, then call setwd().
Read and execute R code from an external file:
source("filename.R")
The detach() function removes the data frame from the search path; it does nothing to the data frame itself. This function is optional but is good programming practice and should be included routinely (see also the with() function).
# List and remove objects:
ls()
rm(VAR1, VAR2)
rm(list = ls())
# How to add one more calculated column to your data frame:
data <- transform(data, RYD=SYD/BYD)
# Example of working with dates in R:
startday <- as.Date("2002-08-15")
today <- Sys.Date()
days <- today - startday
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form a vector.
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("one", "two", "three")
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
A matrix is a two-dimensional array where each element has the same mode (numeric, character, or logical). Matrices are created with the matrix function.
y <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE)
Arrays are similar to matrices but can have more than two dimensions. They are created with the array function.
z <- array(1:24, c(2, 3, 4))
A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.).
data <- data.frame(a, b, c, y, z) # note: data.frame() needs columns of equal (or recyclable) length
# XML
library(XML)
cdCatalog <- xmlToDataFrame("http://www.w3schools.com/xml/cd_catalog.xml")
countryCdCatalog <- split(cdCatalog, cdCatalog$COUNTRY)
class(countryCdCatalog)
[1] "list"
names(countryCdCatalog)
[1] "EU" "Norway" "UK" "USA"
countryCdCatalog$EU
Note: be sure that any missing data are properly coded as missing before analyzing the data, or the results will be meaningless. For example, if the value -999 refers to a missing observation in your Yield data, you can fix it using the following command:
x[x == -999] <- NA
mean(x, trim=0.05, na.rm=TRUE)
Provides the trimmed mean, dropping the highest and lowest 5 percent of scores as well as any missing values.
The summary() function will return frequencies for factors and logical vectors.
sqrt(x) is the same as x^(0.5)
Functions of the form is.datatype() return TRUE or FALSE, while functions of the form as.datatype() convert the argument to that type.
Data types: numeric, character, logical, vector, factor, matrix, array, data.frame
The long/detailed way to calculate sd (i.e. standard deviation):
n <- length(x)
x.mean <- sum(x) / n
ss <- sum((x - x.mean)^2)
x.sd <- sqrt(ss / (n - 1))
signif(24+pi/100, digits=6) # returns 24.0314 (i.e. round x to the specified number of significant digits)
Sequence generation: seq(from, to) or seq(from, to, by)
The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset (this can be useful in the bootstrapping technique):
sample(x, n, replace=FALSE)
sample(c("H", "T"), 10, replace=TRUE, prob=c(0.53, 0.47))
Note: to ensure that all trainees will get the same randomization if they run the code on their own machines, you may use: set.seed(123)
x <- c(1, 4, 9, 16, 25, 36)
diff(x) # returns c(3, 5, 7, 9, 11)
Combine R objects by rows (i.e. rbind) or columns (i.e. cbind):
X <- c(0, 1, 2, 3, 4)
Y <- c(5, 6, 7, 8, 9)
XY <- cbind(X, Y)
Back transformation:
log(x) vs. exp(x)
log10(x) vs. 10^x
sqrt(x) vs. x^2
# Scales values of x (to a mean of 0 and sd of 1).
# To only center data, use scale=FALSE
# To only scale (divide by sd), use center=FALSE
scale(x, center=TRUE, scale=TRUE)
# The appropriate representation of values such as
# infinity and not a number (NaN) is provided
x <- 1/0 # Inf
-x # -Inf
x-x # NaN
1/Inf # 0
# Classical example showing the numerical computing problem
a <- sqrt(2)
a*a == 2 # FALSE
a*a - 2 # 4.440892e-16
# set the random seed to ensure that you will get the same values
set.seed(123)
# generate a 6 x 5 matrix containing random normal variates
mydata <- matrix(rnorm(30), nrow=6)
# calculate trimmed column means (in this case, means based on the middle 60%
# of the data, with the bottom 20 percent and top 20 percent of values discarded)
apply(mydata, 2, mean, trim=0.2)
substr(month.name, 2, 3)
[1] "an" "eb" "ar" "pr" "ay" "un" "ul" "ug" "ep" "ct" "ov" "ec"
paste("*", month.name[1:4], "*", sep=" ")
[1] "* January *" "* February *" "* March *" "* April *"
letters[1:4]
[1] "a" "b" "c" "d"
LETTERS[1:4]
[1] "A" "B" "C" "D"
sub("\\s", ".", "Hello World") # returns "Hello.World"
strsplit("Hello World", "\\s+") # returns a list whose single element is a vector of two strings
strsplit(month.name, c("a", "e")) # the recycling rule
# search for a regular expression pattern in the string and return matching indices
x <- regexpr("pattern", "string", perl=TRUE)
Whereas pie charts are ubiquitous in the business world, they are denigrated by most statisticians, who recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume.
To add colors to your categorized boxplot you can try this:
plot(cyl, mpg, col=rainbow(nlevels(cyl)))
Other vectors of contiguous colors include: heat.colors(), terrain.colors(), topo.colors(), and cm.colors()
For gray levels you can use something like this: gray(0:n/n) where n <- nlevels(cyl)
Mathematical symbols: you can use the expression function to display text that may contain mathematical symbols (i.e. use it in xlab, ylab, main, etc.):
expression(frac(mu, sqrt(2*pi*sigma^2)))
The log parameter in the plot function indicates whether or which axes should be plotted on a logarithmic scale: log="x", log="y", or log="xy" for a log x-axis, a log y-axis, or both, respectively.
The tck option in the plot function enables you to define the length of tick marks as a fraction of the plotting region (a negative number is outside the graph, a positive number is inside, 0 suppresses ticks, and 1 creates gridlines); the default is -0.01.
The mtext() function places text in one of the four margins. The format is:
mtext("text to place", side=n, line=m, ...)
where side defines which margin to place the text in (1=bottom, 2=left, 3=top, 4=right), while line indicates the line in the margin, starting with 0 (closest to the plot area) and moving out.
To create a plot based on probability densities rather than frequencies:
hist(qsec, col="gray", probability = TRUE)
lines(density(qsec), col = "red", lwd = 3)
You can define how many breaks there are in your histogram using the breaks option (i.e. breaks=20).
You can draw a standalone density plot (one that is not superimposed on another graph) using the following command:
plot(density(qsec))
The boxplot summarizes a great deal of information very clearly. The horizontal line shows the median. The bottom and top of the box show the 25th and 75th percentiles, respectively. The vertical dashed lines show one of two things: either the maximum value or 1.5 times the interquartile range of the data (roughly 2 standard deviations). Points more than 1.5 times the interquartile range above or below the box are defined as outliers and plotted individually.
Boxplots can be created for variables by group using a formula instead of the name of a variable alone, for example y ~ A (a separate boxplot for numeric variable y is generated for each value of categorical variable A), while the formula y ~ A*B would produce a boxplot for each combination of levels in categorical variables A and B.
quantile(qsec)
     0%     25%     50%     75%    100%
14.5000 16.8925 17.7100 18.9000 22.9000
quantile(qsec, probs=seq(0, 1, 0.1))
    0%    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
14.500 15.534 16.734 17.020 17.340 17.710 18.180 18.607 19.332 19.990 22.900
You can display the relationship between three quantitative variables using a 2D scatter plot, with the size of the plotted point representing the value of the third variable. This approach is referred to as a bubble plot. You want the areas, rather than the radii, of the circles to be proportional to the values of the third variable. Given the formula for the radius of a circle, r = sqrt(a/pi), the proper call is:
r <- sqrt(disp[1:10]/pi)
symbols(wt[1:10], mpg[1:10], circles=r, inches=0.30, fg="white", bg="lightblue", main="Bubble Plot with point size proportional to disp", xlab="Weight (lb/1000)", ylab="Miles/(US) gallon")
text(wt[1:10], mpg[1:10], rownames(mtcars[1:10,]), cex=0.6)
# 3D graph code/equation
require(lattice)
g <- expand.grid(x = seq(-10, 10, 0.1), y = seq(-10, 10, 0.1))
g$z <- cos(sqrt(g$x^2 + g$y^2))*(1/(g$x^2 + g$y^2)^(1/3))
wireframe(z ~ x * y, data = g, scales = list(arrows = FALSE), shade = TRUE)
r = cov(x, y) / (sdx * sdy)
r is called the correlation coefficient; the numerator is the covariance, and the two terms in the denominator are the standard deviations of x and y.
cov(x, y) = [1 / (n - 1)] ∑ (x - meanx)(y - meany)
where n is the number of observations.
Note: you can also examine all bi-variate relationships in a given data frame in one go using: cor(data)
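These formulas can be checked directly against R's built-in cov() and cor() (mtcars is used purely for illustration):

```r
x <- mtcars$wt; y <- mtcars$mpg
n <- length(x)
# covariance and correlation by hand, from the formulas above
cov.manual <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r.manual   <- cov.manual / (sd(x) * sd(y))
all.equal(cov.manual, cov(x, y))  # TRUE
all.equal(r.manual, cor(x, y))    # TRUE
```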
Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. In scientific notation, all numbers are written in the form a times ten raised to the power of b, where the exponent b is an integer and the coefficient a is any real number:
1.294e-10 = 1.294 * 10^-10 = 1.294 / 10^10 = 0.0000000001294
"Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other.
Polynomial regression (i.e. y = a + b*x + c*x^2):
quadratic <- lm(y ~ x + I(x^2))
summary(quadratic)
# to plot it (sort x so the fitted curve is drawn in order):
plot(x, y)
x2 <- sort(x)
y2 <- fitted(quadratic)[order(x)]
lines(x2, y2)
Mathematical functions can be used in formulas. For example, log(y) ~ x + z + w would predict log(y) from x, z, and w.
y ~ log(x) + sin(z) would predict y = a + b * log(x) + c * sin(z)
To add a label for each data point in the graph:
text(wt, mpg, row.names(mtcars), cex=0.5, pos=4, col="red")
Change font name and font size:
par(family="serif", ps=12)
Using the identify() function, you can label selected points in a scatter plot with their row number or row name using your mouse:
identify(wt, mpg, labels=rownames(mtcars))
The cursor will change from a pointer to a crosshair. Clicking on scatter plot points will label them until you select Stop from the Graphics Device menu or right-click on the graph and select Stop from the context menu.
# Confidence and prediction bands:
x <- seq(min(wt), max(wt), length=100)
p <- predict(fit, data.frame(wt=x), interval='prediction')
lines(x, p[,2], col='red')
lines(x, p[,3], col='red')
p <- predict(fit, data.frame(wt=x), interval='confidence')
lines(x, p[,2], col='red', lty=2)
lines(x, p[,3], col='red', lty=2)
In fact, dropping some observations (outliers) produces a better model fit. But you need to be careful when deleting data. Your models should fit your data, not the other way around! In other cases, the unusual observation may be the most interesting thing about the data you have collected.
"Which variables are most important in predicting the outcome?" You implicitly want to rank-order the predictors in terms of relative importance. There have been many attempts to develop a means for assessing the relative importance of predictors. The simplest has been to compare standardized regression coefficients, which describe the expected change in the response variable (expressed in standard deviation units) for a standard deviation change in a predictor variable, holding the other predictor variables constant.
Reference: Listing 8.16 in the "R in Action" book, the relweights() function for calculating the relative importance of predictors.
LD50 is the median lethal dose of a toxic substance, i.e. the dose of a chemical which kills half the members of a tested population. Basically, what we have is a predictor that is the dose of a chemical and a binary response variable that indicates whether the individual dies or not. The data consist of numbers dead and initial batch size for several doses (e.g. pesticide application), and we wish to know what dose kills 50% of the individuals.
dead <- c( 0, 10, 16, 53, 76, 83)
dose <- c( 1, 2, 3, 5, 10, 20)
batch <- c(85, 85, 85, 85, 85, 85)
y <- cbind(dead, batch-dead)
model <- glm(y ~ dose, binomial)
plot(dose, dead/batch)
xv <- seq(0, 20, 0.1)
yv <- predict(model, list(dose=xv), type="response")
lines(xv, yv)
Predict doses for a binomial assay model: the function dose.p from the MASS library is run with the model object, specifying the proportions killed:
library(MASS)
dose.p(model, p=c(0.5, 0.9, 0.95))
library(boot)
# function to obtain the regression coefficients from the data
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select a sample
  fit <- lm(formula, data=d)
  return(coef(fit)) # bootstrapping several statistics
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq, R=1000,
                formula=mpg~wt+disp, parallel="multicore") # Linux
# view results
results
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
boot.ci(results, type="bca", index=2) # wt
boot.ci(results, type="bca", index=3) # disp
Un-paired case:
ab.t = (mean(a)-mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
Paired case:
ab.t = mean(a-b) / sqrt(var(a-b) / length(a-b))
Luckily, most numeric functions have a na.rm=TRUE option that removes missing values prior to calculations and applies the function to the remaining values.
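The un-paired formula matches the t.test() statistic exactly (Welch is the default); a quick check using the same mtcars groups as the t-test slides (a = 4-cylinder, b = 6-cylinder qsec values):

```r
a <- mtcars$qsec[mtcars$cyl == 4]
b <- mtcars$qsec[mtcars$cyl == 6]
# the hand formula for the un-paired t statistic
ab.t <- (mean(a) - mean(b)) / sqrt(var(a)/length(a) + var(b)/length(b))
ab.t  # 1.4136, as reported on the t-test slide
all.equal(ab.t, unname(t.test(a, b)$statistic))  # TRUE
```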
(a) Test the equality of variances assumption: if ev > 0.05 (where ev is the p-value of the variance test), we have to use the var.equal=TRUE option in t.test; otherwise use var.equal=FALSE (the default value).
(b) Test the normality assumption: if an < 0.05 or bn < 0.05 (the normality-test p-values for the two samples), then we have to use wilcox.test instead of t.test.
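Putting rules (a) and (b) together, a sketch assuming a and b are two numeric samples and ev, an, bn come from var.test and shapiro.test:

```r
ev <- var.test(a, b)$p.value   # F test for equality of variances
an <- shapiro.test(a)$p.value  # normality of sample a
bn <- shapiro.test(b)$p.value  # normality of sample b

if (an < 0.05 || bn < 0.05) {
  wilcox.test(a, b)                      # non-parametric alternative
} else {
  t.test(a, b, var.equal = (ev > 0.05))
}
```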
You can turn your frequency table into proportions using the prop.table() function: prop.table(myTable)
You can transpose a matrix using: t(myTable)
Note: attributes can be attached to any R object; all attributes can be retrieved using the attributes function, or any particular attribute can be accessed or modified using the attr function. A matrix is represented as an object/vector of data with a "dim" attribute; in this example there is also an extra attribute called "dimnames".
rownames(myTable) <- c("Automatic", "Manual") # see also the colnames function
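A small worked example of these functions, using the built-in mtcars data:

```r
myTable <- table(mtcars$am, mtcars$cyl)   # transmission type by cylinders
rownames(myTable) <- c("Automatic", "Manual")

prop.table(myTable)      # cell proportions over the whole table
prop.table(myTable, 1)   # row proportions
t(myTable)               # transposed table
attributes(myTable)      # $dim, $dimnames and $class
```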
The test is not applicable if the expected count for any of the cells is less than 5. R will warn you if this is the case and suggest that the validity of the test results is questionable.
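A sketch of such a case with a made-up 2 x 2 table; fisher.test, an exact test, is a common alternative when expected counts are small:

```r
tab <- matrix(c(3, 7, 9, 2), nrow = 2)
chisq.test(tab)   # warns that the chi-squared approximation may be incorrect
fisher.test(tab)  # exact test, valid for small expected counts
```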
Mosaic plots are the Swiss Army knife of categorical data displays. Whereas bar charts are stuck within their univariate limits, mosaic plots and their variants open up the powerful visualization of multivariate categorical data.
# If you import data from a Turkish Excel file, use dec=","
# The data file name is "2F RCB.csv"
data <- read.csv(file.choose(), header=TRUE, sep=";", dec=".")
attach(data)

You can check factor levels using this function: levels(x)
You can check the number of factor levels using this function: nlevels(x)
Note: by default, character variables are converted to factors when importing data; to suppress this, include the option stringsAsFactors=FALSE in the read.table function.
Note: you can undo the effect of the factor function by using the as.numeric(x) or as.character(x) function, depending on the vector data type.
Attributes can be attached to any R object; all attributes can be retrieved using the attributes function, or any particular attribute can be accessed or modified using the attr function. A factor is represented as an object/vector of data with two extra attributes: $levels, with a list of the distinct values, and $class, equal to "factor".
In the factor function, if the ordered argument is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.
A graph of yield vs. S at the 3 levels of N seems to indicate a classical nutrient response interaction: no response to S at 0 N, contrasted by a strong response to S when N is not limiting.
Experiment design: two-factor factorial RCBD in 3 reps.
Treatments, 3 x 4 factorial: 12 treatments of all possible combinations of two factors, nitrogen N (3 levels: 0, 180 and 230 kg/ha) and sulphur S (4 levels: 0, 10, 20 and 40 kg/ha).
Broad-sense heritability (h2) of the trait is the ratio of genetic variability (σ2g) to phenotypic variability (σ2g + σ2e). Generally, estimation of the variance components is based on the ANOVA table:
σ2e = Residual Mean Sq
σ2g = (Genotypes Mean Sq - Residual Mean Sq) / Replications
Thus an estimate of heritability is: h2 = (VR - 1) / (VR + Replications - 1), where VR is the variance ratio for genotypes.

aov.table <- summary(model)[[1]]
svr <- aov.table$"F value"[2]
h2 <- (svr - 1) / (svr + nlevels(Rep) - 1)
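The variance-component formulas above can also be sketched in code; this assumes the model object comes from something like aov(Yield ~ Rep + Genotype), so that Genotype is the second row and Residuals the third row of the ANOVA table (Yield, Rep and Genotype are hypothetical names):

```r
aov.table <- summary(model)[[1]]
s2e <- aov.table$"Mean Sq"[3]                         # residual variance
s2g <- (aov.table$"Mean Sq"[2] - s2e) / nlevels(Rep)  # genotypic variance
h2  <- s2g / (s2g + s2e)  # algebraically equal to (VR - 1)/(VR + Reps - 1)
```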
Symbols commonly used in R formulas:
~   Separates the response on the left from the explanatory variables on the right
+   Separates explanatory variables
:   Denotes an interaction between predictor variables
*   A shortcut for denoting all possible interactions
^   Denotes interactions up to a specified degree. The code y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w
.   A placeholder for all other variables in the data frame except the dependent variable
-   A minus sign removes a variable from the equation. For example, y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w
-1  Suppresses the intercept. For example, the formula y ~ x - 1 fits a regression of y on x, and forces the line through the origin at x = 0
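You can check how R expands a formula, without fitting anything, using the terms() function:

```r
attr(terms(y ~ (x + z + w)^2), "term.labels")
# "x" "z" "w" "x:z" "x:w" "z:w"
```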
The order in which the effects appear in a formula matters only when:
* There's more than one factor and the design is unbalanced (i.e. the model y~A*B will not produce the same results as the model y~B*A)
* Covariates are present (covariates should be listed first, followed by main effects, followed by two-way interactions, and so on)
PCA was invented in 1901 by Karl Pearson. Principal component analysis is a variable-reduction procedure. It is useful when you have obtained data on a number of variables (possibly a large number of variables) and believe that there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.

Thus, the objective of principal component analysis is to reduce the dimensionality of the data set and to identify meaningful underlying variables. It is more useful as a visualization tool than as an analytical method. The basic idea in PCA is to find the components that explain the maximum amount of variance in the original variables using a few linearly transformed, uncorrelated components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
(important) The scale option in the prcomp function:
* TRUE: PCA based on the correlation matrix
* FALSE: PCA based on the covariance matrix (default)

PC1 = - 0.059 * wt - 0.832 * disp - 0.406 * hp + 0.369 * mpg + 0.062 * qsec
PC2 = - 0.050 * wt + 0.475 * disp - 0.832 * hp + 0.122 * mpg + 0.255 * qsec
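A minimal sketch of how such loadings are obtained (column names from the mtcars dataset; the signs of the loadings may be flipped on your machine):

```r
pc <- prcomp(mtcars[, c("wt", "disp", "hp", "mpg", "qsec")], scale. = TRUE)
summary(pc)   # proportion of variance explained by each component
pc$rotation   # loadings: the coefficients of the PC equations above
```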
The angles between biplot vectors (arrows going from the origin to the factor loading coordinates) clearly show the relationships between the attributes measured during the trial (the cosine of the angle between any 2 vectors approximates their correlation). To estimate the level of any variable in any genotype, draw a perpendicular line from the genotype score to the biplot vector of interest.
Reading from the biplot we can summarize as follows: while cars 25 and 30 have roughly the same level of "hp", car 25 has a much higher "disp" than car 30.
Note: PC1 explains > 89% of the variance in the dataset, leaving much less for PC2 to explain.
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of:
* In the "single" linkage method, the distance between clusters is taken as the distance between the closest neighbors, while in "complete" linkage the distance between the farthest neighbors determines the distance between clusters.
* "average" linkage defines the distance between two clusters as the average distance between all pairs of items where one member of the pair belongs to cluster 1 and the other member belongs to cluster 2.
* In "centroid" linkage, the distance between clusters is defined as the distance between the centers of the clusters. Thus, groups once formed are represented by their mean values for each variable, that is, by their mean vector, and the inter-cluster distance is the distance between two such mean vectors.
* In the "ward" method, at each step in the analysis the union of every possible pair of clusters is considered, and the two clusters whose fusion results in the minimum increase in information loss are combined. Ward defined information loss in terms of an error sum of squares (ESS) criterion.
The standard error of the mean can be estimated using the formula sd/√n, where sd is the standard deviation of the sample and n is the number of observations.
The function first assesses whether missing values (values of NA) should be removed, based on the value of na.rm supplied by the function user. If the function is called with na.rm=TRUE, the is.na() function is used to deselect such values before the standard deviation and length are calculated using the sd() and length() functions. Finally, the standard error of the mean is calculated and returned.
Note: you can use an explicit return command, or the value returned by the function will be the value of the last statement executed.
You can define your own operator of the form %any% using any text string in place of any. The function should be a function of two arguments.
"%p%" <- function(x, y) paste(x, y, sep=" ")
"Hi" %p% "Khaled" # "Hi Khaled"
To combine more than one value in the returned result:
result <- list(xname=x, yname=y)
return(result)
Note: in this case, if value is returned you can check value$xname and value$yname, or value[["xname"]] and value[["yname"]]. If a <- "xname", then value$a will not work, while value[[a]] will work.
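The function described above might be sketched like this (the name se is my own choice):

```r
se <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # deselect missing values
  sd(x) / sqrt(length(x))        # standard error of the mean
}
se(c(2, 4, 4, 4, 5, 5, 7, 9))
se(c(2, 4, NA, 9), na.rm = TRUE)
```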
In the sink() function you can also:
* use the option append=TRUE to append text to the file rather than overwriting it.
* use the option split=TRUE to send output to both the screen and the file.

Data output:
write.table(DATA, "data.csv", quote = FALSE, row.names = TRUE, sep = ",")

In addition to jpeg(), you can use the functions pdf(), win.metafile(), png(), bmp(), tiff(), xfig(), and postscript() to save graphics in other formats.
You can run an R script file non-interactively and send the output to another file:
R CMD BATCH [options] script.R [out-file]
The sum function, for example, is a primitive function written in the C language for performance reasons and cannot be viewed in this manner, while the cor function, like most R functions, is written in R itself.
The ifelse construct is a compact and vectorized version of the if-else construct:
y <- ifelse(x < 0, 0, log(x))
An error is raised by a call to stop("your message"); a warning is raised by a call to warning("your message").
class(mtcars)       # "data.frame"
typeof(mtcars)      # "list"
object.size(mtcars) # 5336 bytes
str(mtcars)
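Errors and warnings raised this way can be handled with the tryCatch() function; a small sketch (safe_log is a made-up helper):

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) { message("caught: ", conditionMessage(w)); NA },
    error   = function(e) { message("caught: ", conditionMessage(e)); NA }
  )
}
safe_log(-1)   # log(-1) raises a warning (NaNs produced), so NA is returned
safe_log("a")  # a non-numeric argument raises an error, so NA is returned
```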
Sweave is a tool that allows you to embed the R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically if the data or the analysis change. To learn more about Sweave, visit the Sweave home page (www.stat.uni-muenchen.de/~leisch/Sweave/). To learn more about LaTeX, you can start here: http://www.latex-project.org/intro.html
You can use the TeXworks software to render the LaTeX tags into PDF format. TeXworks lowers the entry barrier to the TeX world; it is also a free and open source package, and you can get it from: http://www.tug.org/texworks/
To get a valid render, the xtable LaTeX output should be inserted into a LaTeX document template, such as the following simple one:
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[frenchb]{babel}
\begin{document}
% Your LaTeX goes here
\end{document}
The format is:
qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=)
where the parameters/options are defined below:
alpha - Alpha transparency for overlapping elements, expressed as a fraction between 0 (complete transparency) and 1 (complete opacity).
data - Specifies a data frame.
main, sub - Character vectors specifying the title and subtitle.
x, y - Specifies the variables placed on the horizontal and vertical axes. For univariate plots (for example, histograms), omit y.
xlab, ylab - Character vectors specifying the horizontal and vertical axis labels.
xlim, ylim - Two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes, respectively.
color, shape, size, fill - Associates the levels of a variable with symbol color, shape, or size. For line plots, color associates levels of a variable with line color. For density and box plots, fill associates fill colors with a variable. Legends are drawn automatically.
facets - Creates a trellis graph by specifying conditioning variables. Its value is expressed as rowvar ~ colvar (see the example in figure 16.10). To create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar.
geom - Specifies the geometric objects that define the graph type. The geom option is expressed as a character vector with one or more entries. geom values include "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter".
method, formula - If geom="smooth", a loess fit line and confidence limits are added by default. When the number of observations is greater than 1,000, a more efficient smoothing algorithm is employed. Methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression. The formula parameter gives the form of the fit. For example, to add simple linear regression lines, you'd specify geom="smooth", method="lm", formula=y~x.
Changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.
library(ggplot2)
attach(mtcars)
am <- factor(am, labels=c("automatic", "manual"))
qplot(wt, mpg, shape=20, color=am,
      main="1974 Motor Trend US magazine (Piece of cake!)",
      xlab="Weight (lb/1000)", ylab="Miles/(US) gallon",
      geom=c("point", "smooth"), method="lm")
How Large is Your Family?
The reason that the estimate is wrong is that families with 0 children could not have sent any child to the class! So the calculated average comes from a random sample drawn by child, not by family. In this case, families with a large number of children are sampled more often - one time for each child.

The Birthday Problem!
P′(n) = 1 × (1 − 1/365) × (1 − 2/365) × ... × (1 − (n − 1)/365)
The equation expresses the fact that the first person has no one with whom to share a birthday, the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as the first two (363/365), and in general the nth birthday cannot be the same as any of the n − 1 preceding birthdays.
The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability P(n) is:
P(n) = 1 − P′(n)
This probability surpasses 1/2 for n = 23 (with a value of about 50.7%). For more information:
http://en.wikipedia.org/wiki/Birthday_problem
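The formula above is a one-liner in R:

```r
# probability that at least two of n people share a birthday
p.birthday <- function(n) 1 - prod(1 - (0:(n - 1)) / 365)
p.birthday(23)   # about 0.507
```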
Please don't hesitate to contact us if you have any questions, comments or feedback related to this session <khaled.alshamaa@gmail.com>
Japanese attitude for work: If one can do it, I can do it. If no one can do it, I must do it.
Middle Eastern attitude for work: Wallahi … if one can do it, let him do it. If no one can do it, ya-habibi how can I do it?