R Language Introduction
Khaled El-Sham’aa




                    1
Session Road Map
 First Steps
 Importing Data into R
 R Basics
 Data Visualization
 Correlation & Regression
 t-Test
 Chi-squared Test
 ANOVA
 PCA
 Clustering
 Time Series
 Programming
 Publication-Quality output




                                                            2
First Steps (1)
 R is one of the most popular platforms for data
 analysis and visualization currently available. It is
 free and open source software:
             http://www.r-project.org

 Take advantage of its broad coverage and the availability of new,
 cutting-edge applications and techniques.

 R will enable us to develop and distribute solutions
 to our NARS with no hidden license cost.
                                                         3
First Steps (2)




                  4
First Steps (3)
5 * 4
[1] 20

a <- (3 * 7) + 1
a
[1] 22

b <- c(1, 2, 3, 5, 8)
b * 2
[1] 2 4 6 10 16

b[4]
[1] 5

b[1:3]
[1] 1 2 3

b[c(1,3,5)]
[1] 1 3 8

b[b > 4]
[1] 5 8

                                      5
First Steps (4)
 citation()

 R Development Core Team (2009). R: A language and
 environment for statistical computing. R Foundation
 for Statistical Computing, Vienna, Austria. ISBN
 3-900051-07-0, URL http://www.R-project.org.




                                                       6
First Steps (5)
 If you know the name of the function you want help
 with, you just type a question mark ? at the command
 line prompt followed by the name of the function:
 ?read.table




                                                        7
First Steps (6)
 Sometimes you cannot remember the precise name of
 the function, but you know the subject on which you
 want help. Use the help.search function with your
 query in double quotes like this:
 help.search("data input")




                                                       8
First Steps (7)
 To see a worked example, just pass the function name to example():
example(mean)

mean> x <- c(0:10, 50)

mean> xm <- mean(x)

mean> c(xm, mean(x, trim = 0.10))
[1] 8.75 5.50

mean> mean(USArrests, trim = 0.2)
  Murder Assault UrbanPop      Rape
    7.42   167.60    66.20    20.16
                                                         9
First Steps (8)
 There are hundreds of contributed packages for
  R, written by many different authors (to implement
  specialized statistical methods). Most are available for
  download from CRAN (http://CRAN.R-project.org)

 List all available packages:    library()
 Load package “ggplot2”:         library(ggplot2)
 Documentation on package “ggplot2”: library(help=ggplot2)
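
 A contributed package has to be installed once before it can be loaded;
 a minimal sketch, using ggplot2 as the example and assuming an internet
 connection:

 install.packages("ggplot2")   # one-time download from CRAN
 library(ggplot2)              # load it in each new session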




                                                             10
Importing Data into R (1)
 data <- read.table("D:/path/file.txt", header=TRUE)

 data <- read.csv(file.choose(), header=TRUE, sep=";")

 data <- edit(data)

 fix(data)

 head(data)

 tail(data)


 tail(data, 10)



                                                          11
Importing Data into R (2)
 In order to refer to a vector by name within an R session, you need to
 attach the dataframe containing the vector. Alternatively, you can refer
 to the dataframe name and the vector name within it, using the element
 name operator $, like this: mtcars$mpg

 ?mtcars

 attach(mtcars)

 mpg
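
 As a side note not on the original slide, with() is a base-R alternative
 that evaluates an expression inside a data frame without attaching it:

 with(mtcars, mean(mpg))   # same as mean(mtcars$mpg), no attach() needed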

                                                             12
Importing Data into R (3)




                            13
Importing Data into R (4)
# Read data left on the clipboard
data <- read.table("clipboard", header=T)

# ODBC
library(RODBC)
db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd")
raw <- sqlQuery(db1, "SELECT * FROM table1")

# XLSX
library(XLConnect)
xls <- loadWorkbook("my_file.xlsx", create=F)
raw <- as.data.frame(readWorksheet(xls,sheet='Sheet1'))
                                                    14
R Basics (1)
 max(x)           maximum value in x
 min(x)           minimum value in x
 mean(x)          arithmetic average of the values in x
 median(x)        median value in x
 var(x)           sample variance of x
 sd(x)            standard deviation of x
 cor(x,y)        correlation between vectors x and y
 summary(x)      generic function used to produce result summaries
  of various objects and model fits

                                                           15
R Basics (2)
 abs(x)           absolute value
   floor(2.718)   largest integer not greater than x
   ceiling(3.142) smallest integer not less than x
   asin(x)        inverse sine of x in radians
   round(2.718, digits=2)       returns 2.72

 x <- 1:12; sample(x)           Simple randomization
 RCBD randomization:
    RCBD <- replicate(3, sample(x))

                                                        16
R Basics (3)
Common Data Transformation:

Nature of Data                          Transformation     R Syntax

Measurements (lengths, weights, etc.)   log e              log(x)
                                        log 10             log(x, 10) or log10(x)
                                        log (x + 1)        log(x + 1)
Counts (number of individuals, etc.)    square root        sqrt(x)
Percentages (must be proportions)       arcsine            asin(sqrt(x))*180/pi


* where x is the name of the vector (variable) whose values are to be transformed.
                                                                                     17
R Basics (4)
 Vectorized computations:
 Any function call or operator applied to a vector automatically
 operates on every element of the vector.
 nchar(month.name) # 7 8 5 5 3 4 4 6 9 7 8 8
 The recycling rule:
 The shorter vector is replicated enough times so that the result has
 the length of the longer vector, and then the operator is applied
 (R warns when the longer length is not a multiple of the shorter one,
 as in this example).
 1:10 + 1:3        # 2   4   6   5   7    9   8 10 12 11

                                                           18
R Basics (5)
mydata <- matrix(rnorm(30), nrow=6)
mydata

# calculate the 6 row means
apply(mydata, 1, mean)

# calculate the 5 column means
apply(mydata, 2, mean)

apply(mydata, 2, mean, trim=0.2)
                                      19
R Basics (6)
 String functions:

substr(month.name, 2, 3)
paste("*", month.name[1:4], "*", sep=" ")

x <- toupper(dna.seq)
rna.seq <- chartr("T", "U", x)

comp.seq <- chartr("ACTG", "TGAC", dna.seq)

                                              20
R Basics (7)
 Surprisingly, the base installation doesn’t provide
  functions for skew and kurtosis, but you can add your
  own:

  m <- mean(x)
  n <- length(x)
  s <- sd(x)

  skew <- sum((x-m)^3/s^3)/n
  kurt <- sum((x-m)^4/s^4)/n - 3
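
  A minimal sketch wrapping the same formulas into reusable functions
  (the names my.skew and my.kurt are illustrative, not part of base R):

  my.skew <- function(x) {
      m <- mean(x); s <- sd(x)
      sum((x - m)^3 / s^3) / length(x)
  }

  my.kurt <- function(x) {
      m <- mean(x); s <- sd(x)
      sum((x - m)^4 / s^4) / length(x) - 3
  }

  my.skew(mtcars$mpg)   # e.g. skewness of the mpg variable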


                                                          21
Data Visualization (1)
 pairs() gives a matrix of scatter plots of every variable against
  every other:

  ?mtcars
  pairs(mtcars)

  Voilà!


                         22
Data Visualization (2)
pie(table(cyl))   barplot(table(cyl))




                                        23
Data Visualization (3)
 plot(x, y) gives a scatter plot if x is continuous, and a box-and-
  whisker plot if x is a factor. Some people prefer the alternative
  syntax plot(y ~ x):

  attach(mtcars)
  plot(wt, mpg)

  plot(cyl, mpg)

  cyl <- factor(cyl)
  plot(cyl, mpg)

                                                            24
Data Visualization (4)




                         25
Data Visualization (5)
 Histograms show a frequency distribution
 hist(qsec, col="gray")




                                             26
Data Visualization (6)
 boxplot(qsec, col="gray")

 boxplot(qsec, mpg, col="gray")




                                   27
Data Visualization (7)
XY <- cbind(LAT, LONG)
plot(XY, type='l')

library(sp)
XY.poly <- Polygon(XY)

XY.pnt <- spsample(XY.poly,
          n=8, type='random')

XY.pnt

points(XY.pnt)
                                28
Data Visualization (8)




                         29
Correlation and Regression (1)
 If you want to determine the significance of a
 correlation (i.e. the p value associated with the
 calculated value of r) then use cor.test rather than cor.

 cor(wt, mpg)
 [1] -0.8676594


 The value will vary from -1 to +1. A -1 indicates perfect
 negative correlation, and +1 indicates perfect positive
 correlation. 0 means no correlation.
                                                             30
Correlation and Regression (2)
cor.test(wt, qsec)
        Pearson's product-moment correlation

data: wt and qsec
t = -0.9719, df = 30, p-value = 0.3389
alternative hypothesis: true correlation is not
  equal to 0
95 percent confidence interval:
 -0.4933536 0.1852649
sample estimates:
       cor
-0.1747159
                                                  31
Correlation and Regression (3)
cor.test(wt, mpg)
        Pearson's product-moment correlation

data: wt and mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not
  equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor
-0.8676594
                                                  32
Correlation and Regression (4)




                                 33
Correlation and Regression (5)
 Fits a linear model with normal errors and constant
 variance; generally this is used for regression analysis
 using continuous explanatory variables.

  fit <- lm(y ~ x)
  summary(fit)
  plot(x, y)

  # Sample of multiple linear regression
  fit <- lm(y ~ x1 + x2 + x3)

                                                            34
Correlation and Regression (6)
Call:
lm(formula = mpg ~ wt)

Residuals:
    Min      1Q Median        3Q      Max
-4.5432 -2.3647 -0.1252   1.4096   6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851      1.8776 19.858 < 2e-16 ***
wt           -5.3445     0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,     Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

                                                                35
Correlation and Regression (7)
 The great thing about graphics in R is that it is
  extremely straightforward to add things to your plots.
  In the present case, we might want to add a regression
  line through the cloud of data points. The function for
  this is abline which can take as its argument the linear
  model object:
                        abline(fit)


 Note: the abline(a, b) form adds a straight line with an intercept
   of a and a slope of b
                                                         36
Correlation and Regression (8)
plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon")
abline(fit, col="blue", lwd=2)
text(4, 25, "mpg = 37.29 - 5.34 wt")




                                                    37
Correlation and Regression (9)
 predict() is a generic built-in function for making predictions from
   the results of various model-fitting functions:

  predict(fit, list(wt = 4.5))
  [1] 13.23500
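
  predict() for lm objects can also return interval estimates; a minimal
  sketch using the same fitted model:

  predict(fit, newdata = data.frame(wt = 4.5), interval = "confidence")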




                                                           38
Correlation and Regression (10)




                              39
Correlation and Regression (11)
 What do you do if you identify problems?

 There are four approaches to dealing with violations of
 regression assumptions:

   Deleting observations
   Transforming variables
   Adding or deleting variables
   Using another regression approach


                                                       40
Correlation and Regression (12)
 You can compare the fit of two nested models using
  the anova() function in the base installation. A nested
  model is one whose terms are completely included in
  the other model.
                            fit1 <- lm (y ~ A + B + C)
                            fit2 <- lm (y ~ A + C)
                            anova(fit1, fit2)


 If the test is not significant (i.e. p > 0.05), we conclude that B does
  not add to the linear prediction and we are justified in dropping it
  from our model.
                                                                41
Correlation and Regression (13)
# Bootstrap 95% CI for R-Squared
library(boot)

rsq <- function(formula, data, indices) {
    fit <- lm(formula, data= data[indices,])
    return(summary(fit)$r.square)
}

rs <- boot(data=mtcars, statistic=rsq, R=1000,
           formula=mpg~wt+disp)

boot.ci(rs, type="bca") # try print(rs) and plot(rs)
                                                       42
t-Test (1)
 Comparing two sample means with normal errors
 (Student’s t test, t.test)
 t.test(a, b)
 t.test(a, b, paired = TRUE)
 # alternative argument options:
 # "two.sided", "less", "greater"

 a <- qsec[cyl == 4]
 b <- qsec[cyl == 6]
 c <- qsec[cyl == 8]

                                                  43
t-Test (2)
t.test(a, b)
        Welch Two Sample t-test

data: a and b
t = 1.4136, df = 12.781, p-value = 0.1814
alternative hypothesis: true difference in means is
  not equal to 0
95 percent confidence interval:
 -0.6159443 2.9362040
sample estimates:
mean of x mean of y
 19.13727 17.97714
                                                      44
t-Test (3)
t.test(a, c)
        Welch Two Sample t-test

data: a and c
t = 3.9446, df = 17.407, p-value = 0.001005
alternative hypothesis: true difference in means is
  not equal to 0
95 percent confidence interval:
 1.102361 3.627899
sample estimates:
mean of x mean of y
 19.13727 16.77214
                                                      45
t-Test (4)
(a) Test the equality of variances assumption:

ev <- var.test(a, c)$p.value


(b) Test the normality assumption:

an <- shapiro.test(a)$p.value
cn <- shapiro.test(c)$p.value
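
If neither assumption is rejected (both p-values are large), the
classical pooled-variance t test can be requested explicitly; a minimal
sketch:

t.test(a, c, var.equal = TRUE)   # Student's t test with pooled variance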


                                                 46
Chi-squared Test (1)
Construct hypotheses based on qualitative (categorical) data:
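
In mtcars the am variable is coded 0/1, so the labelled table below
suggests it was first recoded as a factor; a minimal sketch of that
assumed step (not shown on the slide):

am <- factor(am, levels = c(0, 1), labels = c("automatic", "manual"))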

myTable <- table(am, cyl)

myTable
           cyl
am           4        6 8
  automatic 3         4 12
  manual     8        3 2
                                                            47
Chi-squared Test (2)
chisq.test(myTable)
        Pearson's Chi-squared test

data: myTable
X-squared = 8.7407, df = 2, p-value = 0.01265

The expected counts under the null hypothesis:

chisq.test(myTable)$expected
           cyl
am                4       6      8
  automatic 6.53125 4.15625 8.3125
  manual    4.46875 2.84375 5.6875
                                                 48
Chi-squared Test (3)
mosaicplot(myTable, color=rainbow(3))




                                        49
ANOVA (1)
 A method which partitions the total variation in the response into its
 components (sources of variation) is called the analysis of variance.

 table(N, S, Rep)

 N <- factor(N)
 S <- factor(S)
 Rep <- factor(Rep)


                                                         50
ANOVA (2)
 The best way to understand the interaction between the two factors is
 to plot it using interaction.plot like this:

interaction.plot(S, N, Yield)




                                51
ANOVA (3)
boxplot(Yield~N, col="gray")




                               52
ANOVA (4)
model <- aov(Yield ~ N * S)                    #CRD
summary(model)
            Df Sum Sq Mean Sq F value   Pr(>F)
N            2 4.5818 2.2909 42.7469 1.230e-08 ***
S            3 0.9798 0.3266 6.0944 0.003106 **
N:S          6 0.6517 0.1086 2.0268 0.101243
Residuals   24 1.2862 0.0536
---
Signif. codes:   0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1




                                                                  53
ANOVA (5)
par(mfrow = c(2, 2))

plot(model)

ANOVA assumptions:
 Normality
 Linearity
 Constant variance
 Independence

                       54
ANOVA (6)
model.tables(model, "means")
Tables of means
Grand mean
1.104722

N    0    180    230
0.6025 1.3142 1.3975

S    0     10     20     40
0.8289 1.1556 1.1678 1.2667

       S
N       0        10       20       40
    0   0.5600   0.7733   0.5233   0.5533
    180 0.8933   1.2900   1.5267   1.5467
    230 1.0333   1.4033   1.4533   1.7000
                                            55
ANOVA (7)
model.tables(model, se=TRUE)
.......
Standard errors for differences of means
             N      S    N:S
        0.0945 0.1091 0.1890
replic.     12      9      3



plot.design(Yield ~ N * S)




                                           56
ANOVA (8)
mc <- TukeyHSD(model, "N", ordered = TRUE); mc
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = Yield ~ N * S)

$N
              diff        lwr       upr     p adj
180-0   0.71166667 0.4756506 0.9476827 0.0000003
230-0   0.79500000 0.5589840 1.0310160 0.0000000
230-180 0.08333333 -0.1526827 0.3193494 0.6567397
                                                    57
ANOVA (9)
plot(mc)




            58
ANOVA (10)
summary(aov(Yield ~ N * S + Error(Rep)))             #RCB
Error: Rep
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 2 0.30191 0.15095

Error: Within
          Df Sum Sq Mean Sq F value   Pr(>F)
N          2 4.5818 2.2909 51.2035 5.289e-09 ***
S          3 0.9798 0.3266 7.3001 0.001423 **
N:S        6 0.6517 0.1086 2.4277 0.059281 .
Residuals 22 0.9843 0.0447
---
Signif. codes:   0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
                                                                  59
ANOVA (11)
 In a split-plot design, different treatments are applied
  to plots of different sizes. Each different plot size is
  associated with its own error variance.
 The model formula is specified as a factorial, using the
  asterisk notation. The error structure is defined in the
  Error term, with the plot sizes listed from left to
  right, from largest to smallest, with each variable
  separated by the slash operator /.
  model <- aov(Yield ~ N * S + Error(Rep/N))

                                                             60
ANOVA (12)
Error: Rep
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 2 0.30191 0.15095

Error: Rep:N
          Df Sum Sq Mean Sq F value  Pr(>F)
N          2 4.5818 2.29088 55.583 0.001206 **
Residuals 4 0.1649 0.04122

Error: Within
          Df Sum Sq    Mean Sq F value  Pr(>F)
S          3 0.97983   0.32661 7.1744 0.002280 **
N:S        6 0.65171   0.10862 2.3860 0.071313 .
Residuals 18 0.81943   0.04552
                                                    61
ANOVA (13)
 Analysis of Covariance:

 # f is a treatment factor
 # x is a variate acting as the covariate
 model <- aov(y ~ x * f)


 Split both main effects into linear and quadratic parts.

 contrasts <- list(N = list(lin=1, quad=2),
                   S = list(lin=1, quad=2))
 summary(model, split=contrasts)

                                                             62
PCA (1)
 The idea of principal components analysis (PCA) is to
 find a small number of linear combinations of the
 variables so as to capture most of the variation in the
 dataframe as a whole.

d2 <- cbind(wt, disp/10, hp/10, mpg, qsec)

colnames(d2) <- c("wt", "disp", "hp", "mpg", "qsec")
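
Because these variables are measured on quite different scales, a
variant worth trying (not shown on the slide) is to standardize them
inside prcomp; model.s is just an illustrative name:

model.s <- prcomp(d2, scale. = TRUE)   # scale each variable to unit variance first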




                                                           63
PCA (2)
model <- prcomp(d2)
model
Standard deviations:
[1] 14.6949595 3.9627722   2.8306355   1.1593717

Rotation:
             PC1         PC2        PC3           PC4
wt   -0.05887539 0.05015401 -0.07513271   -0.16910728
disp -0.83186362 0.47519625 0.28005113     0.04080894
hp   -0.40572567 -0.83180078 0.24611265   -0.28768795
mpg  0.36888799 0.12190490 0.91398919     -0.09385946
qsec 0.06200759 0.25479354 -0.14134625    -0.93710373
                                                    64
PCA (3)
summary(model)

Importance of components:
                          PC1    PC2     PC3
Standard deviation     14.6950 3.96277 2.83064
Proportion of Variance 0.8957 0.06514 0.03323
Cumulative Proportion   0.8957 0.96082 0.99405




                                                 65
PCA (4)
plot(model)   biplot(model)




                              66
Clustering (1)
 We define similarity on the basis of the distance
 between two samples in this n-dimensional space.
 Several different distance measures could be used to
 work out the distance from every sample to every other
 sample. This quantitative dissimilarity structure of the
 data is stored in a matrix produced by the dist function:

 rownames(d2) <- rownames(mtcars)

 my.dist <- dist(d2, method="euclidean")

                                                        67
Clustering (2)
 Initially, each sample is assigned to its own cluster, and
  then the hclust algorithm proceeds iteratively, at each
  stage joining the two most similar clusters, continuing
  until there is just a single cluster (see ?hclust for
  details).

  my.hc <- hclust(my.dist, "ward")
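
  To extract the cluster membership for a chosen number of groups,
  cutree() can be applied to the hclust object; a minimal sketch:

  groups <- cutree(my.hc, k = 4)   # assign each car to one of 4 clusters
  table(groups)                    # cluster sizes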




                                                           68
Clustering (3)
 We can plot the object called my.hc, specifying that the leaves of the
 hierarchy are labeled by their row names (here, the car names):

 plot(my.hc, hang=-1)

 g <- rect.hclust(my.hc, k=4, border="red")

 Note:
 When the hang argument is set to '-1' then all leaves
 end on one line and their labels hang down from 0.
                                                         69
Clustering (4)




                 70
Clustering (5)
 Partitioning into a number of clusters specified by the user.

gr <- kmeans(cbind(disp, hp), 2)

plot(disp, hp, col = gr$cluster, pch=19)

points(gr$centers, col = 1:2, pch = 8, cex=2)




                                                              71
Clustering (6)




                 72
Clustering (7)
K-means clustering with 2 clusters of sizes 18, 14

Cluster means:
      disp        hp
1 135.5389 98.05556
2 353.1000 209.21429

Clustering vector:
 [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1
  1 2 1 2 1

Within cluster sum of squares by cluster:
[1] 58369.27 93490.74
 (between_SS / total_SS = 75.6 %)
                                                             73
Clustering (8)
x <- as.matrix(mtcars)
heatmap(x, scale="column")




                             74
Time Series (1)
 First, make the data variable into a time series object

  # create time-series objects
  beer <- ts(beer, start=1956, freq=12)


 It is useful to be able to decompose a time series into its
  components. The function stl performs seasonal decomposition of a
  time series into seasonal, trend and irregular components using
  loess.


                                                            75
Time Series (2)
 The remainder component is the residuals from the
 seasonal plus trend fit. The bars at the right-hand side
 are of equal heights (in user coordinates).

 # Decompose a time series into seasonal,
 # trend and irregular components using loess
 ts.comp <- stl(beer, s.window="periodic")

 plot(ts.comp)


                                                        76
Time Series (3)




                  77
Programming (1)
 We can extend the functionality of R by writing a
 function that estimates the standard error of the mean

 SEM <- function(x, na.rm = FALSE) {
     if (na.rm == TRUE) VAR <- x[!is.na(x)]
     else VAR <- x
     SD <- sd(VAR)
     N <- length(VAR)
     SE <- SD/sqrt(N)
     return(SE)
 }
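
 A quick usage sketch on the attached mtcars data:

 SEM(mpg)
 SEM(c(mpg, NA), na.rm = TRUE)   # missing values are dropped first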
                                                      78
Programming (2)
 You can define your own operator of the form %any%
 using any text string in place of any. The function
 should be a function of two arguments.

 "%p%" <- function(x,y) paste(x,y,sep=" ")

 "Hi" %p% "Khaled"

 [1] "Hi Khaled"



                                                       79
Programming (3)
setwd("path/to/folder")
sink("output.txt")
  cat("Intercept t Slope")
  a <- fit$coefficients[[1]]
  b <- fit$coefficients[[2]]
  cat(paste(a, b, sep="\t"))
sink()

jpeg(filename="graph.jpg", width=600, height=600)
plot(wt, mpg); abline(fit)
dev.off()
                                               80
Programming (4)
 The code for R functions can be viewed, and in most cases modified if
 so desired, using the fix() function.

 You can trigger garbage collection by calling the gc() function, which
 also reports a few memory usage statistics.

 The basic tool for code timing is: system.time(commands)

 tempfile() gives a unique file name in a temporary writable directory
 that is deleted at the end of the session.
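
 A minimal timing sketch (the expression is arbitrary, purely for
 illustration):

 system.time(sort(rnorm(1e6)))   # time to generate and sort a million numbers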
                                                             81
Programming (5)
 Take control of your R code! RStudio is a free and open
 source integrated development environment for R. You
  can run it on your desktop (Windows, Mac, or Linux):

   Syntax highlighting, code completion, etc...
   Execute R code directly from the source editor
   Workspace browser and data viewer
   Plot history, zooming, and flexible image & PDF export
   Integrated R help and documentation
   and more (http://www.rstudio.com/ide/)

                                                             82
Programming (6)




                  83
Programming (7)
 If we want to evaluate the quadratic x^2 - 2x + 4 many times, we can
  write a function that evaluates it for a specific value of x:

  my.f <- function(x) { x^2 - 2*x + 4 }

  my.f(3)
  [1] 7

  plot(my.f, -10, +10)
                                                          84
Programming (8)




                  85
Programming (9)
 We can find the minimum of the function using:

  optimize(my.f, lower = -10, upper = 10)
  $minimum
  [1] 1
  $objective
  [1] 3


 which says that the minimum occurs at x=1 and at that
 point the quadratic has value 3.
                                                         86
Programming (10)
 We can integrate the function over the interval -10 to
 10 using:

  integrate(my.f, lower = -10, upper = 10)
  746.6667 with absolute error < 4.1e-12


 which gives an answer together with an estimate of the
 absolute error.


                                                           87
Programming (11)
plot(my.f, -15, +15)

v <- seq(-10,10,0.01)

x <- c(-10,v,10)
y <- c(0,my.f(v),0)

polygon(x, y,
       col='gray')
                        88
Publication-Quality Output (1)
 Research doesn't end when the last statistical analysis is completed.
  We need to include the results in a report. The xtable function
  converts an R object to an xtable object, which can then be printed
  as a LaTeX table.

 LaTeX is a document preparation system for high-
 quality typesetting (http://www.latex-project.org).

library(xtable)
print(xtable(model))
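
The same xtable object can also be rendered as an HTML table for web
reports; a minimal sketch:

print(xtable(model), type = "html")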

                                                            89
Publication-Quality Output (2)
library(xtable)
example(aov)
print(xtable(npk.aov))




                                 90
Publication-Quality Output (3)
 The ggplot2 package is an elegant alternative to the base graphics
  system; it has two complementary uses (see the sketch below):

   Producing publication-quality graphics using very simple syntax
    that is similar to that of base graphics. ggplot2 tends to make
    smart default choices for color, scale, etc.

   Making more sophisticated/customized plots that go
    beyond the defaults.
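
  A minimal ggplot2 sketch using the mtcars data from earlier slides
  (assuming the package is installed):

  library(ggplot2)
  ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      geom_smooth(method = "lm")   # scatter plot with a fitted regression line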

                                                              91
Publication-Quality Output (4)




                                 92
Final words!
 How Large is Your Family?
 How many brothers and sisters are there in your family
 including yourself? The average number of children in
 families was about 2. Can you explain the difference
 between this value and the class average?

 Birthday Problem!
 The problem is to compute the approximate
 probability that in a room of n people, at least two
 have the same birthday.
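
 A minimal sketch of one way to compute it in R; pbirthday() comes with
 the base stats package, and the direct product formula is shown for
 comparison:

 pbirthday(23)                        # P(at least two of 23 people share a birthday), about 0.51
 n <- 23
 1 - prod((365 - 0:(n - 1)) / 365)    # the same probability from first principles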

                                                        93
Online Resources
 http://tryr.codeschool.com

 http://www.r-project.org

 http://www.statmethods.net

 http://www.r-bloggers.com

 http://www.r-tutor.com

 http://blog.revolutionanalytics.com/r
                                          94
Thank You




            95

Weitere ähnliche Inhalte

Was ist angesagt?

Data visualization using R
Data visualization using RData visualization using R
Data visualization using RUmmiya Mohammedi
 
Searching linear &amp; binary search
Searching linear &amp; binary searchSearching linear &amp; binary search
Searching linear &amp; binary searchnikunjandy
 
statistical computation using R- an intro..
statistical computation using R- an intro..statistical computation using R- an intro..
statistical computation using R- an intro..Kamarudheen KV
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programmingizahn
 
Exploratory data analysis using r
Exploratory data analysis using rExploratory data analysis using r
Exploratory data analysis using rTahera Shaikh
 
Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2izahn
 
R programming slides
R  programming slidesR  programming slides
R programming slidesPankaj Saini
 
R Programming: Introduction to Matrices
R Programming: Introduction to MatricesR Programming: Introduction to Matrices
R Programming: Introduction to MatricesRsquared Academy
 
MySQL 5.7 String Functions
MySQL 5.7 String FunctionsMySQL 5.7 String Functions
MySQL 5.7 String FunctionsFrancesco Marino
 
Looping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonLooping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonPriyankaC44
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With REdureka!
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with RYanchang Zhao
 
Python Variable Types, List, Tuple, Dictionary
Python Variable Types, List, Tuple, DictionaryPython Variable Types, List, Tuple, Dictionary
Python Variable Types, List, Tuple, DictionarySoba Arjun
 

Was ist angesagt? (20)

Data visualization using R
Data visualization using RData visualization using R
Data visualization using R
 
Searching linear &amp; binary search
Searching linear &amp; binary searchSearching linear &amp; binary search
Searching linear &amp; binary search
 
statistical computation using R- an intro..
statistical computation using R- an intro..statistical computation using R- an intro..
statistical computation using R- an intro..
 
Introduction to R Programming
Introduction to R ProgrammingIntroduction to R Programming
Introduction to R Programming
 
Exploratory data analysis using r
Exploratory data analysis using rExploratory data analysis using r
Exploratory data analysis using r
 
Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2Introduction to R Graphics with ggplot2
Introduction to R Graphics with ggplot2
 
R programming
R programmingR programming
R programming
 
Getting Started with R
Getting Started with RGetting Started with R
Getting Started with R
 
R programming slides
R  programming slidesR  programming slides
R programming slides
 
Python programming : Strings
Python programming : StringsPython programming : Strings
Python programming : Strings
 
R Programming: Introduction to Matrices
R Programming: Introduction to MatricesR Programming: Introduction to Matrices
R Programming: Introduction to Matrices
 
MySQL 5.7 String Functions
MySQL 5.7 String FunctionsMySQL 5.7 String Functions
MySQL 5.7 String Functions
 
Looping Statements and Control Statements in Python
Looping Statements and Control Statements in PythonLooping Statements and Control Statements in Python
Looping Statements and Control Statements in Python
 
Relational algebra in dbms
Relational algebra in dbmsRelational algebra in dbms
Relational algebra in dbms
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With R
 
Data Clustering with R
Data Clustering with RData Clustering with R
Data Clustering with R
 
Class ppt intro to r
Class ppt intro to rClass ppt intro to r
Class ppt intro to r
 
Python Variable Types, List, Tuple, Dictionary
Python Variable Types, List, Tuple, DictionaryPython Variable Types, List, Tuple, Dictionary
Python Variable Types, List, Tuple, Dictionary
 
Programming in R
Programming in RProgramming in R
Programming in R
 
Python : Data Types
Python : Data TypesPython : Data Types
Python : Data Types
 

Andere mochten auch

An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)Dataspora
 
R language Project report
R language Project reportR language Project report
R language Project reportTianyue Wang
 
Classification Model - Decision Tree
Classification Model -  Decision TreeClassification Model -  Decision Tree
Classification Model - Decision TreeVaibhav Jain
 
Classification model for predicting student's knowledge
Classification model for predicting student's knowledgeClassification model for predicting student's knowledge
Classification model for predicting student's knowledgeAshish Ranjan
 
IDS alert classification model
IDS alert classification modelIDS alert classification model
IDS alert classification modeldilipjangam91
 
Simple Business Model Classification System: Business Model Pipes, Valleys, a...
Simple Business Model Classification System: Business Model Pipes, Valleys, a...Simple Business Model Classification System: Business Model Pipes, Valleys, a...
Simple Business Model Classification System: Business Model Pipes, Valleys, a...Rod King, Ph.D.
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching moduleSander Timmer
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environmentizahn
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonChetan Khatri
 
R programming Basic & Advanced
R programming Basic & AdvancedR programming Basic & Advanced
R programming Basic & AdvancedSohom Ghosh
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Chapter 4 Classification
Chapter 4 ClassificationChapter 4 Classification
Chapter 4 ClassificationKhalid Elshafie
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Paytm Education Presentation
Paytm Education PresentationPaytm Education Presentation
Paytm Education PresentationAbhishek Bhatt
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysisDataminingTools Inc
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Sample project abstract
Sample project abstractSample project abstract
Sample project abstractklezeh
 

Andere mochten auch (20)

LSESU a Taste of R Language Workshop
LSESU a Taste of R Language WorkshopLSESU a Taste of R Language Workshop
LSESU a Taste of R Language Workshop
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
 
R language introduction
R language introductionR language introduction
R language introduction
 
Slideshare ppt
Slideshare pptSlideshare ppt
Slideshare ppt
 
R language Project report
R language Project reportR language Project report
R language Project report
 
Classification Model - Decision Tree
Classification Model -  Decision TreeClassification Model -  Decision Tree
Classification Model - Decision Tree
 
Classification model for predicting student's knowledge
Classification model for predicting student's knowledgeClassification model for predicting student's knowledge
Classification model for predicting student's knowledge
 
IDS alert classification model
IDS alert classification modelIDS alert classification model
IDS alert classification model
 
Simple Business Model Classification System: Business Model Pipes, Valleys, a...
Simple Business Model Classification System: Business Model Pipes, Valleys, a...Simple Business Model Classification System: Business Model Pipes, Valleys, a...
Simple Business Model Classification System: Business Model Pipes, Valleys, a...
 
Presentation R basic teaching module
Presentation R basic teaching modulePresentation R basic teaching module
Presentation R basic teaching module
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Data Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - PythonData Analytics with Pandas and Numpy - Python
Data Analytics with Pandas and Numpy - Python
 
R programming Basic & Advanced
R programming Basic & AdvancedR programming Basic & Advanced
R programming Basic & Advanced
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Chapter 4 Classification
Chapter 4 ClassificationChapter 4 Classification
Chapter 4 Classification
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Paytm Education Presentation
Paytm Education PresentationPaytm Education Presentation
Paytm Education Presentation
 
Data Mining: Classification and analysis
Data Mining: Classification and analysisData Mining: Classification and analysis
Data Mining: Classification and analysis
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Sample project abstract
Sample project abstractSample project abstract
Sample project abstract
 

Ähnlich wie R Language Introduction

R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RRsquared Academy
 
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavVyacheslav Arbuzov
 
R tutorial for a windows environment
R tutorial for a windows environmentR tutorial for a windows environment
R tutorial for a windows environmentYogendra Chaubey
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examplesDennis
 
Loops and functions in r
Loops and functions in rLoops and functions in r
Loops and functions in rmanikanta361
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply FunctionSakthi Dasans
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_publicLong Nguyen
 
R Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfR Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfTimothy McBush Hiele
 
BUilt in Functions and Simple programs in R.pdf
BUilt in Functions and Simple programs in R.pdfBUilt in Functions and Simple programs in R.pdf
BUilt in Functions and Simple programs in R.pdfkarthikaparthasarath
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisSilvio Cesare
 
Microsoft Word Practice Exercise Set 2
Microsoft Word   Practice Exercise Set 2Microsoft Word   Practice Exercise Set 2
Microsoft Word Practice Exercise Set 2rampan
 

Ähnlich wie R Language Introduction (20)

Seminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mmeSeminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mme
 
Perm winter school 2014.01.31
Perm winter school 2014.01.31Perm winter school 2014.01.31
Perm winter school 2014.01.31
 
R Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In RR Programming: Learn To Manipulate Strings In R
R Programming: Learn To Manipulate Strings In R
 
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov VyacheslavSeminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
Seminar PSU 09.04.2013 - 10.04.2013 MiFIT, Arbuzov Vyacheslav
 
An Intoduction to R
An Intoduction to RAn Intoduction to R
An Intoduction to R
 
R tutorial for a windows environment
R tutorial for a windows environmentR tutorial for a windows environment
R tutorial for a windows environment
 
R programming intro with examples
R programming intro with examplesR programming intro with examples
R programming intro with examples
 
Loops and functions in r
Loops and functions in rLoops and functions in r
Loops and functions in r
 
Introduction to r
Introduction to rIntroduction to r
Introduction to r
 
4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function4 R Tutorial DPLYR Apply Function
4 R Tutorial DPLYR Apply Function
 
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
NCCU: Statistics in the Criminal Justice System, R basics and Simulation - Pr...
 
Learn Matlab
Learn MatlabLearn Matlab
Learn Matlab
 
Practical data science_public
Practical data science_publicPractical data science_public
Practical data science_public
 
R Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdfR Cheat Sheet for Data Analysts and Statisticians.pdf
R Cheat Sheet for Data Analysts and Statisticians.pdf
 
Seminar psu 20.10.2013
Seminar psu 20.10.2013Seminar psu 20.10.2013
Seminar psu 20.10.2013
 
3 analysis.gtm
3 analysis.gtm3 analysis.gtm
3 analysis.gtm
 
BUilt in Functions and Simple programs in R.pdf
BUilt in Functions and Simple programs in R.pdfBUilt in Functions and Simple programs in R.pdf
BUilt in Functions and Simple programs in R.pdf
 
Matlab1
Matlab1Matlab1
Matlab1
 
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow AnalysisDetecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
 
Microsoft Word Practice Exercise Set 2
Microsoft Word   Practice Exercise Set 2Microsoft Word   Practice Exercise Set 2
Microsoft Word Practice Exercise Set 2
 

Mehr von Khaled Al-Shamaa

Mehr von Khaled Al-Shamaa (9)

PHP and Arabic Language Project
PHP and Arabic Language ProjectPHP and Arabic Language Project
PHP and Arabic Language Project
 
Advanced Excel, Day 5
Advanced Excel, Day 5Advanced Excel, Day 5
Advanced Excel, Day 5
 
Advanced Excel, Day 4
Advanced Excel, Day 4Advanced Excel, Day 4
Advanced Excel, Day 4
 
Advanced Excel, Day 3
Advanced Excel, Day 3Advanced Excel, Day 3
Advanced Excel, Day 3
 
Advanced Excel, Day 2
Advanced Excel, Day 2Advanced Excel, Day 2
Advanced Excel, Day 2
 
Advanced Excel, Day 1
Advanced Excel, Day 1Advanced Excel, Day 1
Advanced Excel, Day 1
 
PHP Developer Tools - Arabic
PHP Developer Tools - ArabicPHP Developer Tools - Arabic
PHP Developer Tools - Arabic
 
Ar-PHP.org
Ar-PHP.orgAr-PHP.org
Ar-PHP.org
 
CVS (Concurrent Versions System) in Arabic
CVS (Concurrent Versions System) in ArabicCVS (Concurrent Versions System) in Arabic
CVS (Concurrent Versions System) in Arabic
 

Kürzlich hochgeladen

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

R Language Introduction

  • 2. Session Road Map  First Steps  ANOVA  Importing Data into R  PCA  R Basics  Clustering  Data Visualization  Time Series  Correlation & Regression  Programming  t-Test  Publication-Quality output  Chi-squared Test 2
  • 3. First Steps (1)  R is one of the most popular platforms for data analysis and visualization currently available. It is free and open source software: http://www.r-project.org  Take advantage of its coverage and availability of new, cutting edge applications/techniques.  R will enable us to develop and distribute solutions to our NARS with no hidden license cost. 3
  • 5. First Steps (3) 5 * 4 b[4] [1] 20 [1] 5 a <- (3 * 7) + 1 b[1:3] a [1] 1 2 3 [1] 22 b[c(1,3,5)] b <- c(1, 2, 3, 5, 8) [1] 1 3 8 b * 2 [1] 2 4 6 10 16 b[b > 4] [1] 5 8 5
  • 6. First Steps (4)  citation() R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org. 6
  • 7. First Steps (5)  If you know the name of the function you want help with, you just type a question mark ? at the command line prompt followed by the name of the function: ?read.table 7
  • 8. First Steps (6)  Sometimes you cannot remember the precise name of the function, but you know the subject on which you want help. Use the help.search function with your query in double quotes like this: help.search("data input") 8
  • 9. First Steps (7)  To see a worked example just type the function name: example(mean) mean> x <- c(0:10, 50) mean> xm <- mean(x) mean> c(xm, mean(x, trim = 0.10)) [1] 8.75 5.50 mean> mean(USArrests, trim = 0.2) Murder Assault UrbanPop Rape 7.42 167.60 66.20 20.16 9
  • 10. First Steps (8)  There are hundreds of contributed packages for R, written by many different authors (to implement specialized statistical methods). Most are available for download from CRAN (http://CRAN.R-project.org)  List all available packages: library()  Load package “ggplot2”: library(ggplot2)  Documentation on package library(help=ggplot2) 10
  • 11. Importing Data into R (1)  data <- read.table("D:/path/file.txt", header=TRUE)  data <- read.csv(file.choose(), header=TRUE, sep=";")  data <- edit(data)  fix(data)  head(data)  tail(data)  tail(data, 10) 11
  • 12. Importing Data into R (2)  In order to refer to a vector by name with an R session, you need to attach the dataframe containing the vector. Alternatively, you can refer to the dataframe name and the vector name within it, using the element name operator $ like this: mtcars$mpg ?mtcars attach(mtcars) mpg 12
  • 14. Importing Data into R (4) # Read data left on the clipboard data <- read.table("clipboard", header=T) # ODBC library(RODBC) db1 <- odbcConnect("MY_DB", uid="usr", pwd="pwd") raw <- sqlQuery(db1, "SELECT * FROM table1") # XLSX library(XLConnect) xls <- loadWorkbook("my_file.xlsx", create=F) raw <- as.data.frame(readWorksheet(xls,sheet='Sheet1')) 14
  • 15. R Basics (1)  max(x) maximum value in x  min(x) minimum value in x  mean(x) arithmetic average of the values in x  median(x) median value in x  var(x) sample variance of x  sd(x) standard deviation of x  cor(x,y) correlation between vectors x and y  summary(x) generic function used to produce result summaries of the results of various functions 15
  • 16. R Basics (2)  abs(x) absolute value  floor(2.718) largest integer not greater than x (returns 2)  ceiling(3.142) smallest integer not less than x (returns 4)  asin(x) inverse sine of x in radians  round(2.718, digits=2) returns 2.72  x <- 1:12; sample(x) simple randomization  RCBD randomization: RCBD <- replicate(3, sample(x)) 16
  • 17. R Basics (3) Common data transformations (where x is the name of the vector/variable whose values are to be transformed):
    Measurements (lengths, weights, etc.): loge is log(x); log10 is log(x, 10) or log10(x); log(x+1) is log(x + 1)
    Counts (number of individuals, etc.): square root is sqrt(x)
    Percentages (must be proportions): arcsine is asin(sqrt(x))*180/pi 17
  • 18. R Basics (4)  Vectorized computations: Any function call or operator applied to a vector will automatically operate directly on all elements of the vector. nchar(month.name) # 7 8 5 5 3 4 4 6 9 7 8 8  The recycling rule: The shorter vector is replicated enough times so that the result has the length of the longer vector, then the operator is applied. 1:10 + 1:3 # 2 4 6 5 7 9 8 10 12 11 18
  • 19. R Basics (5) mydata <- matrix(rnorm(30), nrow=6) mydata # calculate the 6 row means apply(mydata, 1, mean) # calculate the 5 column means apply(mydata, 2, mean) apply(mydata, 2, mean, trim=0.2) 19
  • 20. R Basics (6)  String functions: substr(month.name, 2, 3) paste("*", month.name[1:4], "*", sep=" ") x <- toupper(dna.seq) rna.seq <- chartr("T", "U", x) comp.seq <- chartr("ACTG", "TGAC", dna.seq) 20
  • 21. R Basics (7)  Surprisingly, the base installation doesn’t provide functions for skew and kurtosis, but you can add your own: m <- mean(x) n <- length(x) s <- sd(x) skew <- sum((x-m)^3/s^3)/n kurt <- sum((x-m)^4/s^4)/n - 3 21
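The same formulas can be wrapped into a small reusable helper. This is only a sketch: the function name mySkewKurt and the test call on mtcars$mpg are illustrative, not part of the original slide.
    mySkewKurt <- function(x) {
      m <- mean(x)
      s <- sd(x)
      n <- length(x)
      skew <- sum((x - m)^3 / s^3) / n      # 0 = symmetric, > 0 skewed right, < 0 skewed left
      kurt <- sum((x - m)^4 / s^4) / n - 3  # > 0 leptokurtic, < 0 platykurtic
      c(skewness = skew, kurtosis = kurt)
    }
    mySkewKurt(mtcars$mpg)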
  • 22. Data Visualization (1)  Pairs for a matrix of scatter plots of every variable against every other: ?mtcars pairs(mtcars) Voilà! 22
  • 23. Data Visualization (2) pie(table(cyl)) barplot(table(cyl)) 23
  • 24. Data Visualization (3)  Gives a scatter plot if x is continuous, and a box-and-whisker plot if x is a factor. Some people prefer the alternative syntax plot(y~x): attach(mtcars) plot(wt, mpg) plot(cyl, mpg) cyl <- factor(cyl) plot(cyl, mpg) 24
  • 26. Data Visualization (5)  Histograms show a frequency distribution hist(qsec, col="gray") 26
  • 27. Data Visualization (6)  boxplot(qsec, col="gray")  boxplot(qsec, mpg, col="gray") 27
  • 28. Data Visualization (7) XY <- cbind(LAT, LONG) plot(XY, type='l') library(sp) XY.poly <- Polygon(XY) XY.pnt <- spsample(XY.poly, n=8, type='random') XY.pnt points(XY.pnt) 28
  • 30. Correlation and Regression (1)  If you want to determine the significance of a correlation (i.e. the p value associated with the calculated value of r) then use cor.test rather than cor. cor(wt, mpg) [1] -0.8676594 The value will vary from -1 to +1. A -1 indicates perfect negative correlation, and +1 indicates perfect positive correlation. 0 means no correlation. 30
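As a quick companion example, all pairwise correlations in a data frame can be examined in one go; the variable selection and rounding here are purely illustrative.
    round(cor(mtcars[, c("wt", "disp", "hp", "mpg", "qsec")]), 2)  # full correlation matrix, 2 decimals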
  • 31. Correlation and Regression (2) cor.test(wt, qsec) Pearson's product-moment correlation data: wt and qsec t = -0.9719, df = 30, p-value = 0.3389 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.4933536 0.1852649 sample estimates: cor -0.1747159 31
  • 32. Correlation and Regression (3) cor.test(wt, mpg) Pearson's product-moment correlation data: wt and mpg t = -9.559, df = 30, p-value = 1.294e-10 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: -0.9338264 -0.7440872 sample estimates: cor -0.8676594 32
  • 34. Correlation and Regression (5)  Fits a linear model with normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables. fit <- lm(y ~ x) summary(fit) plot(x, y) # Sample of multiple linear regression fit <- lm(y ~ x1 + x2 + x3) 34
  • 35. Correlation and Regression (6) Call: lm(formula = mpg ~ wt) Residuals: Min 1Q Median 3Q Max -4.5432 -2.3647 -0.1252 1.4096 6.8727 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 37.2851 1.8776 19.858 < 2e-16 *** wt -5.3445 0.5591 -9.559 1.29e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.046 on 30 degrees of freedom Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446 F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10 35
  • 36. Correlation and Regression (7)  The great thing about graphics in R is that it is extremely straightforward to add things to your plots. In the present case, we might want to add a regression line through the cloud of data points. The function for this is abline which can take as its argument the linear model object: abline(fit)  Note: abline(a, b) function adds a regression line with an intercept of a and a slope of b 36
  • 37. Correlation and Regression (8) plot(wt, mpg, xlab="Weight", ylab="Miles/Gallon") abline(fit, col="blue", lwd=2) text(4, 25, "mpg = 37.29 - 5.34 wt") 37
  • 38. Correlation and Regression (9)  Predict is a generic built-in function for predictions from the results of various model fitting functions: predict(fit, list(wt = 4.5)) [1] 13.23500 38
  • 40. Correlation and Regression (11)  What do you do if you identify problems? There are four approaches to dealing with violations of regression assumptions:  Deleting observations  Transforming variables  Adding or deleting variables  Using another regression approach 40
  • 41. Correlation and Regression (12)  You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model. fit1 <- lm (y ~ A + B + C) fit2 <- lm (y ~ A + C) anova(fit1, fit2)  If the test is not significant (i.e. p > 0.05), we conclude that B in this case doesn’t add to the linear prediction and we’re justified in dropping it from our model. 41
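A small concrete sketch with the mtcars data (the choice of predictors is only an example, not from the original slide):
    fit1 <- lm(mpg ~ wt + hp + disp, data = mtcars)
    fit2 <- lm(mpg ~ wt + hp, data = mtcars)   # nested in fit1: drops disp
    anova(fit2, fit1)                          # a non-significant p-value suggests disp can be dropped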
  • 42. Correlation and Regression (13) # Bootstrap 95% CI for R-Squared library(boot) rsq <- function(formula, data, indices) { fit <- lm(formula, data= data[indices,]) return(summary(fit)$r.square) } rs <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp) boot.ci(rs, type="bca") # try print(rs) and plot(rs) 42
  • 43. t-Test (1)  Comparing two sample means with normal errors (Student’s t test, t.test) t.test(a, b) t.test(a, b, paired = TRUE) # alternative argument options: # "two.sided", "less", "greater" a <- qsec[cyl == 4] b <- qsec[cyl == 6] c <- qsec[cyl == 8] 43
  • 44. t-Test (2) t.test(a, b) Welch Two Sample t-test data: a and b t = 1.4136, df = 12.781, p-value = 0.1814 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.6159443 2.9362040 sample estimates: mean of x mean of y 19.13727 17.97714 44
  • 45. t-Test (3) t.test(a, c) Welch Two Sample t-test data: a and c t = 3.9446, df = 17.407, p-value = 0.001005 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 1.102361 3.627899 sample estimates: mean of x mean of y 19.13727 16.77214 45
  • 46. t-Test (4) (a) Test the equality of variances assumption: ev <- var.test(a, c)$p.value (b) Test the normality assumption: an <- shapiro.test(a)$p.value bn <- shapiro.test(c)$p.value 46
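A minimal sketch of how these two checks might drive the choice of test, using the objects a, c, ev, an and bn defined above and the usual 0.05 cut-offs; the exact decision rule is an assumption spelled out in the editor's notes below.
    if (an < 0.05 || bn < 0.05) {
      wilcox.test(a, c)                        # normality doubtful: fall back to the non-parametric test
    } else {
      t.test(a, c, var.equal = (ev > 0.05))    # pool the variances only when they look equal
    }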
  • 47. Chi-squared Test (1) Construct hypotheses based on qualitative – categorical data: myTable <- table(am, cyl) myTable cyl am 4 6 8 automatic 3 4 12 manual 8 3 2 47
  • 48. Chi-squared Test (2) chisq.test(myTable) Pearson's Chi-squared test data: myTable X-squared = 8.7407, df = 2, p-value = 0.01265 The expected counts under the null hypothesis: chisq.test(myTable)$expected cyl am 4 6 8 automatic 6.53125 4.15625 8.3125 manual 4.46875 2.84375 5.6875 48
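A small companion example: the same table expressed as proportions rather than raw counts (purely illustrative, using the myTable object from the previous slide).
    prop.table(myTable)       # proportion of all cars in each am x cyl cell
    prop.table(myTable, 1)    # proportions within each transmission type (rows sum to 1)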
  • 50. ANOVA (1)  A method which partitions the total variation in the response into the components (sources of variation) in the above model is called the analysis of variance. table(N, S, Rep) N <- factor(N) S <- factor(S) Rep <- factor(Rep) 50
  • 51. ANOVA (2)  The best way to understand the two significant interaction terms is to plot them using interaction.plot like this: interaction.plot(S, N, Yield) 51
  • 53. ANOVA (4) model <- aov(Yield ~ N * S) #CRD summary(model) Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.2909 42.7469 1.230e-08 *** S 3 0.9798 0.3266 6.0944 0.003106 ** N:S 6 0.6517 0.1086 2.0268 0.101243 Residuals 24 1.2862 0.0536 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 53
  • 54. ANOVA (5) par(mfrow = c(2, 2)) plot(model) ANOVA assumptions:  Normality  Linearity  Constant variance  Independence 54
  • 55. ANOVA (6) model.tables(model, "means") Tables of means Grand mean 1.104722 N 0 180 230 0.6025 1.3142 1.3975 S 0 10 20 40 0.8289 1.1556 1.1678 1.2667 S N 0 10 20 40 0 0.5600 0.7733 0.5233 0.5533 180 0.8933 1.2900 1.5267 1.5467 230 1.0333 1.4033 1.4533 1.7000 55
  • 56. ANOVA (7) model.tables(model, se=TRUE) ....... Standard errors for differences of means N S N:S 0.0945 0.1091 0.1890 replic. 12 9 3 plot.design(Yield ~ N * S) 56
  • 57. ANOVA (8) mc <- TukeyHSD(model, "N", ordered = TRUE); mc Tukey multiple comparisons of means 95% family-wise confidence level factor levels have been ordered Fit: aov(formula = Yield ~ N * S) $N diff lwr upr p adj 180-0 0.71166667 0.4756506 0.9476827 0.0000003 230-0 0.79500000 0.5589840 1.0310160 0.0000000 230-180 0.08333333 -0.1526827 0.3193494 0.6567397 57
  • 59. ANOVA (10) summary(aov(Yield ~ N * S + Error(Rep))) #RCB Error: Rep Df Sum Sq Mean Sq F value Pr(>F) Residuals 2 0.30191 0.15095 Error: Within Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.2909 51.2035 5.289e-09 *** S 3 0.9798 0.3266 7.3001 0.001423 ** N:S 6 0.6517 0.1086 2.4277 0.059281 . Residuals 22 0.9843 0.0447 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 59
  • 60. ANOVA (11)  In a split-plot design, different treatments are applied to plots of different sizes. Each different plot size is associated with its own error variance.  The model formula is specified as a factorial, using the asterisk notation. The error structure is defined in the Error term, with the plot sizes listed from left to right, from largest to smallest, with each variable separated by the slash operator /. model <- aov(Yield ~ N * S + Error(Rep/N)) 60
  • 61. ANOVA (12) Error: Rep Df Sum Sq Mean Sq F value Pr(>F) Residuals 2 0.30191 0.15095 Error: Rep:N Df Sum Sq Mean Sq F value Pr(>F) N 2 4.5818 2.29088 55.583 0.001206 ** Residuals 4 0.1649 0.04122 Error: Within Df Sum Sq Mean Sq F value Pr(>F) S 3 0.97983 0.32661 7.1744 0.002280 ** N:S 6 0.65171 0.10862 2.3860 0.071313 . Residuals 18 0.81943 0.04552 61
  • 62. ANOVA (13)  Analysis of Covariance: # f is treatment factor # x is variate acts as covariate model <- aov(y ~ x * f)  Split both main effects into linear and quadratic parts. contrasts <- list(N = list(lin=1, quad=2), S = list(lin=1, quad=2)) summary(model, split=contrasts) 62
  • 63. PCA (1)  The idea of principal components analysis (PCA) is to find a small number of linear combinations of the variables so as to capture most of the variation in the dataframe as a whole. d2 <- cbind(wt, disp/10, hp/10, mpg, qsec) colnames(d2) <- c("wt", "disp", "hp", "mpeg", "qsec") 63
  • 64. PCA (2) model <- prcomp(d2) model Standard deviations: [1] 14.6949595 3.9627722 2.8306355 1.1593717 Rotation: PC1 PC2 PC3 PC4 wt -0.05887539 0.05015401 -0.07513271 -0.16910728 disp -0.83186362 0.47519625 0.28005113 0.04080894 hp -0.40572567 -0.83180078 0.24611265 -0.28768795 mpeg 0.36888799 0.12190490 0.91398919 -0.09385946 qsec 0.06200759 0.25479354 -0.14134625 -0.93710373 64
  • 65. PCA (3) summary(model) Importance of components: PC1 PC2 PC3 Standard deviation 14.6950 3.96277 2.83064 Proportion of Variance 0.8957 0.06514 0.03323 Cumulative Proportion 0.8957 0.96082 0.99405 65
  • 66. PCA (4) plot(model) biplot(model) 66
  • 67. Clustering (1)  We define similarity on the basis of the distance between two samples in this n-dimensional space. Several different distance measures could be used to work out the distance from every sample to every other sample. This quantitative dissimilarity structure of the data is stored in a matrix produced by the dist function: rownames(d2) <- rownames(mtcars) my.dist <- dist(d2, method="euclidean") 67
  • 68. Clustering (2)  Initially, each sample is assigned to its own cluster, and then the hclust algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster (see ?hclust for details). my.hc <- hclust(my.dist, "ward") 68
  • 69. Clustering (3)  We can plot the object called my.hc, and we specify that the leaves of the hierarchy are labeled by their plot numbers plot(my.hc, hang=-1) g <- rect.hclust(my.hc, k=4, border="red") Note: When the hang argument is set to '-1' then all leaves end on one line and their labels hang down from 0. 69
  • 71. Clustering (5)  Partitioning into a number of clusters specified by the user. gr <- kmeans(cbind(disp, hp), 2) plot(disp, hp, col = gr$cluster, pch=19) points(gr$centers, col = 1:2, pch = 8, cex=2) 71
  • 73. Clustering (7) K-means clustering with 2 clusters of sizes 18, 14 Cluster means: disp hp 1 135.5389 98.05556 2 353.1000 209.21429 Clustering vector: [1] 1 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 1 1 1 2 1 2 1 Within cluster sum of squares by cluster: [1] 58369.27 93490.74 (between_SS / total_SS = 75.6 %) 73
  • 74. Clustering (8) x <- as.matrix(mtcars) heatmap(x, scale="column") 74
  • 75. Time Series (1)  First, make the data variable into a time series object # create time-series objects beer <- ts(beer, start=1956, freq=12)  It is useful to be able to turn a time series into components. The function stl performs seasonal decomposition of a time series into seasonal, trend and irregular components using loess. 75
  • 76. Time Series (2)  The remainder component is the residuals from the seasonal plus trend fit. The bars at the right-hand side are of equal heights (in user coordinates). # Decompose a time series into seasonal, # trend and irregular components using loess ts.comp <- stl(beer, s.window="periodic") plot(ts.comp) 76
  • 78. Programming (1)  We can extend the functionality of R by writing a function that estimates the standard error of the mean SEM <- function(x, na.rm = FALSE) { if (na.rm == TRUE) VAR <- x[!is.na(x)] else VAR <- x SD <- sd(VAR) N <- length(VAR) SE <- SD/sqrt(N - 1) return(SE) } 78
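A quick illustrative use of the new function, assuming mtcars is still attached as in the earlier slides:
    SEM(mpg)                        # standard error of the mean of mpg
    SEM(c(mpg, NA), na.rm = TRUE)   # the NA is dropped before the calculation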
  • 79. Programming (2)  You can define your own operator of the form %any% using any text string in place of any. The function should be a function of two arguments. "%p%" <- function(x,y) paste(x,y,sep=" ") "Hi" %p% "Khaled" [1] "Hi Khaled" 79
  • 80. Programming (3) setwd("path/to/folder") sink("output.txt") cat("Intercept \t Slope") a <- fit$coefficients[[1]] b <- fit$coefficients[[2]] cat(paste(a, b, sep="\t")) sink() jpeg(filename="graph.jpg", width=600, height=600) plot(wt, mpg); abline(fit) dev.off() 80
  • 81. Programming (4)  The code for R functions can be viewed, and in most cases modified, if so is desired using fix() function.  You can trigger garbage collection by call gc() function which will report few memory usage statistics.  Basic tool for code timing is: system.time(commands)  tempfile() give a unique file name in temporary writable directory deleted at the end of the session. 81
  • 82. Programming (5)  Take control of your R code! RStudio is a free and open source integrated development environment for R. You can run it on your desktop (Windows, Mac, or Linux) :  Syntax highlighting, code completion, etc...  Execute R code directly from the source editor  Workspace browser and data viewer  Plot history, zooming, and flexible image & PDF export  Integrated R help and documentation  and more (http://www.rstudio.com/ide/) 82
  • 84. Programming (7)  If we want to evaluate the quadratic x^2 - 2x + 4 many times, we can write a function that evaluates it for a specific value of x: my.f <- function(x) { x^2 - 2*x + 4 } my.f(3) [1] 7 plot(my.f, -10, +10) 84
  • 86. Programming (9)  We can find the minimum of the function using: optimize(my.f, lower = -10, upper = 10) $minimum [1] 1 $objective [1] 3 which says that the minimum occurs at x=1 and at that point the quadratic has value 3. 86
  • 87. Programming (10)  We can integrate the function over the interval -10 to 10 using: integrate(my.f, lower = -10, upper = 10) 746.6667 with absolute error < 4.1e-12 which gives an answer together with an estimate of the absolute error. 87
  • 88. Programming (11) plot(my.f, -15, +15) v <- seq(-10,10,0.01) x <- c(-10,v,10) y <- c(0,my.f(v),0) polygon(x, y, col='gray') 88
  • 89. Publication-Quality Output (1)  Research doesn’t end when the last statistical analysis is completed. We need to include the results in a report. The xtable function converts an R object to an xtable object, which can then be printed as a LaTeX table.  LaTeX is a document preparation system for high-quality typesetting (http://www.latex-project.org). library(xtable) print(xtable(model)) 89
  • 91. Publication-Quality Output (3)  The ggplot2 package is an elegant alternative to the base graphics system; it has two complementary uses:  Producing publication-quality graphics using very simple syntax that is similar to that of base graphics. ggplot2 tends to make smart default choices for color, scale etc.  Making more sophisticated/customized plots that go beyond the defaults. 91
  • 93. Final words!  How Large is Your Family? How many brothers and sisters are there in your family including yourself? The average number of children in families was about 2. Can you explain the difference between this value and the class average?  Birthday Problem! The problem is to compute the approximate probability that in a room of n people, at least two have the same birthday. 93
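A short sketch of the birthday calculation in R; the helper name p.birthday is illustrative, and base R also ships stats::pbirthday for the same question.
    p.birthday <- function(n) 1 - prod(1 - (0:(n - 1)) / 365)  # 1 minus P(all n birthdays distinct)
    p.birthday(23)              # about 0.507, i.e. just over one half
    sapply(20:25, p.birthday)   # the probability rises quickly with room size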
  • 94. Online Resources  http://tryr.codeschool.com  http://www.r-project.org  http://www.statmethods.net  http://www.r-bloggers.com  http://www.r-tutor.com  http://blog.revolutionanalytics.com/r 94
  • 95. Thank You 95

Editor's notes

  1. The c() function is short for concatenate. b > 4 returns [1] FALSE FALSE FALSE TRUE TRUE. For complex conditions you can use logical operators, where ! indicates logical negation (NOT), & indicates logical AND, | indicates logical OR, and the %in% operator searches through all of the entries in the object.
  2. help.start() launches the R HTML documentation. # Argument list of a function: args(read.csv) returns function (file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) NULL
  3. # User comments: R doesn't provide multiline or block comments; you must start each line of a multiline comment with #. For debugging purposes, you can also surround code that you want the interpreter to ignore with the statement if (FALSE) { … }
  4. Packages -> Install package(s)…: select a CRAN mirror, then browse all available packages in the CRAN repository. installed.packages() # List of all currently installed packages. install.packages("ggplot2") # Install package ggplot2 from a CRAN mirror. Note: if you would like to be sure that you execute the function from a specific package, you can use the full name like this: package::function()
  5. You can use the optional row.names parameter to specify one variable (i.e. a column name in your imported data set) to represent the row identifier (like plot #). Set and get the working directory: setwd("/path/to/your/directory") and getwd(). Note: setwd() won't create a directory that doesn't exist; if necessary, you can use the dir.create() function to create a new directory, and then use setwd(). Read and execute R code from an external file: source("filename.R")
  6. The detach() function removes the data frame from the search path; it does nothing to the data frame itself. This function is optional but is good programming practice and should be included routinely (see also the with() function). # List and remove objects: ls(); rm(VAR1, VAR2); rm(list = ls()). # How to add one more calculated column to your data frame: data <- transform(data, RYD=SYD/BYD). # Example of dates in R: startday <- as.Date("2002-08-15"); today <- Sys.Date(); days <- today - startday
  7. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector: a <- c(1, 2, 5, 3, 6, -2, 4); b <- c("one", "two", "three"); c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE). A matrix is a two-dimensional array where each element has the same mode (numeric, character, or logical); matrices are created with the matrix function: y <- matrix(1:20, nrow=5, ncol=4, byrow=TRUE). Arrays are similar to matrices but can have more than two dimensions; they are created with the array function: z <- array(1:24, c(2, 3, 4)). A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, etc.): data <- data.frame(a, b, c, y, z)
  8. # XML: library(XML); cdCatalog <- xmlToDataFrame("http://www.w3schools.com/xml/cd_catalog.xml"); countryCdCatalog <- split(cdCatalog, cdCatalog$COUNTRY); class(countryCdCatalog) returns [1] "list"; names(countryCdCatalog) returns [1] "EU" "Norway" "UK" "USA"; countryCdCatalog$EU. Note: be sure that any missing data is properly coded as missing before analyzing the data, or the results will be meaningless. For example, if the value -999 refers to a missing observation in your Yield data, you can fix it using the following command: x[x == -999] <- NA
  9. mean(x, trim=0.05, na.rm=TRUE) provides the trimmed mean, dropping the highest and lowest 5 percent of scores as well as any missing values. The summary() function will return frequencies for factors and logical vectors. sqrt(x) is the same as x^(0.5). Functions of the form is.datatype() return TRUE or FALSE, while functions of the form as.datatype() convert the argument to that type. Data types: numeric, character, logical, vector, factor, matrix, array, data.frame. The long/detailed way to calculate sd (i.e. standard deviation): n <- length(x); x.mean <- sum(x) / n; ss <- sum((x - x.mean)^2); x.sd <- sqrt(ss / (n - 1))
  10. signif(24+pi/100, digits=6) # returns 24.0314 (i.e. round x to the specified number of significant digits). Sequence generation: seq(from, to) or seq(from, to, by). The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset (this can be useful in the bootstrapping technique): sample(x, n, replace=FALSE); sample(c("H", "T"), 10, replace=TRUE, prob=c(0.53, 0.47)). Note: to ensure that all trainees will get the same randomization if they run the code on their own machines you may use set.seed(123). x <- c(1, 4, 9, 16, 25, 36); diff(x) # returns c(3, 5, 7, 9, 11). Combine R objects by rows (i.e. rbind) or columns (i.e. cbind): X <- c(0, 1, 2, 3, 4); Y <- c(5, 6, 7, 8, 9); XY <- cbind(X, Y)
  11. Back transformation: log(x) vs. exp(x); log10(x) vs. 10^x; sqrt(x) vs. x^2. # Standardizes the values of x (mean of 0 and sd of 1). # To only center data, use scale=FALSE. # To only rescale data, use center=FALSE: scale(x, center=TRUE, scale=TRUE). # An appropriate representation of values such as infinity and not-a-number (NaN) is provided: x <- 1/0 # Inf; -x # -Inf; x-x # NaN; 1/Inf # 0. # Classical example showing the numerical computing problem: a <- sqrt(2); a*a == 2 # FALSE; a*a - 2 # 4.440892e-16
  12. # Details of the recycling example calculation: 1+1=2, 2+2=4, 3+3=6, 4+1=5, 5+2=7, 6+3=9, 7+1=8, 8+2=10, 9+3=12, 10+1=11
  13. # Set the random seed to ensure that you will get the same values: set.seed(123). # Generate a 6 x 5 matrix containing random normal variates: mydata <- matrix(rnorm(30), nrow=6). # Calculate trimmed column means (in this case, means based on the middle 60% of the data, with the bottom 20 percent and top 20 percent of values discarded): apply(mydata, 2, mean, trim=0.2)
  14. substr(month.name, 2, 3) returns [1] "an" "eb" "ar" "pr" "ay" "un" "ul" "ug" "ep" "ct" "ov" "ec". paste("*", month.name[1:4], "*", sep=" ") returns [1] "* January *" "* February *" "* March *" "* April *". letters[1:4] returns [1] "a" "b" "c" "d"; LETTERS[1:4] returns [1] "A" "B" "C" "D". sub("\\s", ".", "Hello World") # returns "Hello.World". strsplit("Hello World", "\\s+") # returns a list containing two elements. strsplit(month.name, c("a", "e")) # the recycling rule. # Search for a regular expression pattern in the string and return matching indices: x <- regexpr("pattern", "string", perl=TRUE)
  15. Skewness = 0 (symmetric), positive (skewed to the right), negative (skewed to the left). Kurtosis positive (leptokurtic), kurtosis negative (platykurtic). x <- c(2, 8, 1, 9, 7, 5); sort(x, decreasing=T) # 9 8 7 5 2 1; rank(x) # 2 5 1 6 4 3; which.min(x) # 3; which.max(x) # 4
  16. Whereas pie charts are ubiquitous in the business world, they are denigrated by most statisticians, who recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume. To add colors to your categorized boxplot you can try: plot(cyl, mpg, col=rainbow(nlevels(cyl))). Other vectors of contiguous colors include heat.colors(), terrain.colors(), topo.colors(), and cm.colors(). For gray levels you can use something like gray(0:n/n) where n <- nlevels(cyl)
  17. Mathematical symbols: you can use the expression function to display text containing mathematical symbols (i.e. use it in xlab, ylab, main, etc.): expression(frac(mu, sqrt(2*pi*sigma^2))). The log parameter in the plot function indicates whether or which axes should be plotted on a logarithmic scale: log="x", log="y", or log="xy" for a log x-axis, log y-axis, or both. The tck option in the plot function lets you define the length of tick marks as a fraction of the plotting region (a negative number is outside the graph, a positive number is inside, 0 suppresses ticks, and 1 creates gridlines); the default is -0.01
  18. The mtext() function places text in one of the four margins. The format is mtext("text to place", side=n, line=m, ...), where side defines which margin to place text in (1=bottom, 2=left, 3=top, 4=right), while line indicates the line in the margin, starting with 0 (closest to the plot area) and moving out. To create a plot based on probability densities rather than frequencies: hist(qsec, col="gray", probability = TRUE); lines(density(qsec), col = "red", lwd = 3). You can define how many breaks there are in your histogram using the breaks option (i.e. breaks=20). You can draw a standalone density plot (one not superimposed on another graph) using: plot(density(qsec))
  19. The boxplot summarizes a great deal of information very clearly. The horizontal line shows the median. The bottom and top of the box show the 25th and 75th percentiles, respectively. The vertical dashed lines show either the maximum value or 1.5 times the interquartile range of the data (roughly 2 standard deviations). Points more than 1.5 times the interquartile range above and below are defined as outliers and plotted individually. A boxplot can be created for variables by group using a formula instead of the variable name alone, e.g. y ~ A (a separate boxplot for the numeric variable y is generated for each value of the categorical variable A), while y ~ A*B produces a boxplot for each combination of levels in categorical variables A and B. quantile(qsec) returns 14.5000, 16.8925, 17.7100, 18.9000, 22.9000 for the 0%, 25%, 50%, 75% and 100% quantiles; quantile(qsec, probs=seq(0, 1, 0.1)) returns 14.500, 15.534, 16.734, 17.020, 17.340, 17.710, 18.180, 18.607, 19.332, 19.990, 22.900 for the 0% to 100% deciles.
  20. # ICARDA Tel-Hadya Farm boundary coordinates: LONG <- c(35.99931833, 36.00396667, 36.03403667, 36.02687333, 36.025495, 36.00249667, 35.99931667, 36.00312667, 36.00401, 35.99931833); LAT <- c(36.93153667, 36.91863167, 36.939475, 36.96093, 36.9622, 36.96005167, 36.947595, 36.94750667, 36.93485667, 36.93153667)
  21. You can display the relationship between three quantitative variables using a 2D scatter plot and use the size of the plotted point to represent the value of the third variable. This approach is referred to as a bubble plot. You want the areas, rather than the radii, of the circles to be proportional to the values of the third variable. Given the formula for the radius of a circle, r = sqrt(a/pi), the proper call is: r <- sqrt(disp[1:10]/pi); symbols(wt[1:10], mpg[1:10], circle=r, inches=0.30, fg="white", bg="lightblue", main="Bubble Plot with point size proportional to disp", xlab="Weight (lb/1000)", ylab="Miles/(US) gallon"); text(wt[1:10], mpg[1:10], rownames(mtcars[1:10,]), cex=0.6). # 3D graph code/equation: require(lattice); g <- expand.grid(x = seq(-10, 10, 0.1), y = seq(-10, 10, 0.1)); g$z <- cos(sqrt(g$x^2 + g$y^2))*(1/(g$x^2 + g$y^2)^(1/3)); wireframe(z ~ x * y, data = g, scales = list(arrows = FALSE), shade = TRUE)
  22. r = cov(x, y) / (sdx * sdy). r is called the correlation coefficient; the numerator is the covariance, and the two terms in the denominator are the standard deviations of x and y. cov(x, y) = [1 / (n - 1)] ∑ (x - meanx)(y - meany), where n is the number of observations. Note: you can also examine all bi-variate relationships in a given data frame in one go using cor(data)
  23. names(cor.test(wt, qsec)) returns [1] "statistic" "parameter" "p.value" "estimate" "null.value" "alternative" "method" "data.name" "conf.int". cor.test(wt, qsec)$p.value is 0.3389; cor.test(wt, qsec)$statistic is t = -0.9719; cor.test(wt, qsec)$estimate is cor = -0.1747159
  24. Scientific notation is a way of writing numbers that are too large or too small to be conveniently written in standard decimal notation. In scientific notation all numbers are written in the form a times ten raised to the power of b, where the exponent b is an integer and the coefficient a is any real number: 1.294e-10 = 1.294 * 10^-10 = 1.294 / 10^10 = 0.0000000001294. "Correlation does not imply causation" is a phrase used in science and statistics to emphasize that a correlation between two variables does not necessarily imply that one causes the other.
  25. Polynomial regression (i.e. y = a + b*x + c*x^2): quadratic <- lm(y ~ x + I(x^2)); summary(quadratic). # To plot it: plot(x, y); x2 <- sort(x); y2 <- fitted(quadratic)[order(x)]; lines(x2, y2). Mathematical functions can be used in formulas. For example, log(y) ~ x + z + w would predict log(y) from x, z, and w; y ~ log(x) + sin(z) would predict y = a + b * log(x) + c * sin(z)
  26. To add a label for each data point in the graph: text(wt, mpg, row.names(mtcars), cex=0.5, pos=4, col="red"). Change font name and font size: par(family="serif", ps=12). Using the identify() function, you can label selected points in a scatter plot with their row number or row name using your mouse: identify(wt, mpg, labels=rownames(mtcars)); the cursor will change from a pointer to a crosshair, and clicking on scatter plot points will label them until you select Stop from the Graphics Device menu or right-click on the graph and select Stop from the context menu. # Confidence and prediction bands: x <- seq(min(wt), max(wt), length=100); p <- predict(fit, data.frame(wt=x), interval='prediction'); lines(x, p[,2], col='red'); lines(x, p[,3], col='red'); p <- predict(fit, data.frame(wt=x), interval='confidence'); lines(x, p[,2], col='red', lty=2); lines(x, p[,3], col='red', lty=2)
  27. In fact, dropping some observations (outliers) produces a better model fit. But you need to be careful when deleting data: your models should fit your data, not the other way around! In other cases, the unusual observation may be the most interesting thing about the data you have collected. "Which variables are most important in predicting the outcome?" You implicitly want to rank-order the predictors in terms of relative importance. There have been many attempts to develop a means for assessing the relative importance of predictors. The simplest has been to compare standardized regression coefficients, which describe the expected change in the response variable (expressed in standard deviation units) for a standard deviation change in a predictor variable, holding the other predictor variables constant. Reference: Listing 8.16 ("R in Action" book), the relweights() function for calculating the relative importance of predictors.
  28. LD50 is the median lethal dose of a toxic substance, i.e., the dose of a chemical which kills half the members of a tested population. Basically, what we have is a predictor that is the dose of a chemical and a binary response variable that indicates whether the individual dies or not. The data consist of numbers dead and initial batch size for several doses (e.g. pesticide application), and we wish to know what dose kills 50% of the individuals. dead <- c(0, 10, 16, 53, 76, 83); dose <- c(1, 2, 3, 5, 10, 20); batch <- c(85, 85, 85, 85, 85, 85); y <- cbind(dead, batch-dead); model <- glm(y ~ dose, binomial); plot(dose, dead/batch); xv <- seq(0, 20, 0.1); yv <- predict(model, list(dose=xv), type="response"); lines(xv, yv). Predict doses for the binomial assay model: the dose.p function from the MASS library is run with the model object, specifying the proportions killed: library(MASS); dose.p(model, p=c(0.5, 0.9, 0.95))
  29. # Function to obtain the regression coefficients from the data (bootstrapping several statistics): rsq <- function(formula, data, indices) { d <- data[indices,] # allows boot to select a sample; fit <- lm(formula, data=d); return(coef(fit)) }. # Bootstrapping with 1000 replications: results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp, parallel="multicore") # Linux. # View results: results; plot(results, index=1) # intercept; plot(results, index=2) # wt; plot(results, index=3) # disp. # Get 95% confidence intervals: boot.ci(results, type="bca", index=1) # intercept; boot.ci(results, type="bca", index=2) # wt; boot.ci(results, type="bca", index=3) # disp
  30. un-paired case: ab.t = (mean(a)-mean(b))/sqrt(var(a)/length(a) + var(b)/length(b))paired case: ab.t = mean(a-b) / sqrt(var(a-b) / length(a-b))Luckily most numeric functions have a na.rm=TRUE option that removes missing values prior to calculations, and applies the function to the remaining values.
  31. (a) Test the equality of variances assumption: if ev > 0.05 we have to use the var.equal=TRUE option in t.test, else use var.equal=FALSE (the default value). (b) Test the normality assumption: if an < 0.05 or bn < 0.05 then we have to use wilcox.test instead of t.test
  32. You can turn your frequency table into proportions using the prop.table() function: prop.table(myTable). You can transpose a matrix using t(myTable). Note: attributes can be attached to any R object; all attributes can be retrieved using the attributes function, or any particular attribute can be accessed or modified using the attr function. A matrix is represented as an object/vector of data with a "dim" attribute; in this example there is also an extra attribute called "dimnames". rownames(myTable) <- c("Automatic", "Manual") # see also the colnames function
  33. The test is not applicable if the expected count for any of the cells is less than 5. R will warn you if this is the case and suggest that the validity of the test results is questionable.
  34. Mosaic Plots are the swiss army knife of categorical data displays. Whereas bar charts are stuck in their univariate limits, mosaic plots and their variants open up the powerful visualization of multivariate categorical data.
  35. # If you import data from a Turkish Excel file, use dec=",". # Data file name is "2F RCB.csv": data <- read.csv(file.choose(), header=TRUE, sep=";", dec="."); attach(data). You can check factor levels using levels(x), and the number of factor levels using nlevels(x). Note: by default, character variables are converted to factors when importing data; to suppress this, include the option stringsAsFactors=FALSE in the read.table function. Note: you can undo the effect of the factor function using as.numeric(x) or as.character(x), depending on the vector data type. Attributes can be attached to any R object; all attributes can be retrieved using the attributes function, or any particular attribute can be accessed or modified using the attr function. A factor is represented as an object/vector of data with two extra attributes: $levels with a list of distinct values and $class "factor". In the factor function, if the ordered argument is TRUE, the factor levels are assumed to be ordered; for compatibility with S there is also a function ordered.
  36. A graph of yield vs. S at the 3 levels of N seems to indicate a classical nutrient response interaction: no response to S at 0 N, contrasted by a strong response to S when N is not limiting. Experiment design: two-factor factorial RCBD in 3 reps. Treatments, 3 x 4 factorial: 12 treatments of all possible combinations of two factors, nitrogen N (3 levels: 0, 180 and 230 kg/ha) and sulphur S (4 levels: 0, 10, 20 and 40 kg/ha)
  37. Broad-sense heritability (h2) of the trait is the ratio of genetic variability (σ2g) to phenotypic variability (σ2g + σ2e). Generally, estimation of variance components is based on the ANOVA table: σ2e = Residual Mean Sq; σ2g = (Genotypes Mean Sq - Residual Mean Sq) / Replications. Thus an estimate of heritability is h2 = (VR - 1) / (VR + Replications - 1), where VR is the variance ratio for genotypes. aov.table <- summary(model)[[1]]; svr <- aov.table$"F value"[2]; h2 <- (svr - 1) / (svr + nlevels(Rep) - 1)
  38. Symbols commonly used in R formulas: ~ separates the response on the left from the explanatory variables on the right; + separates explanatory variables; : denotes an interaction between predictor variables; * is a shortcut for denoting all possible interactions; ^ denotes interactions up to a specified degree, e.g. y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w; . is a placeholder for all other variables in the data frame except the dependent variable; - (a minus sign) removes a variable from the equation, e.g. y ~ (x + z + w)^2 - x:w expands to y ~ x + z + w + x:z + z:w; -1 suppresses the intercept, e.g. the formula y ~ x - 1 fits a regression of y on x and forces the line through the origin at x=0
  39. The order in which the effects appear in a formula matters only when:There’s more than one factor and the design is unbalanced (i.e. the model y~A*B will not produce the same results as the model y~B*A)Covariates are present (i.e. covariates should be listed first, followed by main effects, followed by two-way interactions, and so on)
  40. PCA was invented in 1901 by Karl Pearson. Principal component analysis is a variable reduction procedure. It is useful when you have obtained data on a number of variables (possibly a large number of variables), and believe that there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed variables into a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables. Thus, the objective of principal component analysis is to reduce the dimensionality of the data set and to identify meaningful underlying variables. It is more useful as a visualization tool than as an analytical method. The basic idea in PCA is to find the components that explain the maximum amount of variance in original variables by few linearly transformed uncorrelated components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.
  41. (important) scale option in the prcomp function: * TRUE: PCA based on correlation matrix * FALSE: PCA based on covariance matrix (default)PC1 = - 0.059 * wt - 0.832 * disp - 0.406 * hp + 0.369 * mpeg + 0.062 * qsecPC2 = - 0.05 * wt + 0.475 * disp - 0.832 * hp + 0.122 * mpeg + 0.255 * qsec
  42. model2 <- prcomp(d2, scale=TRUE); summary(model2) gives Importance of components: PC1 PC2 PC3; Standard deviation 1.9227 0.9803 0.39310; Proportion of Variance 0.7394 0.1922 0.03091; Cumulative Proportion 0.7394 0.9316 0.96247
  43. The angles between biplot vectors (arrows going from the origin to the factor loading coordinates) clearly show the relationships between the plant attributes measured during the trial (the cosine of the angle between any 2 vectors approximates their correlation). To estimate the level of any variable in any genotype, draw a perpendicular line from the genotype score to the biplot vector of interest. Reading from the biplot we can summarize as follows: while both cars 25 and 30 have roughly the same level of "hp", car 25 has a much higher "disp" than car 30. Note: PC1 explains > 89% of the variance in the dataset, leaving much less for PC2 to explain.
  44. the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of:* In “single” linkage method, distance between clusters is taken as distance between the closest neighbors and in “complete” linkage the distance between farthest neighbors determines distance between clusters. * “average” linkage defines the distance between two clusters as the average distance between all pairs of items where one member of a pair belongs to cluster1 and the other member of the pair belongs to cluster2. * In “centroid” linkage the distance between clusters is defined as distance between the centers of the clusters. Thus, groups once formed are represented by their mean values for each variable, that is, by their mean vector and inter-cluster distance is the distance between two such mean vectors.* In “ward” method at each step in the analysis, union of every possible pair of clusters is considered and the two clusters whose fusion results in the minimum increase in the information loss are combined. Ward defines an information loss in terms of error sum of squares (ESS) criterion.
  45. str(g) shows a list of 4 named integer vectors, one per cluster: $ : Named int [1:7] 5 12 13 14 22 23 25 with names "Hornet Sportabout" "Merc 450SE" "Merc 450SL" "Merc 450SLC" ...; $ : Named int [1:7] 7 15 16 17 24 29 31 with names "Duster 360" "Cadillac Fleetwood" "Lincoln Continental" "Chrysler Imperial" ...; $ : Named int [1:6] 18 19 20 26 27 28 with names "Fiat 128" "Honda Civic" "Toyota Corolla" "Fiat X1-9" ...; $ : Named int [1:12] 1 2 3 4 6 8 9 10 11 21 ... with names "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
  46. A dendrogram is plotted to show the hierarchical relationships between the units, which is ordered according to the results of the cluster analysis.
  47. Note: Missing values in kmeans is not accepted!
  48. Monthly beer production in Australia from Jan. 1956 to Aug. 1995: data <- read.csv("C:/R Examples/beer.csv", header=TRUE, sep=";", dec=","); attach(data). Before: beer is a plain numeric vector, [1] 93.2 96.0 95.2 77.1 70.9 64.8 70.1 77.3 79.5 100.6 100.7 107.1 [13] 95.9 82.8 83.3 80.0 80.4 67.5 75.7 71.1 89.3 101.1 105.2 114.1 [25] 96.3 84.4 91.2 81.9 80.5 70.4 74.8 75.9 86.3 98.7 100.9 113.8 ... After: beer is a monthly time series printed with one row per year (Jan through Dec columns), e.g. 1956: 93.2 96.0 95.2 77.1 70.9 64.8 70.1 77.3 79.5 100.6 100.7 107.1; 1957: 95.9 82.8 83.3 80.0 80.4 67.5 75.7 71.1 89.3 101.1 105.2 114.1; 1958: 96.3 84.4 91.2 81.9 80.5 70.4 74.8 75.9 86.3 98.7 100.9 113.8
  49. ts.comp shows Call: stl(x = beer, s.window = "periodic") and the seasonal, trend and remainder components by month, e.g. Jan 1956: 3.571867, 91.68633, -2.05819219; Feb 1956: -5.267166, 90.80817, 10.45899171; Mar 1956: 6.091303, 89.93002, -0.82132590; Apr 1956: -7.833797, 89.10416, -4.17036590; May 1956: -11.326397, 88.27830, -6.05190626
  50. The standard error of the mean can be estimated using the formula sd/√(n − 1), where sd is the standard deviation of the sample and n is the number of observations. The function first assesses whether missing values (values of NA) should be removed, based on the value of na.rm supplied by the function user. If the function is called with na.rm=TRUE, the is.na() function is used to deselect such values before the standard deviation and length are calculated using the sd() and length() functions. Finally, the standard error of the mean is calculated and returned. Note: you can use an explicit return command, or the value returned by the function will be the value of the last statement executed. You can define your own operator of the form %any% using any text string in place of any; the function should be a function of two arguments: "%p%" <- function(x, y) paste(x, y, sep=" "); "Hi" %p% "Khaled" # "Hi Khaled". To combine more than one value in the returned result: result <- list(xname=x, yname=y); return(result). Note: in this case if value is returned then you can check value$xname and value$yname, or value[["xname"]] and value[["yname"]]. If a <- "xname" then value$a will not work, while value[[a]] will.
  51. In the sink() function you can also: use the option append=TRUE to append text to the file rather than overwriting it; use the option split=TRUE to send output to both the screen and the file. Data output: write.table(DATA, "data.csv", quote = F, row.names = T, sep = ","). In addition to jpeg(), you can use the functions pdf(), win.metafile(), png(), bmp(), tiff(), xfig(), and postscript() to save graphics in other formats. You can run an R script file non-interactively and send output to another file: R CMD BATCH [options] script.R [out-file]
  52. The sum function, for example, is a primitive function written in C for performance reasons and can't be viewed in this manner, while the cor function, like most R functions, is written in R itself. The ifelse construct is a compact and vectorized version of the if-else construct: y <- ifelse(x<0, 0, log(x)). An error is raised by a call to stop("your message"); a warning is raised by a call to warning("your message"). class(mtcars) # "data.frame"; typeof(mtcars) # "list"; object.size(mtcars) # 5336 bytes; str(mtcars)
  53. Sweave is a tool that allows you to embed the R code for complete data analyses in LaTeX documents. The purpose is to create dynamic reports, which can be updated automatically if the data or analysis change. To learn more about Sweave, visit the Sweave home page (www.stat.uni-muenchen.de/~leisch/Sweave/). To learn more about LaTeX you can start here: http://www.latex-project.org/intro.html
  54. You can use the TeXworks software to render the LaTeX tags in PDF format; TeXworks lowers the entry barrier to the TeX world, is also free and open source, and you can get it from http://www.tug.org/texworks/. To get a valid render, the xtable LaTeX output should be inserted into a LaTeX document template such as the following simple one: \documentclass{article} \usepackage[utf8]{inputenc} \usepackage[frenchb]{babel} \begin{document} % Your LaTeX goes here \end{document}
  55. The format is: qplot(x, y, data=, color=, shape=, size=, alpha=, geom=, method=, formula=, facets=, xlim=, ylim=, xlab=, ylab=, main=, sub=), where the parameters/options are: alpha, alpha transparency for overlapping elements expressed as a fraction between 0 (complete transparency) and 1 (complete opacity); data, specifies a data frame; main and sub, character vectors specifying the title and subtitle; x and y, the variables placed on the horizontal and vertical axes (for univariate plots such as histograms, omit y); xlab and ylab, character vectors specifying horizontal and vertical axis labels; xlim and ylim, two-element numeric vectors giving the minimum and maximum values for the horizontal and vertical axes; color, shape, size and fill, which associate the levels of a variable with symbol color, shape, or size (for line plots, color associates levels of a variable with line color; for density and box plots, fill associates fill colors with a variable; legends are drawn automatically); facets, which creates a trellis graph by specifying conditioning variables, expressed as rowvar ~ colvar (to create trellis graphs based on a single conditioning variable, use rowvar ~ . or . ~ colvar); geom, which specifies the geometric objects that define the graph type, expressed as a character vector with one or more entries, including "point", "smooth", "boxplot", "line", "histogram", "density", "bar", and "jitter"; method and formula, where if geom="smooth" a loess fit line and confidence limits are added by default (when the number of observations is greater than 1,000 a more efficient smoothing algorithm is employed), methods include "lm" for regression, "gam" for generalized additive models, and "rlm" for robust regression, and the formula parameter gives the form of the fit. For example, to add simple linear regression lines you'd specify geom="smooth", method="lm", formula=y~x; changing the formula to y~poly(x,2) would produce a quadratic fit. Note that the formula uses the letters x and y, not the names of the variables.
  56. library(ggplot2); attach(mtcars); am <- factor(am, labels=c("automatic", "manual")); qplot(wt, mpg, shape=20, color=am, main="1974 Motor Trend US magazine (Piece of cake!)", xlab="Weight (lb/1000)", ylab="Miles/(US) gallon", geom=c("point", "smooth"), method="lm")
  57. How Large is Your Family?The reason that the estimate is wrong is that families with 0 children could not have sent any to the class! So the average calculated is a random sample when sampling by child and not by family. In this case families with a large number of children are sampled more often - one time for each child.Birthday Problem!Ṕ(n) = 1 x (1 – 1/365) x (1 – 2/365) x ... x (1 – (n – 1)/365)The equation expresses the fact that the first person has no one to share a birthday, the second person cannot have the same birthday as the first (364/365), the third cannot have the same birthday as the first two (363/365), and in general the n th birthday cannot be the same as any of the n − 1 preceding birthdays.The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability P(n) is:P(n) = 1 – Ṕ(n)This probability surpasses 1/2 for n = 23 (with value about 50.7%). For more information:http://en.wikipedia.org/wiki/Birthday_problem
  58. Please don’t hesitate to contact us if you have any question, comment or feedback related to this session <khaled.alshamaa@gmail.com>. Japanese attitude for work: If one can do it, I can do it. If no one can do it, I must do it. Middle Eastern attitude for work: Wallahi … if one can do it, let him do it. If no one can do it, ya-habibi how can I do it?