1. Seminar Report’16 Statistical Computation Using R
Department of Computer Applications MESCE, Kuttippuram
STATISTICAL COMPUTATION USING R
A Seminar Report
submitted in partial fulfillment of the requirements
for the award of the Degree of
MASTER OF COMPUTER APPLICATIONS
under the
UNIVERSITY OF CALICUT
by
KAMARUDHEEN KV
Register No:MKANMCA018
DEPARTMENT OF COMPUTER APPLICATIONS
MES COLLEGE OF ENGINEERING,
KUTTIPPURAM, MALAPPURAM- 679 573
April-2016
MES COLLEGE OF ENGINEERING
KUTTIPPURAM, KERALA -679573
(AN ISO 9001: 2008 CERTIFIED INSTITUTION & WITH NBA ACCREDITED DEPARTMENTS,
APPROVED BY AICTE AND AFFILIATED TO THE UNIVERSITY OF CALICUT)
DEPARTMENT OF COMPUTER APPLICATIONS
C E R T I F I C A T E
This is to certify that the report entitled STATISTICAL COMPUTATION
USING R has been prepared and presented by Mr. KAMARUDHEEN KV (Register
No: MKANMCA018), fifth semester student of the department, during the academic
year 2015-16, in partial fulfillment of the requirements for the award of the Degree of Master
of Computer Applications under the University of Calicut.
Staff in Charge Head of the Department
Date:
ACKNOWLEDGEMENT
My endeavor stands incomplete without dedicating my gratitude to a few people who
have contributed towards the successful completion of my seminar. I pay my gratitude to
the Almighty for invisible help and blessing for the fulfillment of this work. At the outset
I express my heartfelt thanks to our Head of the Department, Prof. Hyderali K, for
permitting me to present this seminar.
I take this opportunity to express my profound gratitude to Mr. Pradeep Uduppa our
group tutor, for his valuable support and help in presenting my seminar.
I am also grateful to all our teaching and non-teaching staff for their encouragement,
guidance and whole-hearted support.
Last but not least, I am gratefully indebted to my family and friends, who gave me
precious help in presenting my seminar.
Sincerely,
KAMARUDHEEN KV
MKANMCA018
SYNOPSIS
The rapid and sustained increases in computing power starting from the second half of the
20th century have had a substantial impact on the practice of statistical science. Early
statistical models were almost always from the class of linear models, but powerful
computers, coupled with suitable numerical algorithms, caused an increased interest
in nonlinear models (such as neural networks) as well as the creation of new types, such
as generalized linear models and multilevel models.
R is rapidly becoming the leading language in data science and statistics. It is a
programming language and software environment for statistical computing and graphics
supported by the R Foundation for Statistical Computing. The R language is widely used
among statisticians and data miners for developing statistical software and data analysis.
R is an implementation of the S programming language combined with lexical scoping
semantics inspired by Scheme. S was created by John Chambers while at Bell Labs.
There are some important differences, but much of the code written for S runs unaltered in R.
TABLE OF CONTENTS
1. INTRODUCTION
2. R AS A STATISTICAL SOFTWARE
2.1 Programming Features
3. ADVANTAGES OVER OTHER STATISTICAL TOOLS
4. R PRELIMINARIES
4.1 Common Operators
5. R LANGUAGE ESSENTIALS
5.1 Expressions and Objects
5.2 Functions and Arguments
5.3 Vectors
5.4 Matrices and Arrays
5.5 Lists
5.6 Data Frames
5.7 Indexing
5.8 Commonly used method of data input
6. GRAPHICS
6.1 Standard Plots
7. CONCLUSIONS
8. REFERENCES
1. INTRODUCTION
The R system for statistical computing is an environment for data analysis and graphics.
The root of R is the S language, developed by John Chambers and colleagues (Becker et
al., 1988, Chambers and Hastie, 1992, Chambers, 1998) at Bell Laboratories (formerly
AT&T, now owned by Lucent Technologies) starting in the mid-1970s. The S language was
designed and developed as a programming language for data analysis tasks but in fact it is
a full-featured programming language in its current implementations. The development of
the R system for statistical computing is heavily influenced by the open source idea: The
base distribution of R and a large number of user contributed extensions are available
under the terms of the Free Software Foundation's GNU General Public License in source
code form. This license has two major implications for the data analyst working with R.
The complete source code is available and thus the practitioner can investigate the details
of the implementation of a special method, can make changes and can distribute
modifications to colleagues. As a side-effect, the R system for statistical computing is
available to everyone. All scientists, especially including those working in developing
countries, have access to state-of-the-art tools for statistical data analysis without
additional costs. The Comprehensive R Archive Network (CRAN) distributes the R system
itself, a collection of add-on packages, manuals, documentation and more.
The fact that R is based on a formal computer language gives it tremendous flexibility.
Other systems present simpler interfaces in terms of menus and forms, but often the
apparent user friendliness turns into a hindrance in the longer run. Although elementary
statistics is often presented as a collection of fixed procedures, analysis of moderately
complex data requires ad hoc statistical model building, which makes the added
flexibility of R highly desirable.
2. R AS A STATISTICAL SOFTWARE
R and its libraries implement a wide variety of statistical and graphical techniques,
including linear and nonlinear modeling, classical statistical tests, time-series analysis,
classification, clustering, and others. R is easily extensible through functions and
extensions, and the R community is noted for its active contributions in terms of
packages. Many of R's standard functions are written in R itself which makes it easy for
users to follow the algorithmic choices made. For computationally intensive tasks, C,
C++, and Fortran code can be linked and called at run time. Advanced users can write C,
C++, Java, .NET or Python code to manipulate R objects directly.
R is highly extensible through the use of user-submitted packages for specific functions
or specific areas of study. Due to its S heritage, R has stronger object-oriented
programming facilities than most statistical computing languages. Extending R is also
eased by its lexical scoping rules. Another strength of R is static graphics, which can
produce publication-quality graphs, including mathematical symbols. Dynamic and
interactive graphics are available through additional packages.
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hard copy.
2.1 Programming features of R
R is an interpreted language; users typically access it through a command-line interpreter.
If a user types 2+2 at the R command prompt and presses enter, the computer replies with
4, as shown below:
> 2+2
[1] 4
R's data structures include vectors, matrices, arrays, data frames (similar to tables in a
relational database) and lists. R's extensible object system includes objects for (among
others): regression models, time-series and geo-spatial coordinates. The scalar data type
was never a data structure of R. Instead, a scalar is represented as a vector with length
one.
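The point about scalars can be checked directly at the prompt. The following is a minimal sketch; the variable name x is chosen only for illustration.

```r
# A "scalar" in R is simply a vector of length one.
x <- 2 + 2
length(x)            # 1
is.vector(x)         # TRUE
x[1]                 # 4; indexing works as for any other vector
identical(x, c(4))   # TRUE: c(4) builds the same length-one vector
```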
R supports procedural programming with functions and, for some functions, object-
oriented programming with generic functions. A generic function acts differently
depending on the type of arguments passed to it. In other words, the generic function
dispatches the function (method) specific to that type of object. For example, R has a
generic print function that can print almost every type of object in R with a simple
print(objectname) syntax.
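Dispatch by a generic function can be sketched as follows. The class name "reading", its fields and the unit label are hypothetical and serve only as an illustration; print() itself is the standard generic described above.

```r
# Hypothetical class "reading", invented for this illustration.
r <- structure(list(value = 6390, unit = "kJ"), class = "reading")

# A method for the generic print(); R selects it for objects of class "reading".
print.reading <- function(x, ...) {
  cat("Reading:", x$value, x$unit, "\n")
  invisible(x)
}

print(r)     # dispatches to print.reading
print(1:3)   # dispatches to the default method for integer vectors
```

The same call, print(objectname), therefore behaves differently depending on the class of the object passed to it.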
3. ADVANTAGES OVER OTHER STATISTICAL TOOLS
In R, statistical analyses are normally done as a series of steps, with intermediate results
being stored in objects, where the objects are later “interrogated” for the information of
interest. This is in contrast to other widely used programs (e.g., SAS and SPSS), which
print a large amount of output to the screen. Storing the results in objects so that
information can be retrieved at later times allows for easily using the results of one
analysis as input for another analysis. Furthermore, because the objects contain all
pertinent model information, model modification can be easily performed by
manipulation of the objects, a valuable benefit in many cases. R packages for new
innovations in statistical computing also tend to become available more quickly than do
such developments in other statistical software packages.
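A minimal sketch of this object-based workflow is given below. It assumes the cars data set that ships with R and a simple linear model, neither of which appears in the report; they are used only to show how a stored result is interrogated and modified.

```r
# 'cars' is a small data set shipped with R (speed and stopping distance).
fit <- lm(dist ~ speed, data = cars)   # the result is stored; little is printed

coef(fit)                  # interrogate the object: estimated coefficients
head(residuals(fit), 3)    # first three residuals

# Modify the stored model instead of starting over: add a quadratic term.
fit2 <- update(fit, . ~ . + I(speed^2))
length(coef(fit2))         # the updated model has three coefficients
```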
Using R requires a more thoughtful approach to data analysis than does using some other
programs, but that dates back to the idea of the S language being one where the user
interacts with the data, as opposed to a “shotgun” approach, where the computer program
provides everything thought to be relevant to the particular problem. For those who want
to stay on the cutting edge of statistical developments, using R is a must. The flexibility
of R is arguably unmatched by any other statistics program, as its object-oriented
programming language allows for the creation of functions that perform customized
procedures and/or the automation of tasks that are commonly performed.
4. R PRELIMINARIES
Expressions are entered directly into an R session at the prompt, which is generally
denoted by the symbol >. The number sign (#) is used for comments; anything that
follows a number sign on a line is ignored.
4.1 Common Operators
4.1.1 Assignment Operator
The expression <- is the assignment operator (assign what is on the right to the object on
the left), as is -> (assign what is on the left to the object on the right).
Eg: x<-2      # assigns the value 2 to the object x
x^2->y        # assigns the value x^2 to the object y
4.1.2 Arithmetic Operators
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponentiation
4.1.3 Relational Operators
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
== Equal to
!= Not equal to
4.1.4 Logical Operators
! NOT
& AND
| OR
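The operators listed above can be tried out as in the following short sketch, which reuses the x and y of the assignment example.

```r
x <- 2            # assign 2 to x
x^2 -> y          # assign x^2 (that is, 4) to y

x + y             # 6
y / x             # 2
x < y             # TRUE
x == y            # FALSE
x != y            # TRUE
!(x < y)          # FALSE
(x < y) & (y > 10)   # FALSE: both conditions must hold
(x < y) | (y > 10)   # TRUE: one condition is enough
```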
5. R LANGUAGE ESSENTIALS
This section outlines the basic aspects of the R language. It is necessary to do this in a
slightly superficial manner, with some of the finer points glossed over. The emphasis is
on items that are useful to know in interactive usage as opposed to actual programming.
5.1 Expressions and Objects
The basic interaction mode in R is one of expression evaluation. The user enters an
expression; the system evaluates it and prints the result. Some expressions are evaluated
not for their result but for side effects such as putting up a graphics window or writing to
a file. All R expressions return a value (possibly NULL), but sometimes it is “invisible”
and not printed. Expressions typically involve variable references, operators such as +,
and function calls, as well as some other items that have not been introduced yet.
Expressions work on objects. This is an abstract term for anything that can be assigned to
a variable. R contains several different types of objects.
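The difference between visible and invisible values can be sketched with an assignment. The base function withVisible() reports whether the value of an expression would be printed; the variable name x is illustrative.

```r
x <- 10                         # returns its value invisibly, so nothing is printed
(x <- 10)                       # parentheses make the value visible: it is printed
withVisible(x <- 10)$visible    # FALSE
withVisible(x + 1)$visible      # TRUE
```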
5.2 Functions and Arguments
Many things in R are done using function calls, commands that look like an application of
a mathematical function of one or several variables; for example, log(x) or plot(height,
weight). The format is that a function name is followed by a set of parentheses containing
one or more arguments. For instance, in plot(height,weight) the function name is plot and
the arguments are height and weight. These are the actual arguments, which apply only
to the current call. A function also has formal arguments, which get connected to actual
arguments in the call.
When you write plot(height, weight), R assumes that the first argument corresponds to the
x-variable and the second one to the y-variable. This is known as positional matching.
Positional matching becomes unwieldy for functions with many arguments, because every
argument must be supplied in the correct position. Fortunately, R has methods to avoid
this: most arguments have sensible defaults and can be omitted in the standard cases, and
there are nonpositional ways of specifying them (by name) when you need to depart from
the default settings.
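Positional matching, named arguments and defaults can be compared in a small sketch. The function power() is a hypothetical helper defined here only for illustration; it is not part of R.

```r
# Hypothetical helper: 'base' and 'exponent' are the formal arguments,
# and 'exponent' has a default value of 2.
power <- function(base, exponent = 2) base^exponent

power(3, 4)                     # positional matching: 3^4 = 81
power(exponent = 4, base = 3)   # named arguments, any order: 81
power(3)                        # default exponent used: 9
```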
5.3 Vectors
A character vector is a vector of text strings, whose elements are specified and printed in
quotes:
> c("Huey","Dewey","Louie")
[1] "Huey" "Dewey" "Louie"
It does not matter whether you use single- or double-quote symbols, as long as the left quote
is the same as the right quote:
> c('Huey','Dewey','Louie')
[1] "Huey" "Dewey" "Louie"
Logical vectors are constructed using the c function just like the other vector types:
> c(T,T,F,T)
[1] TRUE TRUE FALSE TRUE
It is much more common to use single logical values to turn an option on or off in a
function call.
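As a small sketch of a single logical value used as a switch, the decreasing argument of sort() reverses the sort order. TRUE and FALSE are written out here because T and F are ordinary variables that can be reassigned, whereas TRUE and FALSE are reserved words.

```r
sort(c(3, 1, 2))                      # 1 2 3
sort(c(3, 1, 2), decreasing = TRUE)   # 3 2 1
```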
5.4 Matrices and Arrays
A matrix in mathematics is just a two-dimensional array of numbers. Matrices are used
for many purposes in theoretical and practical statistics. However, matrices and also
higher-dimensional arrays do get used for simpler purposes as well, mainly to hold tables.
In R, the matrix notion is extended to elements of any type. Matrices and arrays are
represented as vectors with dimensions:
> x <- 1:12
> dim(x) <- c(3,4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
The dim assignment function sets or changes the dimension attribute of x, causing R to
treat the vector of 12 numbers as a 3 × 4 matrix. Notice that the storage is column-major;
that is, the elements of the first column are followed by those of the second.
A convenient way to create matrices is to use the matrix function:
> matrix(1:12,nrow=3,byrow=T)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
Notice how the byrow=T switch causes the matrix to be filled in a row-wise fashion rather
than column-wise. Useful functions that operate on matrices include rownames, colnames,
and the transposition function t (notice the lowercase t as opposed to uppercase T for
TRUE), which turns rows into columns and vice versa.
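The functions rownames, colnames and t can be sketched on the matrix created above. The row and column labels are arbitrary and chosen only for illustration.

```r
m <- matrix(1:12, nrow = 3, byrow = TRUE)
rownames(m) <- c("a", "b", "c")        # illustrative row labels
colnames(m) <- c("w", "x", "y", "z")   # illustrative column labels

m["b", "y"]    # 7: second row, third column
tm <- t(m)     # transpose: rows become columns
dim(tm)        # 4 3
tm["y", "b"]   # 7: the same element, with the indices swapped
```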
5.5 Lists
It is sometimes useful to combine a collection of objects into a larger composite object.
This can be done using lists. You can construct a list from its components with the
function list.
As an example, consider a set of data concerning pre- and postmenstrual energy intake in
a group of women. We can place these data in two vectors as follows:
> intake.pre <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
> intake.post <- c(3910,4220,3885,5160,5645,4680,5265,5975,6790,6900,7335)
To combine these individual vectors into a list,
> mylist <- list(before=intake.pre,after=intake.post)
> mylist
$before
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
$after
[1] 3910 4220 3885 5160 5645 4680 5265 5975 6790 6900 7335
The components of the list are named according to the argument names used in list.
Named components may be extracted like this:
> mylist$before
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
Many of R's built-in functions compute more than a single vector of values and return
their results in the form of a list.
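One built-in function that returns its result as a list is rle(), which computes run lengths. It is not discussed in this report and is used here only as a convenient illustration; its components are extracted with $ or [[ ]] exactly as for mylist.

```r
r <- rle(c(1, 1, 1, 1, 2, 2, 2, 2, 2))   # run-length encoding of a vector
is.list(r)       # TRUE: the result is stored as a list
names(r)         # "lengths" "values"
r$lengths        # 4 5
r$values         # 1 2
r[["values"]]    # same component, extracted with double brackets
```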
5.6 Data Frames
A data frame corresponds to what other statistical packages call a “data matrix” or a “data
set”. It is a list of vectors and/or factors of the same length that are related “across” such
that data in the same position come from the same experimental unit (subject, animal,
etc.). In addition, it has a unique set of row names. We can create data frames from
preexisting variables:
> d <- data.frame(intake.pre,intake.post)
> d
   intake.pre intake.post
1        5260        3910
2        5470        4220
3        5640        3885
4        6180        5160
5        6390        5645
6        6515        4680
7        6805        5265
8        7515        5975
9        7515        6790
10       8230        6900
11       8770        7335
As with lists, components (i.e., individual variables) can be accessed using
the $ notation:
> d$intake.pre
[1] 5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
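A few further operations on the data frame can be sketched as follows. The column name diff is an illustrative choice and does not come from the report.

```r
intake.pre  <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680, 5265, 5975, 6790, 6900, 7335)
d <- data.frame(intake.pre, intake.post)

nrow(d)     # 11 experimental units
names(d)    # "intake.pre" "intake.post"

# Add a derived variable as a new column (the name 'diff' is illustrative).
d$diff <- d$intake.pre - d$intake.post
head(d, 3)  # first three rows, now with three columns
```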
5.7 Indexing
If you need a particular element in a vector, for instance the premenstrual energy intake
for woman no. 5, you can write:
> intake.pre[5]
[1] 6390
The brackets are used for selection of data, also known as indexing or subsetting. This
also works on the left-hand side of an assignment (so that you can say, for instance,
intake.pre[5] <- 6390) if you want to modify elements of a vector. If you want a subvector
consisting of data for more than one woman, for instance nos. 3, 5, and 7, you can index
with a vector:
> intake.pre[c(3,5,7)]
[1] 5640 6390 6805
Note that it is necessary to use the c(...)-construction to define the vector consisting of the
three numbers 3, 5, and 7. intake.pre[3,5,7] would mean something completely different.
It would specify indexing into a three-dimensional array. Indexing with a vector also
works if the index vector is stored in a variable. This is useful when we need to index
several variables in the same way.
> v <- c(3,5,7)
> intake.pre[v]
[1] 5640 6390 6805
It is also worth noting that to get a sequence of elements, for instance the
first five, you can use the a:b notation:
> intake.pre[1:5]
[1] 5260 5470 5640 6180 6390
A neat feature of R is the possibility of negative indexing. We can get all observations
except nos. 3, 5, and 7 by writing
> intake.pre[-c(3,5,7)]
[1] 5260 5470 6180 6515 7515 7515 8230 8770
It is not possible to mix positive and negative indices. That would be highly ambiguous.
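The indexing rules of this section can be gathered into one sketch. The vector is copied into x so that the original data stay unchanged, and the replacement value 6400 is arbitrary.

```r
intake.pre <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
x <- intake.pre        # work on a copy

x[5] <- 6400           # indexing on the left-hand side modifies one element
x[c(3, 5, 7)]          # 5640 6400 6805
length(x[-c(3, 5, 7)]) # 8: everything except elements 3, 5 and 7

# Mixing positive and negative indices is an error; tryCatch() records it.
res <- tryCatch(x[c(-1, 2)], error = function(e) "error")
res                    # "error"
```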
5.8 Commonly used method of data input
The following are commonly used methods of data input.
5.8.1 Combine Function
The most useful R command for quickly entering small data sets is the c (combine)
function. This function combines its arguments into a single vector.
Eg: > y<-c(1,5,3,9)
> y
[1] 1 5 3 9
The combine function can also be used to construct a vector of character strings:
Eg: > Name<-c("bob","Jack","Simon")
> Name
[1] "bob" "Jack" "Simon"
5.8.2 Sequence Function
The sequence operator ":" generates consecutive numbers, while the seq function does the
same thing but is more flexible.
Eg: > 1:4
[1] 1 2 3 4
seq function:
> seq(2,8,by=2)
[1] 2 4 6 8
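The extra flexibility of seq can be sketched with its by and length.out arguments.

```r
1:4                          # 1 2 3 4
seq(2, 8, by = 2)            # 2 4 6 8
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)          # 10 7 4 1: a decreasing sequence
```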
5.8.3 Scan Function
The scan function is used to enter comparatively small quantities of data from the
keyboard. The R command is
Variable=scan()
After this command, type in the data values separated by spaces (or by commas, if
sep="," is supplied to scan), and terminate data entry by pressing the Enter key on an
empty line.
Eg: A<-scan()
1: 25 50 63 64 55 47
7:
Read 6 items
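Outside an interactive session, the same input can be sketched with the text argument of scan, which reads from a character string instead of the keyboard. The quiet argument only suppresses the "Read n items" message; the variable names are illustrative.

```r
# Values separated by spaces (the default separator is white space).
A <- scan(text = "25 50 63 64 55 47", quiet = TRUE)
A

# Values separated by commas need sep = ",".
B <- scan(text = "25,50,63", sep = ",", quiet = TRUE)
B
```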
5.8.4 Rep Function
To enter data containing repeated values, the rep function is useful:
y=rep(x,n)
creates the vector y, with the values of x repeated n times.
Eg: > x<-c(rep(1,4),rep(2,5))
> x
[1] 1 1 1 1 2 2 2 2 2
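The variants of rep can be sketched as follows; times and each are standard arguments of rep.

```r
rep(c(1, 2), times = 3)         # 1 2 1 2 1 2: the whole vector is repeated
rep(c(1, 2), each = 3)          # 1 1 1 2 2 2: each element is repeated
rep(c(1, 2), times = c(4, 5))   # 1 1 1 1 2 2 2 2 2: one count per element
```

The last form gives the same vector as c(rep(1,4),rep(2,5)) in the example above.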
5.8.5 Class Function
This function is useful for determining the class of a data object.
Eg: > x<-c(1,2,3,4)
> class(x)
[1] "numeric"
> y<-c("a","b","c")
> class(y)
[1] "character"
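The class function works on the other object types introduced in this report as well, as the following sketch shows.

```r
class(TRUE)                 # "logical"
class(1:3)                  # "integer"
class(list(1, "a"))         # "list"
class(data.frame(a = 1))    # "data.frame"
class(sum)                  # "function"
```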
6. GRAPHICS
In order to produce graphical output, the user calls a series of graphics functions, each of
which produces either a complete plot, or adds some output to an existing plot. R graphics
follows a "painter's model," which means that graphics output occurs in steps, with later
output obscuring any previous output that it overlaps.
Functions in the graphics systems and graphics packages can be broken down into three
main types: high-level functions that produce complete plots; low-level functions that
add further output to an existing plot; and functions for working interactively with
graphical output.
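The painter's model can be sketched by one high-level call followed by low-level additions. The data are made up for illustration, and pdf(NULL) is used on the assumption that no screen or file output is wanted; in an interactive session it can be left out so that the plot appears in a window.

```r
pdf(NULL)   # assumption: open a PDF device that writes no file
x <- 1:10
y <- x^2

plot(x, y, type = "n", main = "Painter's model")  # high-level: axes and title only
lines(x, y, col = "grey")                         # low-level: add a line
points(x, y, pch = 19)                            # low-level: points drawn over the line
abline(h = 50, lty = 2)                           # low-level: dashed reference line

invisible(dev.off())   # close the device
```

Because later output is painted over earlier output, the points hide the parts of the grey line that they overlap.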
6.1 Standard Plots
R provides the usual range of standard statistical plots, including scatterplots, boxplots,
histograms, barplots, piecharts, and basic 3D plots.
6.1.1 Scatter Plot
The function plot() can be used to plot data. Although it has a diverse array of arguments,
the most common specification is of the form plot(x, y, type, col, xlim, ylim, xlab, ylab,
main), where
x is the data to be represented on the abscissa (x-axis) of the plot;
y is the data to be represented on the ordinate (y-axis); the ordering of the values in x and
y must be consistent, meaning that the first element in y is linked to the first element in x,
etc.;
type is the type of plot (e.g., "p" for points, "l" for lines, "n" for no plotting but setting up
the structure of the plot so that points and/or lines can be added later);
col is the color of the points and lines;
xlim and ylim are the ranges of the x-axis and y-axis, respectively;
xlab and ylab are the labels of the x-axis and y-axis, respectively; and
main is the title of the plot. All of the above arguments, except x and y, are optional.
Eg:
> age<-c(25,35,45,55,65)
> frequency<-c(55,93,113,90,85)
> plot(age,frequency,xlab="age",ylab="frequency",pch=1,main="frequency vs age")
6.1.2 Histogram
The function to plot histograms is hist(). The basic specification is of the form hist(x,
breaks, freq) where x is the data to be plotted, breaks defines the way to determine the
location and/or quantity of bins, and freq is a logical value indicating whether the histogram
represents frequencies (TRUE) or probability densities (FALSE).
Eg:
> midx<-seq(25,85,10)
> fr<-c(10,24,18,12,8,5,3)
> x<-rep(midx,fr)
> brk<-seq(20,90,10)
> hist(x,brk,main="Histogram",xlab="pocket money",ylab="no.of students")
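The bins computed by hist() can also be inspected without drawing anything. The sketch below repeats the data of the example and uses the plot = FALSE argument of hist; the object name h is illustrative.

```r
midx <- seq(25, 85, 10)            # class midpoints
fr   <- c(10, 24, 18, 12, 8, 5, 3) # class frequencies
x    <- rep(midx, fr)              # expand to individual observations
brk  <- seq(20, 90, 10)            # class boundaries

h <- hist(x, breaks = brk, plot = FALSE)   # compute the histogram only
h$counts                           # 10 24 18 12 8 5 3: the original frequencies
h$mids                             # 25 35 45 55 65 75 85
sum(h$density * diff(h$breaks))    # 1: the densities integrate to one
```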
6.1.3 Bar Plot
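A bar plot is used to represent grouped data. A bar graph is a chart that uses either vertical or
horizontal bars to show comparisons among categories.
The function to plot a bar chart is barplot(). A common specification is of the form
barplot(height, names.arg, col, xlim, ylim, xlab, ylab, main)
where height is a vector or matrix giving the bar heights and names.arg gives the labels
drawn below the bars.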
Eg:
> year<-1995:2000
> sales<-c(15,25,27,28,26,26.6)
> sales.year<-data.frame(year,sales)
> sales.year
  year sales
1 1995  15.0
2 1996  25.0
3 1997  27.0
4 1998  28.0
5 1999  26.0
6 2000  26.6
> barplot(sales.year$sales,names.arg=sales.year$year,xlab="year",ylab="sales",col="grey")
[Figure: histogram titled "Histogram"; x-axis: pocket money; y-axis: no. of students]
6.1.4 Box Plot
A box plot is a convenient way of graphically depicting groups of numerical data through their
quartiles. A box plot may also have lines extending vertically from the boxes (whiskers),
indicating variability outside the upper and lower quartiles.
The function to draw a box plot is boxplot().
Eg:
> x<-rnorm(100,1,1)
> boxplot(x,lwd=2)
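The numbers behind a box plot can be retrieved with the plot = FALSE argument of boxplot. Since rnorm produces different values on every run, the sketch below uses a small fixed vector, made up for illustration, so that the result is reproducible.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)   # illustrative data with one extreme value
b <- boxplot(x, plot = FALSE)            # compute the statistics without drawing

b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker:
          # 1, 3, 5.5, 8, 9
b$out     # 100: the point plotted individually beyond the upper whisker
```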
[Figure: bar plot of sales by year; x-axis: year; y-axis: sales]
[Figure: box plot of x]
7. CONCLUSIONS
R is a flexible programming language designed to facilitate exploratory data analysis,
classical statistical tests, and high-level graphics.
R is a full-fledged programming language, with a rich complement of mathematical
functions, matrix operations and control structures. With its rich and ever-expanding
library of packages, R is on the leading edge of development in statistics, data analytics,
and data mining.
R has proven itself a useful tool within the growing field of big data and has been
integrated into several commercial packages, such as IBM SPSS and InfoSphere, as well
as Mathematica.
8. REFERENCES
Peter Dalgaard, Introductory Statistics with R, 2nd edition.
Eric Slud, Statistical Computing with R.
Quick-R: Creating Graphs, http://www.statmethods.net/graphs