2. What is R?
● “R is a language and environment for statistical
computing and graphics”
● Paradigms: array, object-oriented, imperative,
functional, procedural, reflective
● Everything resides in memory (data must fit in RAM, so no big data)
● Easy to get started!
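A few of the paradigms listed above can be seen in a handful of lines. This is only an illustrative sketch; the names (x, describe, shape and so on) are made up for the example and do not come from the slides.

```r
# Array: arithmetic is vectorized over whole vectors, no explicit loop.
x <- c(1, 2, 3, 4)
squares <- x^2

# Functional: functions are values and can be passed to other functions.
doubled <- sapply(x, function(v) v * 2)

# Imperative / procedural: ordinary loops and assignment also work.
total <- 0
for (v in x) {
  total <- total + v
}

# Object-oriented (S3): a generic dispatches on the class attribute.
describe <- function(obj) UseMethod("describe")
describe.default <- function(obj) "something else"
describe.circle <- function(obj) paste("circle with radius", obj$r)
shape <- structure(list(r = 2), class = "circle")
label <- describe(shape)

# Reflective: code can be captured as data and evaluated later.
expr <- quote(total * 2)
result <- eval(expr)
```

S3 is only one of R's object systems; it is used here because it needs no extra packages.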
3. Why R?
● Free Software (GNU General Public License)
● Mature: v1.0 was released in 2000
● Widely used
● Good documentation and manuals
● Lots of freely available packages
● Excellent graphic capabilities
4. Getting the data (CSV)
● MySQL
SELECT * INTO OUTFILE '/path/to/file.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
ESCAPED BY ''
LINES TERMINATED BY '\n'
FROM table WHERE <condition>;
● Hive + sed
INSERT OVERWRITE LOCAL DIRECTORY '/tmp_path/'
SELECT * FROM table
WHERE <condition>;
cat /tmp_path/* | sed 's/[Ctrl-V][Ctrl-A]/\t/g' > out.txt
● Consider sampling!
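Once the file exists, loading it into R and taking a sample might look like the sketch below. The temporary file, its contents, and the column names (id, name, score) are placeholders invented for the example; sampling could equally be done in the query itself.

```r
# Stand-in for the exported file: write a tiny CSV to a temporary path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c('1,"alice",3.5',
             '2,"bob",4.0',
             '3,"carol",2.5',
             '4,"dave",5.0'), csv_path)

# INTO OUTFILE writes no header row, so the column names are supplied here.
data <- read.csv(csv_path, header = FALSE,
                 col.names = c("id", "name", "score"),
                 stringsAsFactors = FALSE)

# The Hive export is tab-separated; read.delim() is the counterpart:
# data <- read.delim("out.txt", header = FALSE)

# Draw a random sample of rows without replacement.
set.seed(42)  # only to make the sketch reproducible
sampled <- data[sample(nrow(data), size = 2), ]
```

With real data, size would be chosen so that the sample fits comfortably in memory.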
5. Linear Regression
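$$ y = \alpha + \beta x $$
$$ \hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\operatorname{Cov}[x, y]}{\operatorname{Var}[x]} $$
$$ \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x} $$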
Just use lm() in R!
(But check the assumptions)
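As a rough illustration, the closed-form estimates on this slide can be compared with lm() on the cars dataset that ships with R. The dataset choice and the variable names are assumptions made for this sketch, and the residual checks shown are only a starting point.

```r
# 'cars' ships with R: stopping distance (dist) against speed.
x <- cars$speed
y <- cars$dist

# Closed-form estimates: slope = Cov[x, y] / Var[x], intercept from the means.
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same fit with lm().
fit <- lm(dist ~ speed, data = cars)
estimates <- coef(fit)  # named vector: (Intercept), speed

# A first look at the assumptions via the residuals.
res <- residuals(fit)
normality <- shapiro.test(res)  # Shapiro-Wilk test for normal residuals
# In an interactive session, plot(fit) draws the standard diagnostic
# plots (residuals vs fitted, normal Q-Q, and others).
```

Both routes give the same coefficients; lm() additionally reports standard errors and other diagnostics through summary(fit).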
6. Want more?
● Computing for Data Analysis – Roger D. Peng
www.coursera.org/course/compdata
● Statistics One – Andrew Conway
www.coursera.org/course/stats1
● An Introduction to R – The R Core Team
cran.r-project.org/doc/manuals/r-release/R-intro.pdf