Introduction to Data Mining with R and Data Import/Export
1. Introduction to Data Mining with R
and Data Import/Export in R
Yanchang Zhao
http://www.RDataMining.com
30 September 2014
1 / 25
2. Questions
I Do you know data mining and its algorithms and techniques?
2 / 25
3. Questions
I Do you know data mining and its algorithms and techniques?
I Have you heard of R?
2 / 25
4. Questions
I Do you know data mining and its algorithms and techniques?
I Have you heard of R?
I Have you used R in your research or projects?
2 / 25
5. Outline
Introduction to R
R Packages and Functions for Data Mining
Data Import and Export
Online Resources
3 / 25
6. What is R?
I R 1 is a free software environment for statistical computing
and graphics.
I R can be easily extended with 5,800+ packages available on
CRAN2 (as of 13 Sept 2014).
I Many other packages provided on Bioconductor3, R-Forge4,
GitHub5, etc.
I R manuals on CRAN6
I An Introduction to R
I The R Language De
7. nition
I R Data Import/Export
I . . .
1http://www.r-project.org/
2http://cran.r-project.org/
3http://www.bioconductor.org/
4http://r-forge.r-project.org/
5https://github.com/
6http://cran.r-project.org/manuals.html
4 / 25
8. Why R?
I R is widely used in both academia and industry.
I R was ranked no. 1 in the KDnuggets 2014 poll on Top
Languages for analytics, data mining, data science7 (actually
R has been no. 1 in 2011, 2012 & 2013!).
I The CRAN Task Views 8 provide collections of packages for
dierent tasks.
I Machine learning atatistical learning
I Cluster analysis
9. nite mixture models
I Time series analysis
I Multivariate statistics
I Analysis of spatial data
I . . .
7
http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html
8
http://cran.r-project.org/web/views/
5 / 25
10. Outline
Introduction to R
R Packages and Functions for Data Mining
Data Import and Export
Online Resources
6 / 25
12. cation with R
I Decision trees: rpart, party
I Random forest: randomForest, party
I SVM: e1071, kernlab
I Neural networks: nnet, neuralnet, RSNNS
I Performance evaluation: ROCR
7 / 25
13. Clustering with R
I k-means: kmeans(), kmeansruns()9
I k-medoids: pam(), pamk()
I Hierarchical clustering: hclust(), agnes(), diana()
I DBSCAN: fpc
I BIRCH: birch
9Functions are followed with (), and others are packages.
8 / 25
14. Association Rule Mining with R
I Association rules: apriori(), eclat() in package arules
I Sequential patterns: arulesSequence
I Visualisation of associations: arulesViz
9 / 25
15. Text Mining with R
I Text mining: tm
I Topic modelling: topicmodels, lda
I Word cloud: wordcloud
I Twitter data access: twitteR
10 / 25
16. Time Series Analysis with R
I Time series decomposition: decomp(), decompose(), arima(),
stl()
I Time series forecasting: forecast
I Time Series Clustering: TSclust
I Dynamic Time Warping (DTW): dtw
11 / 25
17. Social Network Analysis with R
I Packages: igraph, sna
I Centrality measures: degree(), betweenness(), closeness(),
transitivity()
I Clusters: clusters(), no.clusters()
I Cliques: cliques(), largest.cliques(), maximal.cliques(),
clique.number()
I Community detection: fastgreedy.community(),
spinglass.community()
12 / 25
18. R and Big Data
I Hadoop
I Hadoop (or YARN) - a framework that allows for the
distributed processing of large data sets across clusters of
computers using simple programming models
I R Packages: RHadoop, RHIPE
I Spark
I Spark - a fast and general engine for large-scale data
processing, which can be 100 times faster than Hadoop
I SparkR - R frontend for Spark
I H2O
I H2O - an open source in-memory prediction engine for big
data science
I R Package: h2o
I MongoDB
I MongoDB - an open-source document database
I R packages: rmongodb, RMongo
13 / 25
19. R and Hadoop
I Packages: RHadoop, RHive
I RHadoop10 is a collection of R packages:
I rmr2 - perform data analysis with R via MapReduce on a
Hadoop cluster
I rhdfs - connect to Hadoop Distributed File System (HDFS)
I rhbase - connect to the NoSQL HBase database
I . . .
I You can play with it on a single PC (in standalone or
pseudo-distributed mode), and your code developed on that
will be able to work on a cluster of PCs (in full-distributed
mode)!
I Step-by-Step Guide to Setting Up an R-Hadoop System
http://www.rdatamining.com/big-data/
r-hadoop-setup-guide
10https://github.com/RevolutionAnalytics/RHadoop/wiki
14 / 25
20. Outline
Introduction to R
R Packages and Functions for Data Mining
Data Import and Export
Online Resources
15 / 25
21. Data Import and Export 11
Read data from and write data to
I R native formats (incl. Rdata and RDS)
I CSV
23. les
I ODBC databases
I SAS databases
R Data Import/Export:
I http://cran.r-project.org/doc/manuals/R-data.pdf
11Chapter 2: Data Import and Export, in book R and Data Mining: Examples
and Case Studies. http://www.rdatamining.com/docs/RDataMining.pdf
16 / 25
24. Save and Load R Objects
I save(): save R objects into a .Rdata
26. le
I rm(): remove objects from R
a - 1:10
save(a, file = ./data/dumData.Rdata)
rm(a)
a
## Error: object 'a' not found
load(./data/dumData.Rdata)
a
## [1] 1 2 3 4 5 6 7 8 9 10
17 / 25
27. Save and Load R Objects - More Functions
I save.image():
save current workspace to a
28. le
It saves everything!
I readRDS():
read a single R object from a .rds
30. le
I Advantage of readRDS() and saveRDS():
You can restore the data under a dierent object name.
I Advantage of load() and save():
You can save multiple R objects to one
37. Read from Databases
I Package RODBC: provides connection to ODBC databases.
I Function odbcConnect(): sets up a connection to database
I sqlQuery(): sends an SQL query to the database
I odbcClose() closes the connection.
library(RODBC)
db - odbcConnect(dsn = servername, uid = userid,
pwd = ******)
sql - SELECT * FROM lib.table WHERE ...
# or read query from file
sql - readChar(myQuery.sql, nchars=99999)
myData - sqlQuery(db, sql, errors=TRUE)
odbcClose(db)
21 / 25
38. Read from Databases
I Package RODBC: provides connection to ODBC databases.
I Function odbcConnect(): sets up a connection to database
I sqlQuery(): sends an SQL query to the database
I odbcClose() closes the connection.
library(RODBC)
db - odbcConnect(dsn = servername, uid = userid,
pwd = ******)
sql - SELECT * FROM lib.table WHERE ...
# or read query from file
sql - readChar(myQuery.sql, nchars=99999)
myData - sqlQuery(db, sql, errors=TRUE)
odbcClose(db)
Functions sqlFetch(), sqlSave() and sqlUpdate(): read, write
or update a table in an ODBC database
21 / 25
39. Import Data from SAS
Package foreign provides function read.ssd() for importing SAS
datasets (.sas7bdat
40. les) into R.
library(foreign) # for importing SAS data
# the path of SAS on your computer
sashome - C:/Program Files/SAS/SASFoundation/9.2
filepath - ./data
# filename should be no more than 8 characters, without extension
fileName - dumData
# read data from a SAS dataset
a - read.ssd(file.path(filepath), fileName,
sascmd=file.path(sashome, sas.exe))
22 / 25
41. Import Data from SAS
Package foreign provides function read.ssd() for importing SAS
datasets (.sas7bdat
42. les) into R.
library(foreign) # for importing SAS data
# the path of SAS on your computer
sashome - C:/Program Files/SAS/SASFoundation/9.2
filepath - ./data
# filename should be no more than 8 characters, without extension
fileName - dumData
# read data from a SAS dataset
a - read.ssd(file.path(filepath), fileName,
sascmd=file.path(sashome, sas.exe))
Another way: using function read.xport() to read a
44. Outline
Introduction to R
R Packages and Functions for Data Mining
Data Import and Export
Online Resources
23 / 25
45. Online Resources
I RDataMining website
http://www.rdatamining.com
I R Reference Card for Data Mining
I R and Data Mining: Examples and Case Studies
I RDataMining Group on LinkedIn (7,000+ members)
http://group.rdatamining.com
I RDataMining on Twitter (1,700+ followers)
@RDataMining
I Free online courses
http://www.rdatamining.com/resources/courses
I Online documents
http://www.rdatamining.com/resources/onlinedocs
24 / 25