1. R The unsung hero of Big Data
Dhafer Malouche
CEAFE, Beit El Hikma, June 21st, 2018
ESSAI-MASE-Carthage University
http://dhafermalouche.net
2. What’s R
• Free software environment for statistical computing
• Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
• Statistical computing
• Data Extraction
• Data Cleaning
• Data Visualization
• Modeling
• Almost 13,000 packages
• IDE: RStudio
• One of the most popular statistical software environments
5. Some other features
• Reporting with R Markdown: HTML, PDF, Word...
• Dynamic data visualization¹: plotly, highcharter, rbokeh, dygraphs, leaflet, googleVis...
• Dashboards with flexdashboard
• Sophisticated statistical web apps with Shiny
• R can be called from Python, Julia...
¹ https://www.htmlwidgets.org
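As a quick illustration of these htmlwidgets-based packages, here is a minimal sketch using plotly (the built-in mtcars data is an arbitrary choice):

# An interactive scatter plot, rendered in the RStudio viewer or a browser.
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")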
6. However
• R is not well-suited to data structures larger than about 10-20% of a computer’s RAM.
• Data exceeding 50% of available RAM are essentially unusable.
• Hence we call a data set large if it exceeds 20% of the RAM on a given machine, and massive if it exceeds 50%.
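A minimal sketch of this rule of thumb (the file name and the 8 GB RAM figure are assumptions):

# Compare a file's on-disk size against RAM, using the 20%/50% thresholds above.
file_gb <- file.size("airline_20MM.csv") / 1024^3
ram_gb  <- 8                          # assumed machine RAM; adjust as needed
if (file_gb > 0.5 * ram_gb) {
  message("massive: use out-of-memory or distributed tools")
} else if (file_gb > 0.2 * ram_gb) {
  message("large: in-memory processing will be cumbersome")
} else {
  message("fits comfortably in memory")
}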
7. Big Data and R
Can we then handle Big Data in R?
8. Solutions offered by R
• Within R:
  • ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance
  • Statistical methods applied to large R objects through the biglm, bigalgebra, bigmemory... packages
  • The bigvis package for large-data visualization
  • Faster data manipulation methods available in the data.table package
• Connecting R to well-known Big Data tools
9. Types of data
• Medium-sized files that can be loaded into R (within the memory limit, but processing is cumbersome; typically in the 1-2 GB range): read.csv, read.table...
• Large files that cannot be loaded into R due to R/OS limitations. Two further groups:
  • Large files, from 2 to 10 GB: they can be processed locally using some workaround solutions: read.table.ffdf, fread.
  • Very large files (> 10 GB) that need distributed large-scale computing: Hadoop, H2O, Spark...
12. Comparing three methods to import a medium-sized dataset
• Standard read.csv
> system.time(DF1 <- read.csv("airline_20MM.csv",stringsAsFactors=FALSE))
user system elapsed
162.832 12.785 180.584
• Optimized read.csv: count the rows first so that nrow and colClasses can be supplied
> ptm <- proc.time()
> length(readLines("airline_20MM.csv"))
[1] 20000001
> proc.time()-ptm
user system elapsed
26.097 0.588 26.766
> classes <- c("numeric",rep("character",3),rep("numeric",22))
> system.time(DF2 <- read.csv("airline_20MM.csv", header = TRUE, sep = ",",
+ stringsAsFactors = FALSE, nrow = 20000001, colClasses = classes))
user system elapsed
68.232 3.672 72.154
• fread (from the data.table package)
> library(data.table)
> system.time(DT1 <- fread("airline_20MM.csv"))
Read 20000000 rows and 26 (of 26) columns from 1.505 GB file in 00:00:18
user system elapsed
15.113 2.443 23.715
13. Large datasets with size 2-10 GB
• Too big for comfortable in-memory processing, yet not big enough to require distributed computing
• Two solutions:
  • The big... packages: bigmemory, bigalgebra, biganalytics
  • The ff family of packages (ff, ffbase, ffbase2)
14. ff, ffbase and ffbase2 packages
• Created in 2012 by Adler, Gläser, Nenadic, Oehlschlägel, and Zucchini; already more than 340,000 downloads.
• ff chunks the dataset and stores it on the hard drive, loading only the chunks in use into RAM.
• It includes a number of general data-processing functions.
• The ffbase package lets users apply a number of statistical and mathematical operations.
15. ff, ffbase and ffbase2 packages, Example
• Create a directory for the chunk files
> system("mkdir air20MM")
> list.dirs()
...
[121] "./air20MM"
...
• Set the path to this newly created folder, which will store the ff data chunks:
> options(fftempdir = "./air20MM")
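The loading step itself falls on a slide missing from this excerpt; presumably a read.csv.ffdf call along these lines creates the air20MM.ff object used on the next slide (the chunk size is an assumption):

# Read the CSV into an out-of-memory ffdf object, chunk by chunk.
library(ff)
library(ffbase)
air20MM.ff <- read.csv.ffdf(file = "airline_20MM.csv", header = TRUE,
                            next.rows = 500000)   # rows per chunk (assumed)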
17. ff, ffbase and ffbase2 packages, Example
• Size of the binary files (80 MB each)
> file.size("./air20MM/ffdf2c9103fa5e4.ff")
[1] 8e+07
• The binary file of a given variable
> basename(filename(air20MM.ff$DayOfWeek))
[1] "ffdf2c92babdb9f.ff"
• Many other operations:
  • Saving and loading ff objects
  • Computing tables with table.ff
  • Converting a numeric vector to a factor with cut.ff
  • Value matching with ffmatch
  • Fitting generalized linear models (GLMs) with bigglm.ffdf
...and many others!
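A brief hedged illustration of two of these operations, assuming the air20MM.ff object from above (the GLM formula is an arbitrary assumption about the airline columns):

# Chunk-wise frequency table of a column shown earlier (DayOfWeek).
library(ffbase)
table.ff(air20MM.ff$DayOfWeek)
# Out-of-memory GLM via the biglm backend; ArrDelay ~ Distance is assumed.
library(biglm)
fit <- bigglm.ffdf(ArrDelay ~ Distance, data = air20MM.ff)
summary(fit)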
20. Apache Spark
• Speed: runs workloads up to 100x faster than Hadoop MapReduce.
• Ease of use: write applications quickly in Java, Scala, Python, R, and SQL.
• Generality: combine SQL, streaming, and complex analytics.
21. sparklyr: R interface for Apache Spark
• Connect to Spark from R. The sparklyr package provides a complete
dplyr backend.
• Filter and aggregate Spark datasets then bring them into R for analysis and
visualization.
• Use Spark’s distributed machine learning library from R.
• Create extensions that call the full Spark API and provide interfaces to Spark packages.
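The connection setup sits on slides omitted from this excerpt; a minimal sketch of how the sc connection used below is presumably created (a local master is an assumption; the original may target a cluster):

library(sparklyr)
# spark_install()                # one-time download of a local Spark, if needed
sc <- spark_connect(master = "local")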
25. Managing data in Spark from R
• Copying data from R to Spark via the dplyr backend (copy_to)
> library(dplyr)
> iris_tbl <- copy_to(sc, iris)
• Reading csv files
> airline_20MM_sp <- spark_read_csv(sc, "airline_20MM", "airline_20MM.csv")
• Munging and managing data on Spark from R: quickly getting statistics on massive data.
• Execute SQL queries directly against tables within a Spark cluster.
> library(DBI)
> query1 <- dbGetQuery(sc, "SELECT * FROM airline_20MM WHERE MONTH = 9")
26. Managing data in Spark from R
• Machine Learning procedures on Spark:
• ml_decision_tree for decision trees
• ml_linear_regression for regression models
• ml_gaussian_mixture for fitting Gaussian mixture models via the EM algorithm
• ....
• Example
> mtcars_tbl <- copy_to(sc, mtcars)
> partitions <- mtcars_tbl %>%
+ filter(hp >= 100) %>%
+ mutate(cyl8 = cyl == 8) %>%
+ sdf_partition(training = 0.5, test = 0.5, seed = 1099)
> fit <- partitions$training %>%
+ ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
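A hedged follow-up, not part of the original excerpt: score the held-out partition and pull a few predictions back into R (ml_predict is the current sparklyr API; older releases used sdf_predict):

summary(fit)                            # coefficients of the fitted model
pred <- ml_predict(fit, partitions$test)
head(collect(pred))                     # bring a few predicted rows into R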
27. More things to do on Spark from R
• Reading and writing data in CSV, JSON, and Parquet formats: spark_write_csv, spark_write_parquet, spark_write_json (and the matching spark_read_* functions)
• Execute arbitrary R code across your cluster using spark_apply
> spark_apply(iris_tbl, function(data) {
+ data[1:4] + rgamma(1,2)
+ })
• View the Spark web console using the spark_web function:
> spark_web(sc)
28. H2O
• Software for machine learning and data analysis.
• Open source (the liberal Apache license).
• Easy to use and scalable to big data.
• Well-documented and commercially supported.
• Website: https://www.h2o.ai/h2o/
29. How to install H2O?²
It takes a few minutes, with ∼134 MB to download.
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source",
repos="http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()
² Procedure available at
http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/index.html
30. Munging data and ML in H2O from R
• Importing a data file: h2o.importFile
• Importing multiple files: h2o.importFolder
• Combining data sets by columns or rows: h2o.cbind and h2o.rbind
• Grouping by one or more columns and applying a function to each group: h2o.group_by
• Imputing missing values: h2o.impute
• And the most important machine learning algorithms: PCA, random forests, regression and classification models, gradient boosting machines...
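A minimal hedged sketch tying a few of these together; the file name and the ArrDelay/Distance/DayOfWeek columns are assumptions about the airline data:

# Import the airline CSV into the H2O cluster and fit a simple GLM.
library(h2o)
h2o.init()
air <- h2o.importFile("airline_20MM.csv")
dim(air)                                # computed in the H2O cluster, not in R
fit <- h2o.glm(y = "ArrDelay", x = c("Distance", "DayOfWeek"),
               training_frame = air, family = "gaussian")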
31. Hadoop and RHadoop
RHadoop is a collection of five R packages:
• rhdfs: basic connectivity to the Hadoop Distributed File System; R programmers can browse, read, write, and modify files stored in HDFS from within R.
• rhbase: basic connectivity to the HBase distributed database.
• plyrmr: data manipulation operations.
• rmr2: lets R developers perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster (a toy sketch follows this list).
• ravro: read and write Avro files from the local and HDFS file systems.
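A toy rmr2 sketch (assumes a configured Hadoop cluster and the rmr2 package; the squaring job is purely illustrative):

library(rmr2)
small <- to.dfs(1:1000)                 # push a small vector into HDFS
squares <- mapreduce(input = small,
                     map = function(k, v) keyval(v, v^2))
head(from.dfs(squares)$val)             # retrieve results back into R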