1. R The unsung hero of Big Data
Dhafer Malouche
CEAFE, Beit El Hikma, June 21st, 2018
ESSAI-MASE-Carthage University
http://dhafermalouche.net
2. What’s R
• Free software environment for statistical computing
• Created in 1992 by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
• Statistical computing
• Data Extraction
• Data Cleaning
• Data Visualization
• Modeling
• Almost 13,000 packages
• IDE: RStudio
• One of the most popular statistical software environments
5. Some other features
• Reporting with R Markdown: HTML, PDF, Word...
• Dynamic data visualization¹: plotly, highcharter, rbokeh, dygraphs, leaflet, googleVis...
• Dashboards with flexdashboard
• Sophisticated statistical web apps with Shiny
• R can be called from Python, Julia...
¹ https://www.htmlwidgets.org
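As a quick illustration of these htmlwidgets-based packages, here is a minimal sketch using plotly (the built-in mtcars data is an arbitrary choice):

# An interactive scatter plot, rendered in the RStudio viewer or a browser.
library(plotly)
plot_ly(mtcars, x = ~wt, y = ~mpg, type = "scatter", mode = "markers")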
6. However
• R is not well-suited to data structures larger than about 10-20% of a computer’s RAM.
• Data exceeding 50% of available RAM are essentially unusable.
• Hence we call a data set large if it exceeds 20% of the RAM on a given machine, and massive if it exceeds 50%.
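A minimal sketch of this rule of thumb (the file name and the 8 GB RAM figure are assumptions):

# Compare a file's on-disk size against RAM, using the 20%/50% thresholds above.
file_gb <- file.size("airline_20MM.csv") / 1024^3
ram_gb  <- 8                          # assumed machine RAM; adjust as needed
if (file_gb > 0.5 * ram_gb) {
  message("massive: use out-of-memory or distributed tools")
} else if (file_gb > 0.2 * ram_gb) {
  message("large: in-memory processing will be cumbersome")
} else {
  message("fits comfortably in memory")
}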
7. Big Data and R
Can we then handle Big Data in R?
8. Solutions offered by R
• Within R:
  • ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory performance
  • Statistical methods applied to large R objects through the biglm, bigalgebra, bigmemory... packages
  • The bigvis package for large-data visualization
  • Faster data manipulation methods available in the data.table package
• Connecting R to well-known Big Data tools
9. Types of data
• Medium-sized files that can be loaded into R (within the memory limit, but processing is cumbersome; typically in the 1-2 GB range): read.csv, read.table...
• Large files that cannot be loaded into R due to R/OS limitations. Two further groups:
  • Large files, from 2 to 10 GB: they can be processed locally using some workaround solutions: read.table.ffdf, fread.
  • Very large files (> 10 GB) that need distributed large-scale computing: Hadoop, H2O, Spark...
12. Comparing three methods to import a medium-sized dataset
• Standard read.csv
> system.time(DF1 <- read.csv("airline_20MM.csv",stringsAsFactors=FALSE))
user system elapsed
162.832 12.785 180.584
• Optimized read.csv: count the rows first so that nrow and colClasses can be supplied
> ptm <- proc.time()
> length(readLines("airline_20MM.csv"))
[1] 20000001
> proc.time()-ptm
user system elapsed
26.097 0.588 26.766
> classes <- c("numeric",rep("character",3),rep("numeric",22))
> system.time(DF2 <- read.csv("airline_20MM.csv", header = TRUE, sep = ",",
+ stringsAsFactors = FALSE, nrow = 20000001, colClasses = classes))
user system elapsed
68.232 3.672 72.154
• fread (from the data.table package)
> library(data.table)
> system.time(DT1 <- fread("airline_20MM.csv"))
Read 20000000 rows and 26 (of 26) columns from 1.505 GB file in 00:00:18
user system elapsed
15.113 2.443 23.715
13. Large datasets with size 2-10 GB
• Too big for comfortable in-memory processing, yet not big enough to require distributed computing
• Two solutions:
  • The big... packages: bigmemory, bigalgebra, biganalytics
  • The ff family of packages (ff, ffbase, ffbase2)
14. ff, ffbase and ffbase2 packages
• Created in 2012 by Adler, Gläser, Nenadic, Oehlschlägel, and Zucchini; already more than 340,000 downloads.
• ff chunks the dataset and stores it on the hard drive, loading only the chunks in use into RAM.
• It includes a number of general data-processing functions.
• The ffbase package lets users apply a number of statistical and mathematical operations.
15. ff, ffbase and ffbase2 packages, Example
• Create a directory for the chunk files
> system("mkdir air20MM")
> list.dirs()
...
[121] "./air20MM"
...
• Set the path to this newly created folder, which will store the ff data chunks:
> options(fftempdir = "./air20MM")
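The loading step itself falls on a slide missing from this excerpt; presumably a read.csv.ffdf call along these lines creates the air20MM.ff object used on the next slide (the chunk size is an assumption):

# Read the CSV into an out-of-memory ffdf object, chunk by chunk.
library(ff)
library(ffbase)
air20MM.ff <- read.csv.ffdf(file = "airline_20MM.csv", header = TRUE,
                            next.rows = 500000)   # rows per chunk (assumed)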
17. ff, ffbase and ffbase2 packages, Example
• Size of the binary files (80 MB each)
> file.size("./air20MM/ffdf2c9103fa5e4.ff")
[1] 8e+07
• The binary file of a given variable
> basename(filename(air20MM.ff$DayOfWeek))
[1] "ffdf2c92babdb9f.ff"
• Many other operations:
  • Saving and loading ff objects
  • Computing tables with table.ff
  • Converting a numeric vector to a factor with cut.ff
  • Value matching with ffmatch
  • Fitting generalized linear models (GLMs) with bigglm.ffdf
...and many others!
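A brief hedged illustration of two of these operations, assuming the air20MM.ff object from above (the GLM formula is an arbitrary assumption about the airline columns):

# Chunk-wise frequency table of a column shown earlier (DayOfWeek).
library(ffbase)
table.ff(air20MM.ff$DayOfWeek)
# Out-of-memory GLM via the biglm backend; ArrDelay ~ Distance is assumed.
library(biglm)
fit <- bigglm.ffdf(ArrDelay ~ Distance, data = air20MM.ff)
summary(fit)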
20. Apache Spark
• Speed: runs workloads up to 100x faster than Hadoop MapReduce.
• Ease of use: write applications quickly in Java, Scala, Python, R, and SQL.
• Generality: combine SQL, streaming, and complex analytics.
21. sparklyr: R interface for Apache Spark
• Connect to Spark from R. The sparklyr package provides a complete
dplyr backend.
• Filter and aggregate Spark datasets then bring them into R for analysis and
visualization.
• Use Spark’s distributed machine learning library from R.
• Create extensions that call the full Spark API and provide interfaces to Spark packages.
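The connection setup sits on slides omitted from this excerpt; a minimal sketch of how the sc connection used below is presumably created (a local master is an assumption; the original may target a cluster):

library(sparklyr)
# spark_install()                # one-time download of a local Spark, if needed
sc <- spark_connect(master = "local")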
25. Managing data in Spark from R
• Copying data from R to Spark via the dplyr backend (copy_to)
> library(dplyr)
> iris_tbl <- copy_to(sc, iris)
• Reading csv files
> airline_20MM_sp <- spark_read_csv(sc, "airline_20MM", "airline_20MM.csv")
• Munging and managing data on Spark from R: quickly getting statistics on massive data.
• Execute SQL queries directly against tables within a Spark cluster.
> library(DBI)
> query1 <- dbGetQuery(sc, "SELECT * FROM airline_20MM WHERE MONTH = 9")
26. Managing data in Spark from R
• Machine Learning procedures on Spark:
• ml_decision_tree for decision trees
• ml_linear_regression for regression models
• ml_gaussian_mixture for fitting Gaussian mixture models via the EM algorithm
• ....
• Example
> mtcars_tbl <- copy_to(sc, mtcars)
> partitions <- mtcars_tbl %>%
+ filter(hp >= 100) %>%
+ mutate(cyl8 = cyl == 8) %>%
+ sdf_partition(training = 0.5, test = 0.5, seed = 1099)
> fit <- partitions$training %>%
+ ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
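A hedged follow-up, not part of the original excerpt: score the held-out partition and pull a few predictions back into R (ml_predict is the current sparklyr API; older releases used sdf_predict):

summary(fit)                            # coefficients of the fitted model
pred <- ml_predict(fit, partitions$test)
head(collect(pred))                     # bring a few predicted rows into R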
27. More things to do on Spark from R
• Reading and writing data in CSV, JSON, and Parquet formats: spark_write_csv, spark_write_parquet, spark_write_json (and the matching spark_read_* functions)
• Execute arbitrary R code across your cluster using spark_apply
> spark_apply(iris_tbl, function(data) {
+ data[1:4] + rgamma(1,2)
+ })
• View the Spark web console using the spark_web function:
> spark_web(sc)
28. H2O
• Software for machine learning and data analysis.
• Open source (the liberal Apache license).
• Easy to use and scalable to big data.
• Well-documented and commercially supported.
• Website: https://www.h2o.ai/h2o/
29. How to install H2O?²
It takes a few minutes, with ∼134 MB to download.
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source",
repos="http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()
² Procedure available at
http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/index.html
30. Munging data and ML in H2O from R
• Importing a data file: h2o.importFile
• Importing multiple files: h2o.importFolder
• Combining data sets by columns or rows: h2o.cbind and h2o.rbind
• Grouping by one or more columns and applying a function to each group: h2o.group_by
• Imputing missing values: h2o.impute
• And the most important machine learning algorithms: PCA, random forests, regression and classification models, gradient boosting machines...
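A minimal hedged sketch tying a few of these together; the file name and the ArrDelay/Distance/DayOfWeek columns are assumptions about the airline data:

# Import the airline CSV into the H2O cluster and fit a simple GLM.
library(h2o)
h2o.init()
air <- h2o.importFile("airline_20MM.csv")
dim(air)                                # computed in the H2O cluster, not in R
fit <- h2o.glm(y = "ArrDelay", x = c("Distance", "DayOfWeek"),
               training_frame = air, family = "gaussian")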
31. Hadoop and RHadoop
RHadoop is a collection of five R packages:
• rhdfs: basic connectivity to the Hadoop Distributed File System; R programmers can browse, read, write, and modify files stored in HDFS from within R.
• rhbase: basic connectivity to the HBase distributed database.
• plyrmr: data manipulation operations.
• rmr2: lets R developers perform statistical analysis in R via Hadoop MapReduce functionality on a Hadoop cluster (a toy sketch follows this list).
• ravro: read and write Avro files from the local and HDFS file systems.
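A toy rmr2 sketch (assumes a configured Hadoop cluster and the rmr2 package; the squaring job is purely illustrative):

library(rmr2)
small <- to.dfs(1:1000)                 # push a small vector into HDFS
squares <- mapreduce(input = small,
                     map = function(k, v) keyval(v, v^2))
head(from.dfs(squares)$val)             # retrieve results back into R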