R The unsung hero of Big Data
Dhafer Malouche
CEAFE, Beit El Hikma, June 21st, 2018
ESSAI-MASE-Carthage University
http://dhafermalouche.net
What’s R
• Free software environment for statistical computation
• Created in 1992 by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand
• Statistical computing
• Data Extraction
• Data Cleaning
• Data Visualization
• Modeling
• Almost 13,000 contributed packages
• IDE: RStudio
• One of the most popular statistical software environments
R Environment
RStudio
Some other features
• Reporting with R Markdown: HTML, PDF, Word...
• Dynamic data visualization1: Plotly, highcharter, rbokeh, dygraphs,
leaflet, googleVis...
• Dashboards with flexdashboard
• Sophisticated statistical web apps with Shiny
• R can be called from Python, Julia...
1 https://www.htmlwidgets.org
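As a minimal illustration of the Shiny idea mentioned above (a sketch of my own, not from the slides): a UI definition paired with a server function, using only the core Shiny API.

```r
# Minimal Shiny app: a slider controlling a histogram (illustrative sketch)
library(shiny)

ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruptions")
  })
}

# shinyApp(ui, server)  # uncomment to launch the app in a browser
```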
However
• R holds data sets in memory, so it is not well-suited to data structures
larger than about 10-20% of a computer’s RAM.
• Data exceeding 50% of available RAM are essentially unusable.
• As a rule of thumb, a data set is large if it exceeds 20% of the RAM on a
given machine and massive if it exceeds 50%.
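The rule of thumb above can be written as a small helper (a sketch; the function name and the interface are mine, only the 20%/50% thresholds come from the slide):

```r
# Classify a data set by its size relative to available RAM,
# using the 20% / 50% thresholds from the slide.
classify_dataset <- function(size_gb, ram_gb) {
  ratio <- size_gb / ram_gb
  if (ratio > 0.5) "massive"
  else if (ratio > 0.2) "large"
  else "manageable"
}

classify_dataset(1.6, 16)  # the 1.6 GB airline file on a 16 GB machine -> "manageable"
```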
Big Data and R
Can we then handle Big Data in R?
Solutions offered by R
• Within R
• ff, ffbase, ffbase2, and bigmemory to enhance out-of-memory
performance
• Apply statistical methods to large R objects through the biglm, bigalgebra,
bigmemory...
• bigvis package for large data visualization
• Faster data manipulation methods available in the data.table package
• Connecting R to popular Big Data tools
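For instance, data.table's grouped aggregation uses its `[i, j, by]` syntax (a sketch on a small built-in data set; the airline file is not needed, and `fread()` would be used for CSV import as shown later in the slides):

```r
library(data.table)

# Build a data.table from a built-in data frame
dt <- as.data.table(mtcars)

# Grouped aggregation: mean mpg and group size per number of cylinders
dt[, .(mean_mpg = mean(mpg), n = .N), by = cyl]
```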
Types of data
• Medium-sized files that fit within memory limits but are cumbersome to
process (typically in the 1-2 GB range): read.csv,
read.table...
• Large files that cannot be loaded into R due to R/OS limitations. Two
sub-groups:
• Large files (2-10 GB): can be processed locally using some workaround
solutions: read.table.ffdf, fread.
• Very large files (> 10 GB): need distributed, large-scale computing:
Hadoop, H2O, Spark...
Medium sized files
Airline Data
airline_20MM.csv ∼ 1.6 GB, 20 million observations, 26 variables.
Comparing three methods to import a medium-sized data set
• Standard read.csv
> system.time(DF1 <- read.csv("airline_20MM.csv",stringsAsFactors=FALSE))
user system elapsed
162.832 12.785 180.584
• Optimized read.csv
> ptm<-proc.time()
> length(readLines("airline_20MM.csv"))
[1] 20000001
> proc.time()-ptm
user system elapsed
26.097 0.588 26.766
> classes <- c("numeric",rep("character",3),rep("numeric",22))
> system.time(DF2 <- read.csv("airline_20MM.csv", header = TRUE, sep = ",",
+ stringsAsFactors = FALSE, nrow = 20000001, colClasses = classes))
user system elapsed
68.232 3.672 72.154
• fread
> system.time(DT1 <- fread("airline_20MM.csv"))
Read 20000000 rows and 26 (of 26) columns from 1.505 GB file in 00:00:18
user system elapsed
15.113 2.443 23.715
Large datasets with size 2-10 GB
• Too big for comfortable in-memory processing, yet too small to justify
distributed computing
• Two solutions
• The big... packages: bigmemory, bigalgebra, biganalytics
• The ff family: ff, ffbase, ffbase2
ff, ffbase and ffbase2 packages
• Created in 2012 by Adler, Glaser, Nenadic, Oehlschlägel, and Zucchini.
Already more than 340,000 downloads.
• ff chunks the data set and stores it on the hard drive, keeping only a
small portion in RAM at a time.
• It includes a number of general data-processing functions.
• The ffbase package lets users apply a number of statistical and
mathematical operations on ff objects.
ff, ffbase and ffbase2 packages, Example
• Create a directory for the chunk files
> system("mkdir air20MM")
> list.dirs()
...
[121] "./air20MM"
....
• Set the path to this newly created folder, which will store the ff data chunks
> options(fftempdir = "./air20MM")
ff, ffbase and ffbase2 packages, Example
• Import the data to R
> air20MM.ff <- read.table.ffdf(file="airline_20MM.csv",
+ sep=",", VERBOSE=TRUE,
+ header=TRUE, next.rows=400000,
+ colClasses=NA)
read.table.ffdf 1..400000 (400000) csv-read=3.224sec ffdf-write=0.397sec
read.table.ffdf 400001..800000 (400000) csv-read=3.174sec ffdf-write=0.205sec
read.table.ffdf 800001..1200000 (400000) csv-read=3.033sec ffdf-write=0.198sec
...
...
read.table.ffdf 20000001..20000000 (0) csv-read=0.045sec
csv-read=141.953sec ffdf-write=67.208sec TOTAL=209.161sec
• Memory size, dimension
> format(object.size(air20MM.ff),units = "MB")
[1] "0.1 Mb"
> class(air20MM.ff)
[1] "ffdf"
> dim(air20MM.ff)
[1] 20000000 26
• One binary file for each variable
> list.files("./air20MM")
[1] "ffdf2c9103fa5e4.ff" "ffdf2c915cd46aa.ff" "ffdf2c919345992.ff"
[4] "ffdf2c919f020c5.ff" "ffdf2c91b4e0b28.ff" "ffdf2c91fdfba1f.ff"
[7] "ffdf2c920be7d19.ff" "ffdf2c922e00bb9.ff" "ffdf2c92321b092.ff"
[10] "ffdf2c9263bfa45.ff"
....
ff, ffbase and ffbase2 packages, Example
• Size of the binary files (80 MB each)
> file.size("./air20MM/ffdf2c9103fa5e4.ff")
[1] 8e+07
• The binary file of a given variable
> basename(filename(air20MM.ff$DayOfWeek))
[1] "ffdf2c92babdb9f.ff"
• Many other operations:
• Saving and loading ff objects
• Computing tables with table.ff
• Converting a numeric vector to a factor with cut.ff
• Value matching with ffmatch
• Fitting Generalized Linear Models (GLM) with bigglm.ffdf
...and many others!
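A sketch of a few of these operations on the ffdf object created above (assumes air20MM.ff exists; the column names ArrDelay, DepDelay, and Distance are assumed from the airline data and may differ):

```r
library(ffbase)  # also loads ff

# Frequency table of a column, computed chunk-wise without loading it into RAM
table.ff(air20MM.ff$DayOfWeek)

# Save the ffdf object; the data itself stays in the .ff files on disk
ffsave(air20MM.ff, file = "air20MM_backup")

# Fit a GLM in chunks on the ffdf via biglm
library(biglm)
fit <- bigglm.ffdf(ArrDelay ~ DepDelay + Distance, data = air20MM.ff)
```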
bigmemory, Example
• Reading big matrices
> ptm<-proc.time()
> air20MM.matrix <- read.big.matrix("airline_20MM.csv",
+ type ="integer", header = TRUE, backingfile = "air20MM.bin",
+ descriptorfile ="air20MM.desc", extraCols =NULL)
> proc.time()-ptm
user system elapsed
109.665 2.425 113.741
• Size, dimensions.
> dim(air20MM.matrix)
[1] 2.0e+07 2.6e+01
> object.size(air20MM.matrix)
696 bytes
• Files.
> file.exists("air20MM.desc")
[1] TRUE
> file.exists("air20MM.bin")
[1] TRUE
> file.size("air20MM.desc")
[1] 753
> file.size("air20MM.bin")/1024^3
[1] 1.937151
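One payoff of the backing files is that the matrix can be re-attached instantly in a later R session, with no re-import (a sketch; assumes the air20MM.desc descriptor created above, and a DepDelay column as in the airline data):

```r
library(bigmemory)

# Re-attach the file-backed matrix via its descriptor file -- near-instant,
# since the data already lives in air20MM.bin on disk
air20MM.matrix <- attach.big.matrix("air20MM.desc")

# biganalytics provides summary statistics on big.matrix objects
library(biganalytics)
colmean(air20MM.matrix, cols = "DepDelay", na.rm = TRUE)
```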
Large Scale Computing
Apache Spark
• Speed: runs workloads up to 100x faster.
• Ease of use: write applications
quickly in Java, Scala, Python, R,
and SQL.
• Generality: combine SQL, streaming, and
complex analytics.
sparklyr: R interface for Apache Spark
• Connect to Spark from R. The sparklyr package provides a complete
dplyr backend.
• Filter and aggregate Spark datasets then bring them into R for analysis and
visualization.
• Use Spark’s distributed machine learning library from R. Create extensions
that call the full Spark API and provide interfaces to Spark packages.
Connecting Spark to R
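The connection step shown on these slides boils down to a few lines (a sketch; assumes a local Spark installation, and the version number is illustrative):

```r
library(sparklyr)

# One-time: install a local copy of Spark (version is illustrative)
# spark_install(version = "2.3.0")

# Connect to a local Spark instance; sc is the connection object
# used by copy_to(), spark_read_csv(), the ml_* functions, etc.
sc <- spark_connect(master = "local")

# ... work with Spark ...

spark_disconnect(sc)
```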
Managing data in Spark from R
• Copying data from R to Spark: dplyr package
> library(dplyr)
> iris_tbl <- copy_to(sc, iris)
• Reading csv files
> airline_20MM_sp <- spark_read_csv(sc, "airline_20MM",
+ "airline_20MM.csv")
• Munging and Managing data on Spark from R: quickly getting statistics on
Massive data.
• Execute SQL queries directly against tables within a Spark cluster.
> library(DBI)
> query1 <- dbGetQuery(sc, "SELECT * FROM airline_20MM WHERE MONTH = 9")
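Getting quick statistics on a massive Spark table uses ordinary dplyr verbs, which sparklyr translates into Spark SQL executed on the cluster (a sketch; assumes sc and airline_20MM_sp from above, and the column names Month and ArrDelay are assumptions about the airline data):

```r
library(dplyr)

# Grouped summary computed by Spark; only the small result is
# brought back into R's memory by collect()
delay_by_month <- airline_20MM_sp %>%
  group_by(Month) %>%
  summarise(mean_delay = mean(ArrDelay), n = n()) %>%
  collect()
```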
Managing data in Spark from R
• Machine Learning procedures on Spark:
• ml_decision_tree for decision trees
• ml_linear_regression for regression models
• ml_gaussian_mixture for fitting Gaussian mixture models via the EM
algorithm
• ....
• Example
> mtcars_tbl <- copy_to(sc, mtcars)
> partitions <- mtcars_tbl %>%
+ filter(hp >= 100) %>%
+ mutate(cyl8 = cyl == 8) %>%
+ sdf_partition(training = 0.5, test = 0.5, seed = 1099)
> fit <- partitions$training %>%
+ ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
More things to do on Spark from R
• Reading and writing data in CSV, JSON, and Parquet formats:
spark_write_csv, spark_write_parquet, spark_write_json
• Execute arbitrary R code across your cluster using spark_apply
> spark_apply(iris_tbl, function(data) {
+ data[1:4] + rgamma(1,2)
+ })
• View the Spark web console using the spark_web function:
> spark_web(sc)
H2O
• Software for machine learning and data analysis.
• Open source (the liberal Apache license)
• Easy to use
• Scalable to big data
• Well-documented and commercially supported.
• Website: https://www.h2o.ai/h2o/
How to install H2O?2
It takes a few minutes; ∼ 134 MB to download.
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
# Now we download, install and initialize the H2O package for R.
install.packages("h2o", type="source",
repos="http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/R")
# Finally, let's load H2O and start up an H2O cluster
library(h2o)
h2o.init()
2 Procedure available at
http://h2o-release.s3.amazonaws.com/h2o/rel-wright/2/index.html
Munging data and ML in H2O from R
• Importing data files h2o.importFile
• Importing multiple files h2o.importFolder
• Combining data sets by columns and rows h2o.cbind and h2o.rbind
• Grouping by one or more columns and applying a function to each group: group_by
• Imputing missing values h2o.impute
• And the most important Machine Learning algorithms: PCA, Random Forests,
Regression Models and Classifications, Gradient Boosting Machine....
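A sketch tying these together (assumes a running local H2O cluster started with h2o.init(); the file path and the column names DepDelay, Distance, and ArrDelay are assumptions about the airline data):

```r
library(h2o)
h2o.init()

# Import a file directly into the H2O cluster; air.hex is a handle,
# the data itself lives in H2O's memory, not R's
air.hex <- h2o.importFile("airline_20MM.csv")

# Impute missing values in a column by its mean
h2o.impute(air.hex, "DepDelay", method = "mean")

# Fit a GLM on the cluster, without the data entering R's memory
fit <- h2o.glm(x = c("DepDelay", "Distance"), y = "ArrDelay",
               training_frame = air.hex)
```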
Hadoop and RHadoop
RHadoop is a collection of five R packages:
• rhdfs: Basic connectivity to the Hadoop Distributed File System. R
programmers can browse, read, write, and modify files stored in HDFS from
within R.
• rhbase: Basic connectivity to the HBASE distributed database.
• plyrmr: Data manipulation operations.
• rmr2: Allows R developers to perform statistical analysis in R via Hadoop
MapReduce functionality on a Hadoop cluster.
• ravro: Read and write Avro files from local and HDFS file systems.
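The canonical rmr2 "hello world" gives the flavor of MapReduce from R (a sketch; the local backend lets it run without a Hadoop cluster):

```r
library(rmr2)

# Use the local backend for testing -- no Hadoop cluster required
rmr.options(backend = "local")

# MapReduce job: map each value of 1..10 to its square
squares <- from.dfs(
  mapreduce(
    input = to.dfs(1:10),
    map = function(k, v) keyval(v, v^2)
  )
)
```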
Resources
https://spark.rstudio.com/