Scalable Data Analysis in R

Lee E. Edlefsen
Chief Scientist
Webinar, October 26, 2011

Introduction

• Our ability to collect and store data has rapidly been outpacing our ability to analyze it
• We need scalable data analysis software
• R is the ideal platform for such software: universal data language, easy to add new functionality, powerful, flexible, forgiving
• I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR

RevoScaleR package

• Part of Revolution R Enterprise
• Provides data management and data analysis functionality
• Scales from small to huge data
• Is internally threaded to use multiple cores
• Distributes computations across multiple computers (in Version 5.0 beta on Windows)
• Revolution R Enterprise is free for academic use

Overview of my talk

• What do I mean by scalability?
• Why bother?
• Revolution’s approach to scalability:
  - R code stays the same as much as possible
  - “Chunking” – operate on chunks of data
  - Parallel external memory algorithms
  - Implementation issues
• Benchmarks

Scalability

• From small in-memory data.frames to multi-terabyte data sets distributed across space and even time
• From single cores on a single computer to multiple cores on multiple computers/nodes
• From local hardware to remote clouds
• For both data management and data analysis

Keys to scalability

• Most importantly, must be able to process more data than can fit into memory on one computer at one time
• Process data in “chunks”
• Requires “external memory algorithms”
• Must be fast
• Must be capable of working on multiple platforms

Huge data sets are becoming common

• Digital data is not only cheaper to store, it is cheaper to collect
  - More data is collected digitally
  - Increasingly, data of all types is being put directly into databases
• It is easier now to merge and clean data, and there is a big and growing market for such databases

Huge benefits to huge data

• More information, more to be learned
• Variables and relationships can be visualized and analyzed in much greater detail
• Can allow the data to speak for itself; can relax or eliminate assumptions
• Can get better predictions and better understanding of effects

For more on this, see the Revolution Analytics white paper at http://bit.ly/biganalytics

Poll Question

• For you, how big is big data?

Two huge problems: capacity and speed

• Capacity: problems handling the size of data sets or models
  - Data too big to fit into memory
  - Even if it can fit, there are limits on what can be done
  - Even simple data management can be extremely challenging
• Speed: even without a capacity limit, computation may be too slow to be useful

We are currently incapable of analyzing a lot of the data we have

• The most commonly used statistical software tools either fail completely or are too slow to be useful on huge data sets
• In many ways we are back where we were in the ’70s and ’80s in terms of ability to handle common data sizes
• Fortunately, this is changing

Requires software solutions

• For decades the rising tide of technology has allowed the same data analysis code to run faster and on bigger data sets
• That happy era has ended
• The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM
• Need software that can use multiple cores, multiple hard drives, and multiple computers

High Performance Analytics: HPA

• HPA is HPC + Data
• High Performance Computing is CPU centric
  - Lots of processing on small amounts of data
  - Focus is on cores
• High Performance Analytics is data centric
  - Less processing per amount of data
  - Focus is on feeding data to the cores
    - On disk I/O
    - On efficient threading and data management in RAM

Revolution’s approach to HPA and scalability in the RevoScaleR package

• User interface
  - Set of R functions (but mostly implemented in C++)
  - Keep things as familiar as possible
• Capacity and speed
  - Handle data in large “chunks” – blocks of rows and columns
  - Use parallel external memory algorithms

Some interface design goals

• Should be able to run the same analysis code on a small data.frame in memory and on a huge distributed data file on a remote cluster
• Should be able to do data transformations using standard R language on standard R objects
• Should be able to use the R formula language for analysis

Sample code for logit on laptop

# Standard formula language
# Transformations are allowed in the formula, in a function,
# or in a list of expressions
# Row selections are also allowed (not shown)

rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
        UniqueCarrier + F(CRSDepTime), data = airData)

Sample code for logit on a cluster

# Just change the “compute context”
rxOptions(computeContext = myClust)

# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek +
        UniqueCarrier + F(CRSDepTime), data = airData)

Sample data transformation code

# Standard R code on standard R objects
# Given as a function or a list of expressions
# Used in a separate data step or in an analysis call

transforms <- list(
  DepTime = ConvertToDecimalTime(DepTime),
  Late = ArrDelay > 15
)

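To show where such a list plugs in, here is a rough sketch of both uses mentioned above. The rxDataStep call, the file names, and the ConvertToDecimalTime helper are illustrative assumptions, not code from the talk:

# Hypothetical helper: convert HHMM integer times (e.g. 1430) to decimal hours (14.5)
ConvertToDecimalTime <- function(t) (t %/% 100) + (t %% 100) / 60

# Option 1: in a separate data step (assumed rxDataStep arguments)
rxDataStep(inData = airData, outFile = "airTransformed.xdf",
           transforms = list(
             DepTime = ConvertToDecimalTime(DepTime),
             Late = ArrDelay > 15))

# Option 2: supplied directly in an analysis call
rxLogit(Late ~ DayOfWeek + UniqueCarrier, data = airData,
        transforms = list(Late = ArrDelay > 15))
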
Chunking

• Operate on blocks of rows for selected columns
• Huge data can be processed a chunk at a time in a fixed amount of RAM
• Bigger chunks are better, up to a point:
  - Makes it possible to take maximum advantage of disk bandwidth
  - Minimizes threading overhead
  - Can take maximum advantage of R’s vectorized code

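The idea can be illustrated without RevoScaleR at all: read a large CSV through a connection one block of rows at a time and accumulate a result, so that memory use depends only on the chunk size. A minimal base-R sketch (the file name and the ArrDelay column are placeholders):

# Chunked processing sketch: proportion of late arrivals in a CSV too large
# to load at once. Only one chunk of chunkRows rows is ever held in memory.
con <- file("airline.csv", open = "r")
colNames <- strsplit(readLines(con, n = 1), ",")[[1]]   # header row
nLate <- 0; nTotal <- 0; chunkRows <- 1e6
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunkRows, header = FALSE, col.names = colNames),
    error = function(e) NULL)                           # no rows left
  if (is.null(chunk) || nrow(chunk) == 0) break
  nLate  <- nLate  + sum(chunk$ArrDelay > 15, na.rm = TRUE)
  nTotal <- nTotal + sum(!is.na(chunk$ArrDelay))
}
close(con)
nLate / nTotal
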
The basis for a solution for capacity, speed, distributed and streaming data – PEMAs

• Parallel external memory algorithms (PEMAs) allow solution of both capacity and speed problems, and can deal with distributed and streaming data
• External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time
• It is possible to “automatically” parallelize and distribute such algorithms

External memory algorithms are widely available

• Some statistics packages originating in the ’70s and ’80s, such as SAS and Gauss, were based almost entirely on external memory algorithms
• Examples of external memory algorithms:
  - data management (transformations, new variables, subsets, sorts, merges, converting to factors)
  - descriptive statistics, cross tabulations
  - linear models, glm, mixed effects
  - prediction/scoring
  - many maximum likelihood and other optimization algorithms
  - many clustering procedures, tree models

Parallel algorithms are widely available

• Parallel algorithms are those that allow different portions of a computation to be done “at the same time” using different cores
• There has been a lot of research on parallel computing over the past 20-30 years, and many helpful tools are now available

Not much literature on PEMAs

• Unfortunately, there is not much literature on parallel external memory algorithms, especially for statistical computations
• In what follows, I outline an approach to PEMAs for doing statistical computations

Parallelizing external memory algorithms

• Parallel algorithms split a job into tasks that can be run simultaneously
• External memory algorithms split a job into tasks that operate on separate blocks of data
• External memory algorithms usually are run sequentially, but with proper care most can be parallelized efficiently and “automatically”
• The key is to split the tasks appropriately and to minimize inter-thread communication and synchronization

Parallel Computations with Synchronization and Communication
(diagram)

Embarrassingly Parallel Computations
(diagram)

Sequential External Memory Computations
(diagram)

Parallel External Memory Computations
(diagram)

Example external memory algorithm for the mean of a variable

• Initialization task: total = 0, count = 0
• Process data task: for each block of x, total = sum(x), count = length(x)
• Update results task: combined total = sum(all totals), combined count = sum(all counts)
• Process results task: mean = combined total / combined count

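The four tasks map directly onto four small R functions. A minimal plain-R sketch (not the RevoScaleR implementation), with the data arriving as a sequence of numeric chunks:

# External memory mean, written as the four tasks from the slide
initialize     <- function() list(total = 0, count = 0)
processData    <- function(x) list(total = sum(x), count = length(x))   # one chunk
updateResults  <- function(a, b) list(total = a$total + b$total,
                                      count = a$count + b$count)
processResults <- function(res) res$total / res$count

# Sequential driver: the chunks could come from a file, a database cursor, etc.
chunks <- split(rnorm(1e6), rep(1:100, each = 1e4))   # stand-in for blocks of rows
result <- initialize()
for (x in chunks) result <- updateResults(result, processData(x))
processResults(result)   # equals mean(unlist(chunks))
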
PEMAs in RevoScaleR

• Analytics algorithms are implemented in a framework that automatically and efficiently parallelizes code
• Key is arranging code in virtual functions:
  - Initialize: allows initializations
  - ProcessData: processes a chunk of data at a time
  - UpdateResults: updates one set of results from another
  - ProcessResults: any final processing

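Because ProcessData touches only its own chunk and UpdateResults can merge any two result sets, the four functions from the mean sketch above can be run in parallel and then reduced. A rough illustration of the pattern with R’s parallel package, not RevoScaleR’s C++ framework:

library(parallel)

# initialize, processData, updateResults, processResults, and chunks
# are the objects defined in the mean sketch above.
cl <- makeCluster(4)                        # one worker per core
clusterExport(cl, "processData")            # workers only need the per-chunk task

partial  <- parLapply(cl, chunks, processData)            # process chunks in parallel
combined <- Reduce(updateResults, partial, initialize())  # merge per-chunk results
processResults(combined)

stopCluster(cl)
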
PEMAs in RevoScaleR (2)

• Analytic code is completely independent of:
  - data path code that feeds data to ProcessData
  - threading code for using multiple cores
  - inter-computer communication code for distributing across nodes
• Communication among machines is highly abstracted (currently can use MPI, RPC)
• Implemented in C++ now, but we plan to add an implementation in R

Storing and reading data

• ScaleR can process data from a variety of sources
• Has its own optimized format (XDF) that is especially suitable for chunking
• Allows rapid access to blocks of rows for selected columns
• The time it takes to read a block of rows is essentially independent of the total number of rows and columns in the file

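For orientation, a typical XDF workflow looks roughly like this; a sketch using the standard RevoScaleR functions rxImport, rxGetInfo, and rxSummary, with file and variable names assumed:

# Convert a CSV to XDF once, then analyze it block by block
rxImport(inData = "airline.csv", outFile = "airline.xdf", overwrite = TRUE)

rxGetInfo("airline.xdf", getVarInfo = TRUE)             # rows, blocks, variable metadata
rxSummary(~ ArrDelay + DepDelay, data = "airline.xdf")  # reads blocks of rows, not the whole file
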
The XDF file format

• Data is stored in blocks of rows per column
• “Header” information is stored at the end of the file
• Allows sequential reads; tens to hundreds of thousands of times faster than random reads
• Essentially unlimited in size
  - 64-bit row indexing per column
  - 64-bit column indexing (but the practical limit is much smaller)

The XDF file format (2)

• Allows a wide range of storage formats (1-byte to 8-byte signed and unsigned integers; 4- and 8-byte floats; variable-length strings)
• Both new rows and new columns can be added to a file without having to rewrite the file
• Changes to header information (variable names, descriptions, factor levels, and so on) are extremely fast

Overview of data path on a computer

• A DataSource object reads a chunk of data into a DataSet object on the I/O thread
• The DataSet is given to transformation code (data is copied to R for R transformations); variables and rows may be created or removed
• The transformed DataSet is virtually split across computational cores and passed to ProcessData methods on different threads
• Any disk output is stored until the I/O thread can write it to the output DataSource

Handling data in memory

• Use of the appropriate storage format reduces space and reduces the time to move data in memory
• Data copying and conversion are minimized
• For instance, when adding a vector of unsigned shorts to a vector of doubles, the smaller type is not converted until loaded into the CPU

Use of multiple cores per computer

• Code is internally “threaded” so that inter-process communication and data transfer are not required
• One core (typically) handles I/O, while the other cores process data from the previous read
• Data is virtually split across computational cores; each core thinks it has its own private copy

RevoScaleR – Multi-Threaded Processing

(Diagram: data read from disk by Core 0 (Thread 0) into shared memory, with Cores 1 to n (Threads 1 to n) of a multicore processor (4, 8, 16+ cores) processing the data blocks, all within RevoScaleR.)

• A RevoScaleR algorithm is provided a data source as input
• The algorithm loops over the data, reading a block at a time; blocks of data are read by a separate worker thread (Thread 0)
• Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory
• When all of the data has been processed, a master results object is created from the intermediate results objects

Use of multiple computers

• Key to efficiency is minimizing data transfer and communication
• Locality of data!
• For PEMAs, the master node controls computations, telling workers where to get data and what computations to do
• Intermediate results on each node are aggregated across cores
• The master node gathers all results, checks for convergence, and repeats if necessary

RevoScaleR – Distributed Computing

(Diagram: several data partitions, each attached to a compute node running RevoScaleR, coordinated by a master node.)

• Portions of the data source are made available to each compute node
• RevoScaleR on the master node assigns a task to each compute node
• Each compute node independently processes its data, and returns its intermediate results back to the master node
• The master node aggregates all of the intermediate results from each compute node and produces the final result

Data management capabilities

• Import from external sources
• Transform and clean variables
• Code and recode factors
• Missing values
• Validation
• Sort (huge data, but not distributed)
• Merge
• Aggregate, summarize

Analysis algorithms

• Descriptive statistics (rxSummary)
• Tables and cubes (rxCube, rxCrossTabs)
• Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)
• Linear regressions (rxLinMod)
• Logistic regressions (rxLogit)
• K-means clustering (rxKmeans)
• Predictions (scoring) (rxPredict)
• Other algorithms are being developed

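Each of these follows the same calling pattern as the earlier rxLogit example. A short illustration; the airline variable names and file are assumptions:

airData <- "airline.xdf"                                  # XDF file or a data.frame

rxSummary(~ ArrDelay + DepDelay, data = airData)          # descriptive statistics
rxCube(ArrDelay ~ F(DayOfWeek), data = airData)           # mean delay by day of week
fit <- rxLinMod(ArrDelay ~ DayOfWeek + UniqueCarrier, data = airData)
rxPredict(fit, data = airData, outData = "scored.xdf")    # scoring
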
Comparative benchmark vs SAS

• In April 2011 SAS, “the undisputed leader in advanced analytics,” announced that it is developing advanced “high performance analytics” software
• Targeted to run on extremely expensive MPP database platforms from Greenplum and Teradata
• Key design feature is that all data must be in memory; thus huge data requires huge RAM

Comparison with SAS HPA: logistic regression

                      SAS and Greenplum    ScaleR on commodity cluster
  Nodes                        32                       5
  Cores                       384                      20
  RAM                    1,536 GB                   80 GB
  Parameters           “just a few”                    7
  Data location         In memory                 On disk
  Rows of data           1 billion               1 billion
  Time                 80 seconds              44 seconds

Revo R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

More comparative benchmarks

• SAS HPA benchmarks:
  - Logit, 1 billion rows, 32 nodes, 384 cores, in-memory, “just a few” parameters: 80 secs
  - Regression, 50 million rows, 24 nodes, in-memory, 1,800 parameters: 42 secs
• ScaleR: 1 billion rows, 5 nodes, 20 cores ($5K):
  - rxLogit, 7 parameters: 44 secs
  - rxLinMod, 7 parameters: 4.1 secs
  - rxLinMod, 1,848 parameters: 11.9 secs
  - rxLogit, 1,848 parameters: 111.5 secs
  - rxLinMod, 13,537 parameters: 89.8 secs

Benchmarks using the airline data

• Airline on-time performance data produced by the U.S. Department of Transportation; used in the ASA Data Expo 09
• 22 CSV files for the years 1987–2008
• 123.5 million observations, 29 variables, about 13 GB
• For benchmarks, sliced and replicated to get files with 1 million to about 1.25 billion rows
• 5-node commodity cluster (4 cores @ 3.2 GHz and 16 GB RAM per node); Windows HPC Server 2008

Distributed import of airline data

• Copy the 22 CSV files to the 5 nodes, keeping a rough balance in sizes
• Two-pass import process:
  - Pass 1: import/clean/transform/append
  - Pass 2: recode factors whose levels differ
• Import time: about 3 min 20 secs on the 5-node cluster (about 17 minutes on a single node)

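A rough sketch of what pass 1 might look like, assuming rxImport can append successive CSV files to a single XDF file (the append argument, file names, and transform are assumptions; the factor-recoding pass is not shown):

csvFiles <- sprintf("airline_%d.csv", 1987:2008)   # 22 yearly files (names assumed)

# Pass 1: import each CSV, applying cleaning transforms and appending to one XDF
for (i in seq_along(csvFiles)) {
  rxImport(inData = csvFiles[i], outFile = "airline.xdf",
           transforms = list(Late = ArrDelay > 15),
           append = if (i == 1) "none" else "rows",
           overwrite = (i == 1))
}
# Pass 2 would recode factors whose levels differ across files
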
Scalability of RevoScaleR with Rows
Regression, 1 million to 1.1 billion rows, 443 betas (4-core laptop)
(chart)

Scalability of RevoScaleR with Nodes
Regression, 1 billion rows, 443 betas (1 to 5 nodes, 4 cores per node)
(chart)

Scalability of RevoScaleR with Nodes
Logit, 123.5 million rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node)
(chart)

Scalability of RevoScaleR with Nodes
Logit, 1 billion rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node)
(chart)

Poll Question

• Which big data problem do you most need to solve?

Contact information

Lee Edlefsen
lee@revolutionanalytics.com

More Related Content

What's hot

Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
benchmarks-sigmod09
benchmarks-sigmod09benchmarks-sigmod09
benchmarks-sigmod09Hiroshi Ono
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataeSAT Publishing House
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyData
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combinersijcsit
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...Kyong-Ha Lee
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkeldariof
 

What's hot (20)

Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
benchmarks-sigmod09
benchmarks-sigmod09benchmarks-sigmod09
benchmarks-sigmod09
 
Parallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A SurveyParallel Data Processing with MapReduce: A Survey
Parallel Data Processing with MapReduce: A Survey
 
Implementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big dataImplementation of p pic algorithm in map reduce to handle big data
Implementation of p pic algorithm in map reduce to handle big data
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
PyCascading for Intuitive Flow Processing with Hadoop (gabor szabo)
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
43_Sameer_Kumar_Das2
43_Sameer_Kumar_Das243_Sameer_Kumar_Das2
43_Sameer_Kumar_Das2
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013Big dataanalyticsbeyondhadoop public_20_june_2013
Big dataanalyticsbeyondhadoop public_20_june_2013
 
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node CombinersHadoop Mapreduce Performance Enhancement Using In-Node Combiners
Hadoop Mapreduce Performance Enhancement Using In-Node Combiners
 
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple ...
 
A sql implementation on the map reduce framework
A sql implementation on the map reduce frameworkA sql implementation on the map reduce framework
A sql implementation on the map reduce framework
 

Viewers also liked (11)

Researching Semantic Web-Overview
Researching Semantic Web-OverviewResearching Semantic Web-Overview
Researching Semantic Web-Overview
 
MOLDEAS-PhD Summary
MOLDEAS-PhD SummaryMOLDEAS-PhD Summary
MOLDEAS-PhD Summary
 
Introducción a Sistemas de Información
Introducción a Sistemas de InformaciónIntroducción a Sistemas de Información
Introducción a Sistemas de Información
 
Internet, Web 2.0 y Salud 2.0
Internet, Web 2.0 y Salud 2.0Internet, Web 2.0 y Salud 2.0
Internet, Web 2.0 y Salud 2.0
 
HTML5-Aplicaciones web
HTML5-Aplicaciones webHTML5-Aplicaciones web
HTML5-Aplicaciones web
 
HTML5 Audio & Vídeo
HTML5 Audio & VídeoHTML5 Audio & Vídeo
HTML5 Audio & Vídeo
 
QoS Management in Cloud Computing-Draft proposal
QoS Management in Cloud Computing-Draft proposalQoS Management in Cloud Computing-Draft proposal
QoS Management in Cloud Computing-Draft proposal
 
Ejemplos prácticos de Búsqueda en Salud
Ejemplos prácticos de Búsqueda en SaludEjemplos prácticos de Búsqueda en Salud
Ejemplos prácticos de Búsqueda en Salud
 
Introducción a "La Web como una Base de Datos"
Introducción a "La Web como una Base de Datos"Introducción a "La Web como una Base de Datos"
Introducción a "La Web como una Base de Datos"
 
MOLDEAS at City College
MOLDEAS at City CollegeMOLDEAS at City College
MOLDEAS at City College
 
WP4-QoS Management in the Cloud
WP4-QoS Management in the CloudWP4-QoS Management in the Cloud
WP4-QoS Management in the Cloud
 

Similar to Scalable Data Analysis in R Webinar Presentation

useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsenrusersla
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopDataWorks Summit
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisRevolution Analytics
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobsBill Jacobs
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Revolution Analytics
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Revolution Analytics
 
100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0Revolution Analytics
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 

Similar to Scalable Data Analysis in R Webinar Presentation (20)

useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsen
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
 
Revolution Analytics Podcast
Revolution Analytics PodcastRevolution Analytics Podcast
Revolution Analytics Podcast
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobs
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 

More from Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

More from Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 

Scalable Data Analysis in R Webinar Presentation

  • 1. Scalable Data Analysis in R. Lee Edlefsen, Chief Scientist. Webinar, October 26, 2011.
  • 2. Introduction: Our ability to collect and store data has rapidly been outpacing our ability to analyze it. We need scalable data analysis software. R is the ideal platform for such software: universal data language, easy to add new functionality, powerful, flexible, forgiving. I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR.
  • 3. RevoScaleR package: Part of Revolution R Enterprise. Provides data management and data analysis functionality. Scales from small to huge data. Is internally threaded to use multiple cores. Distributes computations across multiple computers (in Version 5.0 beta on Windows). Revolution R Enterprise is free for academic use.
  • 4. Overview of my talk: What do I mean by scalability? Why bother? Revolution’s approach to scalability: R code stays the same as much as possible; “chunking” (operate on chunks of data); parallel external memory algorithms; implementation issues. Benchmarks.
  • 5. Scalability: From small in-memory data.frames to multi-terabyte data sets distributed across space and even time. From single cores on a single computer to multiple cores on multiple computers/nodes. From local hardware to remote clouds. For both data management and data analysis.
  • 6. Keys to scalability: Most importantly, must be able to process more data than can fit into memory on one computer at one time. Process data in “chunks”. Requires “external memory algorithms”. Must be fast. Must be capable of working on multiple platforms.
  • 7. Huge data sets are becoming common: Digital data is not only cheaper to store, it is cheaper to collect. More data is collected digitally. Increasingly more data of all types is being put directly into databases. It is easier now to merge and clean data, and there is a big and growing market for such databases.
  • 8. Huge benefits to huge data: More information, more to be learned. Variables and relationships can be visualized and analyzed in much greater detail. Can allow the data to speak for itself; can relax or eliminate assumptions. Can get better predictions and better understandings of effects. For more on this see the Revolution Analytics white paper at http://bit.ly/biganalytics.
  • 9. Poll question: For you, how big is big data?
  • 10. Two huge problems: capacity and speed. Capacity: problems handling the size of data sets or models (data too big to fit into memory; even if it can fit, there are limits on what can be done; even simple data management can be extremely challenging). Speed: even without a capacity limit, computation may be too slow to be useful.
  • 11. We are currently incapable of analyzing a lot of the data we have: The most commonly-used statistical software tools either fail completely or are too slow to be useful on huge data sets. In many ways we are back where we were in the ’70s and ’80s in terms of ability to handle common data sizes. Fortunately, this is changing.
  • 12. Requires software solutions: For decades the rising tide of technology has allowed the same data analysis code to run faster and on bigger data sets. That happy era has ended. The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM. Need software that can use multiple cores, multiple hard drives, and multiple computers.
  • 13. High Performance Analytics: HPA. HPA is HPC + Data. High Performance Computing is CPU centric: lots of processing on small amounts of data; the focus is on cores. High Performance Analytics is data centric: less processing per amount of data; the focus is on feeding data to the cores, on disk I/O, and on efficient threading and data management in RAM.
  • 14. Revolution’s approach to HPA and scalability in the RevoScaleR package: User interface: a set of R functions (but mostly implemented in C++); keep things as familiar as possible. Capacity and speed: handle data in large “chunks” (blocks of rows and columns); use parallel external memory algorithms.
  • 15. Some interface design goals: Should be able to run the same analysis code on a small data.frame in memory and on a huge distributed data file on a remote cluster. Should be able to do data transformations using standard R language on standard R objects. Should be able to use the R formula language for analysis.
  • 16. Sample code for logit on laptop: Standard formula language; transformations are allowed in the formula, or in a function, or in a list of expressions; row selections are also allowed (not shown). rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
  • 17. Sample code for logit on a cluster: Just change the “compute context”: rxOptions(computeContext = myClust). Otherwise, the code is the same: rxLogit(ArrDelay > 15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data = airData)
  • 18. Sample data transformation code: Standard R code on standard R objects; a function or a list of expressions; in a separate data step or in the analysis call. transforms <- list(DepTime = ConvertToDecimalTime(DepTime), Late = ArrDelay > 15). (A fuller sketch combining these transformations with the rxLogit call from slides 16 and 17 appears after the transcript.)
  • 19. Chunking: Operate on blocks of rows for selected columns. Huge data can be processed a chunk at a time in a fixed amount of RAM. Bigger is better up to a point: it makes it possible to take maximum advantage of disk bandwidth, minimizes threading overhead, and can take maximum advantage of R’s vectorized code.
  • 20. The basis for a solution for capacity, speed, distributed and streaming data: PEMAs. Parallel external memory algorithms (PEMAs) allow solution of both capacity and speed problems, and can deal with distributed and streaming data. External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time. It is possible to “automatically” parallelize and distribute such algorithms.
  • 21. External memory algorithms are widely available: Some statistics packages originating in the ’70s and ’80s, such as SAS and Gauss, were based almost entirely on external memory algorithms. Examples of external memory algorithms: data management (transformations, new variables, subsets, sorts, merges, converting to factors); descriptive statistics and cross tabulations; linear models, glm, mixed effects; prediction/scoring; many maximum likelihood and other optimization algorithms; many clustering procedures and tree models.
  • 22. Parallel algorithms are widely available: Parallel algorithms are those that allow different portions of a computation to be done “at the same time” using different cores. There has been a lot of research on parallel computing over the past 20-30 years, and many helpful tools are now available.
  • 23. Not much literature on PEMAs: Unfortunately, there is not much literature on parallel external memory algorithms, especially for statistical computations. In what follows, I outline an approach to PEMAs for doing statistical computations.
  • 24. Parallelizing external memory algorithms: Parallel algorithms split a job into tasks that can be run simultaneously. External memory algorithms split a job into tasks that operate on separate blocks of data. External memory algorithms usually are run sequentially, but with proper care most can be parallelized efficiently and “automatically”. The key is to split the tasks appropriately and to minimize inter-thread communication and synchronization.
  • 25. Parallel Computations with Synchronization and Communication (diagram slide).
  • 26. Embarrassingly Parallel Computations (diagram slide).
  • 27. Sequential External Memory Computations (diagram slide).
  • 28. Parallel External Memory Computations (diagram slide).
  • 29. Example external memory algorithm for the mean of a variable: Initialization task: total = 0, count = 0. Process data task: for each block of x, total = sum(x) and count = length(x). Update results task: combined total = sum(all totals), combined count = sum(all counts). Process results task: mean = combined total / combined count. (A runnable base-R sketch of this chunk-at-a-time pattern appears after the transcript.)
  • 30. PEMAs in RevoScaleR: Analytics algorithms are implemented in a framework that automatically and efficiently parallelizes code. The key is arranging code in virtual functions: Initialize (allows initializations); ProcessData (process a chunk of data at a time); UpdateResults (update one set of results from another); ProcessResults (any final processing).
  • 31. PEMAs in RevoScaleR (2): Analytic code is completely independent of the data path code that feeds data to ProcessData, the threading code for using multiple cores, and the inter-computer communication code for distributing across nodes. Communication among machines is highly abstracted (currently can use MPI, RPC). Implemented in C++ now, but we plan to add an implementation in R. (A plain-R illustration of this four-function structure appears after the transcript.)
  • 32. Storing and reading data: ScaleR can process data from a variety of sources. Has its own optimized format (XDF) that is especially suitable for chunking. Allows rapid access to blocks of rows for selected columns. The time it takes to read a block of rows is essentially independent of the total number of rows and columns in the file.
  • 33. The XDF file format: Data is stored in blocks of rows per column. “Header” information is stored at the end of the file. Allows sequential reads; tens to hundreds of thousands of times faster than random reads. Essentially unlimited in size: 64-bit row indexing per column and 64-bit column indexing (but the practical limit is much smaller).
  • 34. The XDF file format (2): Allows a wide range of storage formats (1-byte to 8-byte signed and unsigned integers; 4- and 8-byte floats; variable-length strings). Both new rows and new columns can be added to a file without having to rewrite the file. Changes to header information (variable names, descriptions, factor levels, and so on) are extremely fast.
  • 35. Overview of data path on a computer: A DataSource object reads a chunk of data into a DataSet object on the I/O thread. The DataSet is given to transformation code (data copied to R for R transformations); variables and rows may be created or removed. The transformed DataSet is virtually split across computational cores and passed to ProcessData methods on different threads. Any disk output is stored until the I/O thread can write it to the output DataSource.
  • 36. Handling data in memory: Use of appropriate storage format reduces space and reduces time to move data in memory. Data copying and conversion is minimized. For instance, when adding a vector of unsigned shorts to a vector of doubles, the smaller type is not converted until loaded into the CPU.
  • 37. Use of multiple cores per computer: Code is internally “threaded” so that inter-process communication and data transfer is not required. One core (typically) handles I/O, while the other cores process data from the previous read. Data is virtually split across computational cores; each core thinks it has its own private copy.
  • 38. RevoScaleR: Multi-Threaded Processing (diagram slide: a multicore processor with shared memory, where Thread 0 reads blocks of data from disk and Threads 1..n process them). A RevoScaleR algorithm is provided a data source as input. The algorithm loops over the data, reading a block at a time; blocks of data are read by a separate worker thread (Thread 0). Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory. When all of the data has been processed, a master results object is created from the intermediate results objects.
  • 39. Use of multiple computers: Key to efficiency is minimizing data transfer and communication: locality of data! For PEMAs, the master node controls computations, telling workers where to get data and what computations to do. Intermediate results on each node are aggregated across cores. The master node gathers all results, checks for convergence, and repeats if necessary.
  • 40. RevoScaleR: Distributed Computing (diagram slide: a master node coordinating several compute nodes, each with its own partition of the data). Portions of the data source are made available to each compute node. RevoScaleR on the master node assigns a task to each compute node. Each compute node independently processes its data and returns its intermediate results to the master node. The master node aggregates all of the intermediate results from each compute node and produces the final result.
  • 41. Data management capabilities: Import from external sources. Transform and clean variables. Code and recode factors. Missing values. Validation. Sort (huge data, but not distributed). Merge. Aggregate, summarize.
  • 42. Analysis algorithms: Descriptive statistics (rxSummary). Tables and cubes (rxCube, rxCrossTabs). Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP). Linear regressions (rxLinMod). Logistic regressions (rxLogit). K-means clustering (rxKmeans). Predictions/scoring (rxPredict). Other algorithms are being developed. (Short usage sketches for several of these functions appear after the transcript.)
  • 43. Comparative benchmark vs SAS: In April 2011 SAS, “the undisputed leader in advanced analytics,” announced that it is developing advanced “high performance analytics” software. Targeted to run on extremely expensive MPP database platforms from Greenplum and Teradata. Key design feature is that all data must be in memory; thus huge data requires huge RAM.
  • 44. Comparison with SAS HPA: logistic regression.
                          SAS and Greenplum     ScaleR on commodity cluster
        Nodes             32                    5
        Cores             384                   20
        RAM               1,536 GB              80 GB
        Parameters        “just a few”          7
        Data location     In memory             On disk
        Rows of data      1 billion             1 billion
        Time              80 seconds            44 seconds
        Revo R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
  • 45. More comparative benchmarks: SAS HPA benchmarks: logit, 1 billion rows, 32 nodes, 384 cores, in-memory, “just a few” parameters: 80 secs; regression, 50 million rows, 24 nodes, in-memory, 1,800 parameters: 42 secs. ScaleR, 1 billion rows, 5 nodes, 20 cores ($5K): rxLogit, 7 parameters: 44 secs; rxLinMod, 7 parameters: 4.1 secs; rxLinMod, 1,848 params: 11.9 secs; rxLogit, 1,848 params: 111.5 secs; rxLinMod, 13,537 params: 89.8 secs.
  • 46. Benchmarks using the airline data: Airline on-time performance data produced by the U.S. Department of Transportation; used in the ASA Data Expo 09. 22 CSV files for the years 1987-2008. 123.5 million observations, 29 variables, about 13 GB. For benchmarks, sliced and replicated to get files with 1 million to about 1.25 billion rows. 5-node commodity cluster (4 cores @ 3.2 GHz and 16 GB RAM per node); Windows HPC Server 2008.
  • 47. Distributed import of airline data: Copy the 22 CSV files to the 5 nodes, keeping a rough balance in sizes. Two-pass import process: Pass 1: import/clean/transform/append; Pass 2: recode factors whose levels differ. Import time: about 3 min 20 secs on the 5-node cluster (about 17 minutes on a single node). (A sketch of a CSV-to-XDF import loop appears after the transcript.)
  • 48. Scalability of RevoScaleR with rows: regression, 1 million to 1.1 billion rows, 443 betas (4-core laptop). (Chart slide.)
  • 49. Scalability of RevoScaleR with nodes: regression, 1 billion rows, 443 betas (1 to 5 nodes, 4 cores per node). (Chart slide.)
  • 50. Scalability of RevoScaleR with nodes: logit, 123.5 million rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node). (Chart slide.)
  • 51. Scalability of RevoScaleR with nodes: logit, 1 billion rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node). (Chart slide.)
  • 52. Poll question: Which big data problem do you most need to solve?
  • 53. Contact information: Lee Edlefsen, lee@revolutionanalytics.com
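
The code fragments on slides 16 through 18 are flattened in the transcript above; the sketch below puts them together. It is not the exact webinar code: it assumes airData is an XDF data source (or data frame) containing the airline columns used in the formula; the ConvertToDecimalTime helper named on slide 18 is not defined there, so its HHMM-to-decimal-hours idea is written inline; and the response uses the Late variable created by the transformation rather than the inline ArrDelay > 15 from slide 16.

library(RevoScaleR)

# Fit the logit from slide 16, supplying the slide-18 transformations directly
# in the analysis call. Each transformation is ordinary R code that is
# evaluated on one chunk of rows at a time.
logitFit <- rxLogit(
  Late ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime),
  data = airData,
  transforms = list(
    DepTime = (DepTime %/% 100) + (DepTime %% 100) / 60,  # HHMM -> decimal hours
    Late    = ArrDelay > 15                               # logical response
  )
)
summary(logitFit)

# Per slide 17, the same call runs on a cluster after changing only the
# compute context, e.g. rxOptions(computeContext = myClust), where myClust
# is a previously created cluster compute context object.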
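
Slide 29 describes the external memory mean as four tasks. The base-R sketch below follows that recipe for a CSV file read in fixed-size blocks, so only one chunk of rows is ever in memory; the file name and column name in the example call are placeholders.

chunkedMean <- function(csvFile, column, chunkRows = 100000) {
  con <- file(csvFile, open = "r")
  on.exit(close(con))

  # Initialization task
  total <- 0
  count <- 0

  # Read the header line once so every chunk can be parsed with the same names.
  colNames <- strsplit(readLines(con, n = 1), ",")[[1]]

  repeat {
    # Process data task: read and summarize one block of rows.
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunkRows, col.names = colNames),
      error = function(e) NULL)            # NULL once the file is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) break
    x <- chunk[[column]]

    # Update results task: fold this chunk's totals into the running totals.
    total <- total + sum(x, na.rm = TRUE)
    count <- count + sum(!is.na(x))
  }

  # Process results task
  total / count
}

# Example call with a hypothetical file: chunkedMean("airline.csv", "ArrDelay")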
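
Slides 30 and 31 describe the four virtual functions a RevoScaleR PEMA implements and mention a planned R-level implementation. The sketch below is not RevoScaleR's actual framework; it is a plain-R illustration of the same structure, using the mean from slide 29 as the analytic and lapply as a stand-in for the threading and inter-node communication layers.

# The names Initialize/ProcessData/UpdateResults/ProcessResults match the
# slide; everything else is a made-up example.
meanPema <- list(
  Initialize     = function() list(total = 0, count = 0),
  ProcessData    = function(state, x) {                    # one chunk of the variable
    state$total <- state$total + sum(x, na.rm = TRUE)
    state$count <- state$count + sum(!is.na(x))
    state
  },
  UpdateResults  = function(a, b) list(total = a$total + b$total,
                                       count = a$count + b$count),
  ProcessResults = function(state) state$total / state$count
)

# Driver: each chunk (and, in principle, each core or node) runs ProcessData
# independently; UpdateResults then folds the partial results together.
runPema <- function(pema, chunks) {
  partials <- lapply(chunks, function(chunk) pema$ProcessData(pema$Initialize(), chunk))
  combined <- Reduce(pema$UpdateResults, partials)
  pema$ProcessResults(combined)
}

# Toy check: split a vector into 4 chunks and compare with mean().
x <- rnorm(1e6)
chunks <- split(x, rep(1:4, length.out = length(x)))
all.equal(runPema(meanPema, chunks), mean(x))   # TRUE

Because ProcessData touches only its own chunk and UpdateResults only combines two partial results, the chunks can be handed to different cores or nodes and folded together afterwards, which is the property slide 24 calls parallelizing "automatically".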
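
The functions listed on slide 42 are the exported RevoScaleR analysis routines; the calls below are illustrative usage against the same assumed airData source. The formulas and variables are examples, not the webinar's benchmark settings, and the categorical variables are assumed to be stored as factors.

rxSummary(~ ArrDelay + DayOfWeek, data = airData)          # descriptive statistics
rxCube(ArrDelay ~ DayOfWeek, data = airData)               # cube of means by factor level
rxCrossTabs(~ DayOfWeek : UniqueCarrier, data = airData)   # cross tabulation
linFit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airData)   # linear regression
rxPredict(linFit, data = airData)                          # predictions / scoring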
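
Slide 47 describes a distributed, two-pass import of the 22 airline CSV files. The loop below is a much simpler, single-node sketch of the first pass only, assuming rxImport as the import function; the file names are hypothetical and the second pass that recodes factors whose levels differ is not shown.

csvFiles <- sprintf("airline_%d.csv", 1987:2008)   # hypothetical input file names
xdfFile  <- "airline.xdf"

for (i in seq_along(csvFiles)) {
  rxImport(inData = csvFiles[i], outFile = xdfFile,
           append = if (i == 1) "none" else "rows",  # create the file, then append rows
           overwrite = (i == 1))
}
rxGetInfo(xdfFile, getVarInfo = TRUE)   # inspect rows, columns, and factor levels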