SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
Scalable Data
     Analysis in R


Lee E. Edlefsen
Chief Scientist
UserR! 2011


                     1
Introduction                                   Revolution Confidential




 Our ability to collect and store data has rapidly
been outpacing our ability to analyze it
 We need scalable data analysis software
 R is the ideal platform for such software:
universal data language, easy to add new
functionality, powerful, flexible, forgiving
 I will discuss the approach to scalability we have
taken at Revolution Analytics with our package
RevoScaleR
                                                                 2
RevoScaleR package                         Revolution Confidential




 Part of Revolution R Enterprise
 Provides data management and data
  analysis functionality
 Scales from small to huge data
 Is internally threaded to use multiple cores
 Distributes computations across multiple
  computers (in version 5.0 beta on Windows)
 Revolution R Enterprise is free for academic
  use
                 Revolution R Enterprise                     3
Overview of my talk                           Revolution Confidential




 What do I mean by scalability?
 Why bother?
 Revolution‟s approach to scalability:
     R code stays the same as much as possible
     “Chunking” – operate on chunks of data
     Parallel external memory algorithms
     Implementation issues
 Benchmarks

                    Revolution R Enterprise                     4
Scalability                                Revolution Confidential




 From small in-memory data.frames to multi-
  terabyte data sets distributed across space
  and even time
 From single cores on a single computer to
  multiple cores on multiple computers/nodes
 From local hardware to remote clouds
 For both data management and data
  analysis

                 Revolution R Enterprise                     5
Keys to scalability                         Revolution Confidential




 Most importantly, must be able to process
  more data than can fit into memory on one
  computer at one time
 Process data in “chunks”
 Up to a point, the bigger the chunk the better
   Allows faster sequential reads from disk
   Minimizes thread overhead in memory
   Takes maximum advantage of R‟s vectorized
    code
 Requires “external memory algorithms”
                  Revolution R Enterprise                     6
Huge data sets are becoming common            Revolution Confidential




 Digital data is not only cheaper to store, it is
  cheaper to collect
   More data is collected digitally
   Increasingly more data of all types is being put
    directly into data bases
 It is easier now to merge and clean data, and
  there is a big and growing market for such
  data bases


                    Revolution R Enterprise                     7
Huge benefits to huge data                  Revolution Confidential




 More information, more to be learned
 Variables and relationships can be visualized
  and analyzed in much greater detail
 Can allow the data to speak for itself; can
  relax or eliminate assumptions
 Can get better predictions and better
  understandings of effects

                  Revolution R Enterprise                     8
Two huge problems: capacity and speed          Revolution Confidential




 Capacity: problems handling the size of data
  sets or models
   Data too big to fit into memory
   Even if it can fit, there are limits on what can be
    done
   Even simple data management can be
    extremely challenging
 Speed: even without a capacity limit,
  computation may be too slow to be useful

                    9
We are currently incapable of analyzing a
lot of the data we have                      Revolution Confidential




 The most commonly-used statistical software
  tools either fail completely or are too slow to
  be useful on huge data sets
 In many ways we are back where we were in
  the „70s and „80‟s in terms of ability to handle
  common data sizes
 Fortunately, this is changing


                   Revolution R Enterprise                    10
Requires software solutions                Revolution Confidential




 For decades the rising tide of technology has
  allowed the same data analysis code to run
  faster and on bigger data sets
 That happy era has ended
 The size of data sets is increasing much
  more rapidly than the speed of single cores,
  of I/O, and of RAM
 Need software that can use multiple cores,
  multiple hard drives, and multiple computers

                 Revolution R Enterprise                    11
High Performance Analytics: HPA                 Revolution Confidential




 HPA is HPC + Data
 High Performance Computing is CPU centric
   Lots of processing on small amounts of data
   Focus is on cores
 High Performance Analytics is data centric
   Less processing per amount of data
   Focus is on feeding data to the cores
      On disk I/O
      On efficient threading, data management in RAM

                     Revolution R Enterprise                     12
Revolution’s approach to HPA and
scalability in the RevoScaleR package        Revolution Confidential




 User interface
   Set of R functions (but mostly implemented in
    C++)
   Keep things as familiar as possible
 Capacity and speed
   Handle data in large “chunks” – blocks of rows
    and columns
   Use parallel external memory algorithms


                   Revolution R Enterprise                    13
Some interface design goals                 Revolution Confidential




 Should be able to run the same analysis
  code on a small data.frame in memory and
  on a huge distributed data file on a remote
  cluster
 Should be able to do data transformations
  using standard R language on standard R
  objects
 Should be able to use the R formula
  language for analysis
                  Revolution R Enterprise                    14
Sample code for logit on laptop           Revolution Confidential




# # Standard formula language
# # Allow transformations in the formula or
# in a function or a list of expressions
# # Row selections also allowed (not shown)

rxLogit(ArrDelay>15 ~ Origin + Year +
  Month + DayOfWeek + UniqueCarrier +
  F(CRSDepTime), data=airData)

                Revolution R Enterprise                    15
Sample code for logit on a cluster       Revolution Confidential




# Just change the “compute context”
rxOptions(computeContext = myClust)

# Otherwise, the code is the same
rxLogit(ArrDelay>15 ~ Origin + Year +
  Month + DayOfWeek + UniqueCarrier +
  F(CRSDepTime), data=airData)


               Revolution R Enterprise                    16
Sample data transformation code            Revolution Confidential




# # Standard R code on standard R objects
# # Function or a list of expressions
# # In separate data step or in analysis call

 transforms <- list(
DepTime = ConvertToDecimalTime(DepTime),
Late = ArrDelay > 15
)
                 Revolution R Enterprise                    17
Chunking                                    Revolution Confidential




 Operate blocks of rows for selected columns
 Huge data can be processed a chunk at a
  time in a fixed amount of RAM
 Bigger is better up to a point
   Makes it possible to take maximum advantage of
    disk bandwidth
   Minimizes threading overhead
   Can take maximum advantage of R‟s vectorized
    code

                  Revolution R Enterprise                    18
The basis for a solution for capacity, speed,
distributed and streaming data – PEMA’s Revolution Confidential




 Parallel external memory algorithms
  (PEMA’s) allow solution of both capacity and
  speed problems, and can deal with
  distributed and streaming data
 External memory algorithms are those that
  allow computations to be split into pieces so
  that not all data has to be in memory at one
  time
 It is possible to “automatically” parallelize
  and distribute such algorithms
                  19
External memory algorithms are widely
available                                    Revolution Confidential




 Some statistics packages originating in the 70‟s
  and 80‟s, such as SAS and Gauss, were based
  almost entirely on external memory algorithms
 Examples of external memory algorithms:
   data management (transformations, new variables,
    subsets, sorts, merges, converting to factors)
   descriptive statistics, cross tabulations
   linear models, glm, mixed effects
   prediction/scoring
   many maximum likelihood and other optimization
    algorithms
   many clustering procedures, tree models

                   20
Parallel algorithms are widely available
                                           Revolution Confidential


   Parallel algorithms are those that allow
    different portions of a computation to be
    done “at the same time” using different
    cores
   There has been an lot of research on
    parallel computing over the past 20-30
    years, and many helpful tools are now
    available



                   21
Not much literature on PEMA’s               Revolution Confidential




 Unfortunately, there is not much literature on
  Parallel External Memory Algorithms,
  especially for statistical computations
 In what follows, I outline an approach to
  PEMA‟s for doing statistical computations




                  Revolution R Enterprise                    22
Parallelizing external memory algorithms Revolution Confidential




 Parallel algorithms split a job into tasks that
  can be run simultaneously
 External memory algorithms split a job into
  tasks that operate on separate blocks data
 External memory algorithms usually are run
  sequentially, but with proper care most can be
  parallelized efficiently and “automatically”
 The key is to split the tasks appropriately and
  to minimize inter-thread communication and
  synchronization

                  23
Parallel Computations with
Synchronization and Communication
                                Revolution Confidential




              24
Embarrassingly Parallel Computations
                              Revolution Confidential




             25
Sequential External Memory Computations
                                 Revolution Confidential




               26
Parallel External Memory Computations
                                  Revolution Confidential




               27
Example external memory algorithm for the
mean of a variable                      Revolution Confidential




 Initialization task: total=0, count=0
 Process data task: for each block of x; total
  = sum(x), count=length(x)
 Update results task: combined total =
  sum(all totals), combined count = sum(all
  counts)
 Process results task: mean = combined
  total / combined count

                  28
PEMA’s in RevoScaleR                         Revolution Confidential




 Analytics algorithms are implemented in a
  framework that automatically and efficiently
  parallelizes code
 Key is arranging code in virtual functions:
   Initialize: allows initializations
   ProcessData: process a chunk of data at a time
   UpdateResults: updates one set of results from
    another
   ProcessResults: any final processing

                   Revolution R Enterprise                    29
PEMA’s in RevoScaleR (2)                    Revolution Confidential




 Analytic code is completely independent of:
   Data path code that feeds data to ProcessData
   Threading code for using multiple cores
   Inter-computer communication code for
    distributing across nodes
 Communication among machines is highly
  abstracted (currently can use MPI, RPC)
 Implemented in C++ now, but we plan to add
  an implementation in R
                  Revolution R Enterprise                    30
Storing and reading data                    Revolution Confidential




 ScaleR can process data from a variety of
  sources
 Has its own optimized format (XDF) that is
  especially suitable for chunking
 Allows rapid access to blocks of rows for
  selected columns
 The time it takes to read a block of rows is
  essentially independent of the total number
  of rows and columns in the file
                  Revolution R Enterprise                    31
The XDF file format                            Revolution Confidential




 Data is stored in blocks of rows per column
 “Header” information is stored at the end of
  the file
 Allows sequential reads; tens to hundreds of
  thousands of times faster than random reads
 Essentially unlimited in size
   64 bit row indexing per column
   64 bit column indexing (but the practical limit is
    much smaller)
                    Revolution R Enterprise                     32
The XDF file format (2)                     Revolution Confidential




 Allows wide range of storage formats (1 byte
  to 8 byte signed and unsigned integers; 4
  and 8 byte floats; variable length strings)
 Both new rows and new columns can be
  added to a file without having to rewrite the
  file
 Changes to header information (variable
  names, descriptions, factor levels, and so
  on) are extremely fast
                  Revolution R Enterprise                    33
Overview of data path on a computer         Revolution Confidential




 DataSource object reads a chunk of data
  into a DataSet object on I/O thread
 DataSet is given to transformation code
  (data copied to R for R transformations);
  variables and rows may be created, removed
 Transformed DataSet is virtually split across
  computational cores and passed to
  ProcessData methods on different threads
 Any disk output is stored until I/O thread can
  write it to output DataSource
                  Revolution R Enterprise                    34
Handling data in memory                     Revolution Confidential




 Use of appropriate storage format reduces
  space and reduces time to move data in
  memory
 Data copying and conversion is minimized
 For instance, when adding a vector of
  unsigned shorts to a vector of doubles, the
  smaller type is not converted until loaded
  into the CPU


                  Revolution R Enterprise                    35
Use of multiple cores per computer          Revolution Confidential




 Code is internally “threaded” so that inter-
  process communication and data transfer is
  not required
 One core (typically) handles I/O, while the
  other cores process data from the previous
  read
 Data is virtually split across computational
  cores; each core thinks it has its own private
  copy
                  Revolution R Enterprise                    36
RevoScaleR – Multi-Threaded Processing                                                                 Revolution Confidential



                                                             Shared Memory

                                                          Data                  Data                    Data



                                  Core 0                 Core 1                 Core 2                Core n
    Disk                          (Thread 0)             (Thread 1)            (Thread 2)             (Thread n)


                                               Multicore Processor (4, 8, 16+ cores)


                                                                RevoScaleR

•    A RevoScaleR algorithm is provided a data source as input
•    The algorithm loops over data, reading a block at a time. Blocks of data are read by a separate worker thread
     (Thread 0).
•    Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update
     intermediate results objects in memory
•    When all of the data is processed a master results object is created from the intermediate results objects
Use of multiple computers                   Revolution Confidential




 Key to efficiency is minimizing data transfer
  and communication
 Locality of data!
 For PEMA‟s, the master node controls
  computations, telling workers where to get
  data and what computations to do
 Intermediate results on each node are
  aggregated across cores
 Master node gathers all results, checks for
  convergence, and repeats if necessary
                  Revolution R Enterprise                    38
RevoScaleR – Distributed Computing                          Revolution Confidential



              Compute
  Data         Node
 Partition   (RevoScaleR)                  •   Portions of the data source are
                                               made available to each compute
              Compute                          node
  Data         Node
 Partition                                 •   RevoScaleR on the master node
             (RevoScaleR)
                             Master            assigns a task to each compute
                                               node
                             Node
              Compute       (RevoScaleR)
  Data                                     •   Each compute node independently
               Node                            processes its data, and returns it‟s
 Partition   (RevoScaleR)                      intermediate results back to the
                                               master node
              Compute
  Data                                     •   master node aggregates all of the
               Node                            intermediate results from each
 Partition   (RevoScaleR)                      compute node and produces the
                                               final result
Data management capabilities                  Revolution Confidential




   Import from external sources
   Transform and clean variables
   Code and recode factors
   Missing values
   Validation
   Sort (huge data, but not distributed)
   Merge
   Aggregate, summarize
                    Revolution R Enterprise                    40
Analysis algorithms                         Revolution Confidential




 Descriptive statistics (rxSummary)
 Tables and cubes (rxCube, rxCrossTabs)
 Correlations/covariances (rxCovCor, rxCor,
  rxCov, rxSSCP)
 Linear regressions (rxLinMod)
 Logistic regressions (rxLogit)
 K means clustering (rxKmeans)
 Predictions (scoring) (rxPredict)
 Other algorithms are being developed

                  Revolution R Enterprise                    41
Benchmarks using the airline data             Revolution Confidential




 Airline on-time performance data produced by U.S.
  Department of Transportation; used in the ASA
  Data Expo 09
 22 CSV files for the years 1987 – 2008
 123.5 million observations, 29 variables, about 13
  GB
 For benchmarks, sliced and replicated to get files
  with 1 million to about 1.25 billion rows
 5 node commodity cluster (4 cores @ 3.2GHz & 16
  GB RAM per node); Windows HPC Server 2008
                    Revolution R Enterprise                    42
Distributed Import of Airline Data           Revolution Confidential




 Copy the 22 CSV files to the 5 nodes,
  keeping rough balance in sizes
 Two pass import process:
   Pass1: Import/clean/transform/append

   Pass 2: Recode factors whose levels differ

 Import time: about 3 min 20 secs on 5 node
  cluster (about 17 minutes on single node)

                   Revolution R Enterprise                    43
Scalability of RevoScaleR with Rows
Regression, 1 million - 1.1 billion rows, 443 betas
                                              Revolution Confidential

                            (4 core laptop)




                   Revolution R Enterprise                     44
Scalability of RevoScaleR with Nodes
  Regression, 1 billion rows, 443 betas
                                      Revolution Confidential

               (1 to 5 nodes, 4 cores per node)




                  Revolution R Enterprise                45
Scalability of RevoScaleR with Nodes
                                                   Revolution Confidential
Logit, 123.5 million rows, 443 betas, 5 iterations
                (1 to 5 nodes, 4 cores per node)




                   Revolution R Enterprise                          46
Scalability of RevoScaleR with Nodes
Logit, 1 billion rows, 443 betas, 5 iterations Confidential
                                          Revolution

                (1 to 5 nodes, 4 cores per node)




                   Revolution R Enterprise              47
Comparative benchmarks                                Revolution Confidential




 SAS HPA benchmarks:
   Logit, billion rows, 32 nodes, 384 cores, in-memory,
    “just a few” parameters:             80 secs.
   Regression, 50 million rows, 24 nodes, in-memory,
    1,800 parameters:                    42 secs.
 ScaleR: 1 billion rows, 5 nodes, 20 cores ($5K)
     rxLogit, 7 parameters:                   43.4 secs
     rxLinMod, 7 parameters:                  4.1 secs
     rxLinMod, 1,848 params:                  11.9 secs
     rxLogit, 1,848 params:                   111.5 secs
     rxLinMod, 13,537 params:                 89.8 secs.
                     Revolution R Enterprise                           48
Contact information             Revolution Confidential




      Lee Edlefsen
lee@revolutionanalytics.com




        Revolution R Enterprise                    49

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesKelly Technologies
 
Ten Reasons Why Netezza Professionals Should Consider Greenplum
Ten Reasons Why Netezza Professionals Should Consider GreenplumTen Reasons Why Netezza Professionals Should Consider Greenplum
Ten Reasons Why Netezza Professionals Should Consider GreenplumVMware Tanzu
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopRevolution Analytics
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview EMC
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobsBill Jacobs
 
Lecture 24
Lecture 24Lecture 24
Lecture 24Shani729
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems divjeev
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionRevolution Analytics
 
What is the future of etl tools like ab initio
What is the future of etl tools like ab initioWhat is the future of etl tools like ab initio
What is the future of etl tools like ab initiomaxonlinetr
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mppDiana Patricia Rey Cabra
 
Ogf2008 Grid Data Caching
Ogf2008 Grid Data CachingOgf2008 Grid Data Caching
Ogf2008 Grid Data CachingJags Ramnarayan
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data Geoffrey Fox
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsCloudera, Inc.
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Daniel Abadi
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Coursejimliddle
 
In memory big data management and processing a survey
In memory big data management and processing a surveyIn memory big data management and processing a survey
In memory big data management and processing a surveyredpel dot com
 

Was ist angesagt? (20)

Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
Ten Reasons Why Netezza Professionals Should Consider Greenplum
Ten Reasons Why Netezza Professionals Should Consider GreenplumTen Reasons Why Netezza Professionals Should Consider Greenplum
Ten Reasons Why Netezza Professionals Should Consider Greenplum
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Greenplum Database Overview
Greenplum Database Overview Greenplum Database Overview
Greenplum Database Overview
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Big data analytics on teradata with revolution r enterprise bill jacobs
Big data analytics on teradata with revolution r enterprise   bill jacobsBig data analytics on teradata with revolution r enterprise   bill jacobs
Big data analytics on teradata with revolution r enterprise bill jacobs
 
Teradata training
Teradata trainingTeradata training
Teradata training
 
Lecture 24
Lecture 24Lecture 24
Lecture 24
 
Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems Magic quadrant for data warehouse database management systems
Magic quadrant for data warehouse database management systems
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
What is the future of etl tools like ab initio
What is the future of etl tools like ab initioWhat is the future of etl tools like ab initio
What is the future of etl tools like ab initio
 
Massive parallel processing database systems mpp
Massive parallel processing database systems mppMassive parallel processing database systems mpp
Massive parallel processing database systems mpp
 
Ogf2008 Grid Data Caching
Ogf2008 Grid Data CachingOgf2008 Grid Data Caching
Ogf2008 Grid Data Caching
 
High Performance Computing and Big Data
High Performance Computing and Big Data High Performance Computing and Big Data
High Performance Computing and Big Data
 
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop ProfessionalsBest Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
Best Practices for the Hadoop Data Warehouse: EDW 101 for Hadoop Professionals
 
Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13Shared slides-edbt-keynote-03-19-13
Shared slides-edbt-keynote-03-19-13
 
Big data
Big dataBig data
Big data
 
Waters Grid & HPC Course
Waters Grid & HPC CourseWaters Grid & HPC Course
Waters Grid & HPC Course
 
In memory big data management and processing a survey
In memory big data management and processing a surveyIn memory big data management and processing a survey
In memory big data management and processing a survey
 

Ähnlich wie Scalable Data Analysis in R with Revolution's RevoScaleR Package

Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Revolution Analytics
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAlex Palamides
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudJaipaul Agonus
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationRevolution Analytics
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisRevolution Analytics
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Revolution Analytics
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document usefulssuser3c3f88
 
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Principled Technologies
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014Kenneth Igiri
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 

Ähnlich wie Scalable Data Analysis in R with Revolution's RevoScaleR Package (20)

Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloudHive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud
 
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical ComputationModel Building with RevoScaleR: Using R and Hadoop for Statistical Computation
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Revolution Analytics Podcast
Revolution Analytics PodcastRevolution Analytics Podcast
Revolution Analytics Podcast
 
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data AnalysisNew Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
New Features in Revolution R Enterprise 5.0 to Support Scalable Data Analysis
 
Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics? Are You Ready for Big Data Big Analytics?
Are You Ready for Big Data Big Analytics?
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
Run compute-intensive Apache Hadoop big data workloads faster with Dell EMC P...
 
Big data no company
Big data   no companyBig data   no company
Big data no company
 
Big Data Basic Concepts | Presented in 2014
Big Data Basic Concepts  | Presented in 2014Big Data Basic Concepts  | Presented in 2014
Big Data Basic Concepts | Presented in 2014
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 

Mehr von rusersla

LA R meetup - Nov 2013 - Eric Klusman
LA R meetup - Nov 2013 - Eric KlusmanLA R meetup - Nov 2013 - Eric Klusman
LA R meetup - Nov 2013 - Eric Klusmanrusersla
 
useR2011 - Whitcher
useR2011 - WhitcheruseR2011 - Whitcher
useR2011 - Whitcherrusersla
 
useR2011 - Rougier
useR2011 - RougieruseR2011 - Rougier
useR2011 - Rougierrusersla
 
useR2011 - Huber
useR2011 - HuberuseR2011 - Huber
useR2011 - Huberrusersla
 
useR2011 - Gromping
useR2011 - Gromping useR2011 - Gromping
useR2011 - Gromping rusersla
 
Los Angeles R users group - July 12 2011 - Part 1
Los Angeles R users group - July 12 2011 - Part 1Los Angeles R users group - July 12 2011 - Part 1
Los Angeles R users group - July 12 2011 - Part 1rusersla
 
Los Angeles R users group - July 12 2011 - Part 2
Los Angeles R users group - July 12 2011 - Part 2Los Angeles R users group - July 12 2011 - Part 2
Los Angeles R users group - July 12 2011 - Part 2rusersla
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2rusersla
 
Los Angeles R users group - Dec 14 2010 - Part 1
Los Angeles R users group - Dec 14 2010 - Part 1Los Angeles R users group - Dec 14 2010 - Part 1
Los Angeles R users group - Dec 14 2010 - Part 1rusersla
 
Los Angeles R users group - Dec 14 2010 - Part 3
Los Angeles R users group - Dec 14 2010 - Part 3Los Angeles R users group - Dec 14 2010 - Part 3
Los Angeles R users group - Dec 14 2010 - Part 3rusersla
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2rusersla
 

Mehr von rusersla (11)

LA R meetup - Nov 2013 - Eric Klusman
LA R meetup - Nov 2013 - Eric KlusmanLA R meetup - Nov 2013 - Eric Klusman
LA R meetup - Nov 2013 - Eric Klusman
 
useR2011 - Whitcher
useR2011 - WhitcheruseR2011 - Whitcher
useR2011 - Whitcher
 
useR2011 - Rougier
useR2011 - RougieruseR2011 - Rougier
useR2011 - Rougier
 
useR2011 - Huber
useR2011 - HuberuseR2011 - Huber
useR2011 - Huber
 
useR2011 - Gromping
useR2011 - Gromping useR2011 - Gromping
useR2011 - Gromping
 
Los Angeles R users group - July 12 2011 - Part 1
Los Angeles R users group - July 12 2011 - Part 1Los Angeles R users group - July 12 2011 - Part 1
Los Angeles R users group - July 12 2011 - Part 1
 
Los Angeles R users group - July 12 2011 - Part 2
Los Angeles R users group - July 12 2011 - Part 2Los Angeles R users group - July 12 2011 - Part 2
Los Angeles R users group - July 12 2011 - Part 2
 
Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2Los Angeles R users group - Nov 17 2010 - Part 2
Los Angeles R users group - Nov 17 2010 - Part 2
 
Los Angeles R users group - Dec 14 2010 - Part 1
Los Angeles R users group - Dec 14 2010 - Part 1Los Angeles R users group - Dec 14 2010 - Part 1
Los Angeles R users group - Dec 14 2010 - Part 1
 
Los Angeles R users group - Dec 14 2010 - Part 3
Los Angeles R users group - Dec 14 2010 - Part 3Los Angeles R users group - Dec 14 2010 - Part 3
Los Angeles R users group - Dec 14 2010 - Part 3
 
Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2Los Angeles R users group - Dec 14 2010 - Part 2
Los Angeles R users group - Dec 14 2010 - Part 2
 

Kürzlich hochgeladen

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 

Kürzlich hochgeladen (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 

Scalable Data Analysis in R with Revolution's RevoScaleR Package

  • 1. Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1
  • 2. Introduction Revolution Confidential  Our ability to collect and store data has rapidly been outpacing our ability to analyze it  We need scalable data analysis software  R is the ideal platform for such software: universal data language, easy to add new functionality, powerful, flexible, forgiving  I will discuss the approach to scalability we have taken at Revolution Analytics with our package RevoScaleR 2
  • 3. RevoScaleR package Revolution Confidential  Part of Revolution R Enterprise  Provides data management and data analysis functionality  Scales from small to huge data  Is internally threaded to use multiple cores  Distributes computations across multiple computers (in version 5.0 beta on Windows)  Revolution R Enterprise is free for academic use Revolution R Enterprise 3
  • 4. Overview of my talk Revolution Confidential  What do I mean by scalability?  Why bother?  Revolution‟s approach to scalability:  R code stays the same as much as possible  “Chunking” – operate on chunks of data  Parallel external memory algorithms  Implementation issues  Benchmarks Revolution R Enterprise 4
  • 5. Scalability Revolution Confidential  From small in-memory data.frames to multi- terabyte data sets distributed across space and even time  From single cores on a single computer to multiple cores on multiple computers/nodes  From local hardware to remote clouds  For both data management and data analysis Revolution R Enterprise 5
  • 6. Keys to scalability Revolution Confidential  Most importantly, must be able to process more data than can fit into memory on one computer at one time  Process data in “chunks”  Up to a point, the bigger the chunk the better  Allows faster sequential reads from disk  Minimizes thread overhead in memory  Takes maximum advantage of R‟s vectorized code  Requires “external memory algorithms” Revolution R Enterprise 6
  • 7. Huge data sets are becoming common Revolution Confidential  Digital data is not only cheaper to store, it is cheaper to collect  More data is collected digitally  Increasingly more data of all types is being put directly into data bases  It is easier now to merge and clean data, and there is a big and growing market for such data bases Revolution R Enterprise 7
  • 8. Huge benefits to huge data Revolution Confidential  More information, more to be learned  Variables and relationships can be visualized and analyzed in much greater detail  Can allow the data to speak for itself; can relax or eliminate assumptions  Can get better predictions and better understandings of effects Revolution R Enterprise 8
  • 9. Two huge problems: capacity and speed Revolution Confidential  Capacity: problems handling the size of data sets or models  Data too big to fit into memory  Even if it can fit, there are limits on what can be done  Even simple data management can be extremely challenging  Speed: even without a capacity limit, computation may be too slow to be useful 9
  • 10. We are currently incapable of analyzing a lot of the data we have Revolution Confidential  The most commonly-used statistical software tools either fail completely or are too slow to be useful on huge data sets  In many ways we are back where we were in the „70s and „80‟s in terms of ability to handle common data sizes  Fortunately, this is changing Revolution R Enterprise 10
  • 11. Requires software solutions Revolution Confidential  For decades the rising tide of technology has allowed the same data analysis code to run faster and on bigger data sets  That happy era has ended  The size of data sets is increasing much more rapidly than the speed of single cores, of I/O, and of RAM  Need software that can use multiple cores, multiple hard drives, and multiple computers Revolution R Enterprise 11
  • 12. High Performance Analytics: HPA Revolution Confidential  HPA is HPC + Data  High Performance Computing is CPU centric  Lots of processing on small amounts of data  Focus is on cores  High Performance Analytics is data centric  Less processing per amount of data  Focus is on feeding data to the cores  On disk I/O  On efficient threading, data management in RAM Revolution R Enterprise 12
  • 13. Revolution’s approach to HPA and scalability in the RevoScaleR package Revolution Confidential  User interface  Set of R functions (but mostly implemented in C++)  Keep things as familiar as possible  Capacity and speed  Handle data in large “chunks” – blocks of rows and columns  Use parallel external memory algorithms Revolution R Enterprise 13
  • 14. Some interface design goals Revolution Confidential  Should be able to run the same analysis code on a small data.frame in memory and on a huge distributed data file on a remote cluster  Should be able to do data transformations using standard R language on standard R objects  Should be able to use the R formula language for analysis Revolution R Enterprise 14
  • 15. Sample code for logit on laptop Revolution Confidential # # Standard formula language # # Allow transformations in the formula or # in a function or a list of expressions # # Row selections also allowed (not shown) rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 15
  • 16. Sample code for logit on a cluster Revolution Confidential # Just change the “compute context” rxOptions(computeContext = myClust) # Otherwise, the code is the same rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 16
  • 17. Sample data transformation code Revolution Confidential # # Standard R code on standard R objects # # Function or a list of expressions # # In separate data step or in analysis call transforms <- list( DepTime = ConvertToDecimalTime(DepTime), Late = ArrDelay > 15 ) Revolution R Enterprise 17
  • 18. Chunking Revolution Confidential  Operate blocks of rows for selected columns  Huge data can be processed a chunk at a time in a fixed amount of RAM  Bigger is better up to a point  Makes it possible to take maximum advantage of disk bandwidth  Minimizes threading overhead  Can take maximum advantage of R‟s vectorized code Revolution R Enterprise 18
  • 19. The basis for a solution for capacity, speed, distributed and streaming data – PEMA’s Revolution Confidential  Parallel external memory algorithms (PEMA’s) allow solution of both capacity and speed problems, and can deal with distributed and streaming data  External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time  It is possible to “automatically” parallelize and distribute such algorithms 19
  • 20. External memory algorithms are widely available Revolution Confidential  Some statistics packages originating in the 70‟s and 80‟s, such as SAS and Gauss, were based almost entirely on external memory algorithms  Examples of external memory algorithms:  data management (transformations, new variables, subsets, sorts, merges, converting to factors)  descriptive statistics, cross tabulations  linear models, glm, mixed effects  prediction/scoring  many maximum likelihood and other optimization algorithms  many clustering procedures, tree models 20
  • 21. Parallel algorithms are widely available Revolution Confidential  Parallel algorithms are those that allow different portions of a computation to be done “at the same time” using different cores  There has been an lot of research on parallel computing over the past 20-30 years, and many helpful tools are now available 21
  • 22. Not much literature on PEMA’s Revolution Confidential  Unfortunately, there is not much literature on Parallel External Memory Algorithms, especially for statistical computations  In what follows, I outline an approach to PEMA‟s for doing statistical computations Revolution R Enterprise 22
  • 23. Parallelizing external memory algorithms Revolution Confidential  Parallel algorithms split a job into tasks that can be run simultaneously  External memory algorithms split a job into tasks that operate on separate blocks data  External memory algorithms usually are run sequentially, but with proper care most can be parallelized efficiently and “automatically”  The key is to split the tasks appropriately and to minimize inter-thread communication and synchronization 23
  • 24. Parallel Computations with Synchronization and Communication Revolution Confidential 24
  • 25. Embarrassingly Parallel Computations Revolution Confidential 25
  • 26. Sequential External Memory Computations Revolution Confidential 26
  • 27. Parallel External Memory Computations Revolution Confidential 27
  • 28. Example external memory algorithm for the mean of a variable Revolution Confidential  Initialization task: total=0, count=0  Process data task: for each block of x; total = sum(x), count=length(x)  Update results task: combined total = sum(all totals), combined count = sum(all counts)  Process results task: mean = combined total / combined count 28
  • 29. PEMA’s in RevoScaleR Revolution Confidential  Analytics algorithms are implemented in a framework that automatically and efficiently parallelizes code  Key is arranging code in virtual functions:  Initialize: allows initializations  ProcessData: process a chunk of data at a time  UpdateResults: updates one set of results from another  ProcessResults: any final processing Revolution R Enterprise 29
  • 30. PEMA’s in RevoScaleR (2) Revolution Confidential  Analytic code is completely independent of:  Data path code that feeds data to ProcessData  Threading code for using multiple cores  Inter-computer communication code for distributing across nodes  Communication among machines is highly abstracted (currently can use MPI, RPC)  Implemented in C++ now, but we plan to add an implementation in R Revolution R Enterprise 30
  • 31. Storing and reading data Revolution Confidential  ScaleR can process data from a variety of sources  Has its own optimized format (XDF) that is especially suitable for chunking  Allows rapid access to blocks of rows for selected columns  The time it takes to read a block of rows is essentially independent of the total number of rows and columns in the file Revolution R Enterprise 31
  • 32. The XDF file format Revolution Confidential  Data is stored in blocks of rows per column  “Header” information is stored at the end of the file  Allows sequential reads; tens to hundreds of thousands of times faster than random reads  Essentially unlimited in size  64 bit row indexing per column  64 bit column indexing (but the practical limit is much smaller) Revolution R Enterprise 32
  • 33. The XDF file format (2) Revolution Confidential  Allows wide range of storage formats (1 byte to 8 byte signed and unsigned integers; 4 and 8 byte floats; variable length strings)  Both new rows and new columns can be added to a file without having to rewrite the file  Changes to header information (variable names, descriptions, factor levels, and so on) are extremely fast Revolution R Enterprise 33
  • 34. Overview of data path on a computer Revolution Confidential  DataSource object reads a chunk of data into a DataSet object on I/O thread  DataSet is given to transformation code (data copied to R for R transformations); variables and rows may be created, removed  Transformed DataSet is virtually split across computational cores and passed to ProcessData methods on different threads  Any disk output is stored until I/O thread can write it to output DataSource Revolution R Enterprise 34
  • 35. Handling data in memory Revolution Confidential  Use of appropriate storage format reduces space and reduces time to move data in memory  Data copying and conversion is minimized  For instance, when adding a vector of unsigned shorts to a vector of doubles, the smaller type is not converted until loaded into the CPU Revolution R Enterprise 35
  • 36. Use of multiple cores per computer Revolution Confidential  Code is internally “threaded” so that inter- process communication and data transfer is not required  One core (typically) handles I/O, while the other cores process data from the previous read  Data is virtually split across computational cores; each core thinks it has its own private copy Revolution R Enterprise 36
  • 37. RevoScaleR – Multi-Threaded Processing Revolution Confidential Shared Memory Data Data Data Core 0 Core 1 Core 2 Core n Disk (Thread 0) (Thread 1) (Thread 2) (Thread n) Multicore Processor (4, 8, 16+ cores) RevoScaleR • A RevoScaleR algorithm is provided a data source as input • The algorithm loops over data, reading a block at a time. Blocks of data are read by a separate worker thread (Thread 0). • Other worker threads (Threads 1..n) process the data block from the previous iteration of the data loop and update intermediate results objects in memory • When all of the data is processed a master results object is created from the intermediate results objects
  • 38. Use of multiple computers Revolution Confidential  Key to efficiency is minimizing data transfer and communication  Locality of data!  For PEMA‟s, the master node controls computations, telling workers where to get data and what computations to do  Intermediate results on each node are aggregated across cores  Master node gathers all results, checks for convergence, and repeats if necessary Revolution R Enterprise 38
  • 39. RevoScaleR – Distributed Computing Revolution Confidential Compute Data Node Partition (RevoScaleR) • Portions of the data source are made available to each compute Compute node Data Node Partition • RevoScaleR on the master node (RevoScaleR) Master assigns a task to each compute node Node Compute (RevoScaleR) Data • Each compute node independently Node processes its data, and returns it‟s Partition (RevoScaleR) intermediate results back to the master node Compute Data • master node aggregates all of the Node intermediate results from each Partition (RevoScaleR) compute node and produces the final result
  • 40. Data management capabilities Revolution Confidential  Import from external sources  Transform and clean variables  Code and recode factors  Missing values  Validation  Sort (huge data, but not distributed)  Merge  Aggregate, summarize Revolution R Enterprise 40
  • 41. Analysis algorithms Revolution Confidential  Descriptive statistics (rxSummary)  Tables and cubes (rxCube, rxCrossTabs)  Correlations/covariances (rxCovCor, rxCor, rxCov, rxSSCP)  Linear regressions (rxLinMod)  Logistic regressions (rxLogit)  K means clustering (rxKmeans)  Predictions (scoring) (rxPredict)  Other algorithms are being developed Revolution R Enterprise 41
  • 42. Benchmarks using the airline data Revolution Confidential  Airline on-time performance data produced by U.S. Department of Transportation; used in the ASA Data Expo 09  22 CSV files for the years 1987 – 2008  123.5 million observations, 29 variables, about 13 GB  For benchmarks, sliced and replicated to get files with 1 million to about 1.25 billion rows  5 node commodity cluster (4 cores @ 3.2GHz & 16 GB RAM per node); Windows HPC Server 2008 Revolution R Enterprise 42
  • 43. Distributed Import of Airline Data Revolution Confidential  Copy the 22 CSV files to the 5 nodes, keeping rough balance in sizes  Two pass import process:  Pass1: Import/clean/transform/append  Pass 2: Recode factors whose levels differ  Import time: about 3 min 20 secs on 5 node cluster (about 17 minutes on single node) Revolution R Enterprise 43
  • 44. Scalability of RevoScaleR with Rows Regression, 1 million - 1.1 billion rows, 443 betas Revolution Confidential (4 core laptop) Revolution R Enterprise 44
  • 45. Scalability of RevoScaleR with Nodes Regression, 1 billion rows, 443 betas Revolution Confidential (1 to 5 nodes, 4 cores per node) Revolution R Enterprise 45
  • 46. Scalability of RevoScaleR with Nodes Revolution Confidential Logit, 123.5 million rows, 443 betas, 5 iterations (1 to 5 nodes, 4 cores per node) Revolution R Enterprise 46
  • 47. Scalability of RevoScaleR with Nodes Logit, 1 billion rows, 443 betas, 5 iterations Confidential Revolution (1 to 5 nodes, 4 cores per node) Revolution R Enterprise 47
  • 48. Comparative benchmarks Revolution Confidential  SAS HPA benchmarks:  Logit, billion rows, 32 nodes, 384 cores, in-memory, “just a few” parameters: 80 secs.  Regression, 50 million rows, 24 nodes, in-memory, 1,800 parameters: 42 secs.  ScaleR: 1 billion rows, 5 nodes, 20 cores ($5K)  rxLogit, 7 parameters: 43.4 secs  rxLinMod, 7 parameters: 4.1 secs  rxLinMod, 1,848 params: 11.9 secs  rxLogit, 1,848 params: 111.5 secs  rxLinMod, 13,537 params: 89.8 secs. Revolution R Enterprise 48
  • 49. Contact information Revolution Confidential Lee Edlefsen lee@revolutionanalytics.com Revolution R Enterprise 49