Model Building with RevoScaleR
Using R and Hadoop for Statistical Computation
Strata and Hadoop World 2013

Joseph Rickert, Revolution Analytics
Model Building with RevoScaleR
Agenda:
The three realms of data
What is RevoScaleR?
RevoScaleR working beside Hadoop
RevoScaleR running within Hadoop
Run some code
The 3 Realms of Data

Bridging the gaps between architectures
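The 3 Realms of Data
[Figure: the three realms of data. Vertical axis: number of rows (ticks at 10^6, 10^11 and >10^12). Horizontal axis: architectural complexity. Regions labeled: data in memory; data in a file (the realm of “chunking”); data in multiple files (the realm of massive data).]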
RevoScaleR

Revolution R Enterprise
RevoScaleR
• An R package that ships exclusively with Revolution R Enterprise
• Implements Parallel External Memory Algorithms (PEMAs)
• Provides functions to:
– Import, clean, explore and transform data
– Perform statistical analysis and predictive analytics
– Enable distributed computing
• Scales from small local data to huge distributed data
• The same code works on small and big data, and on workstation, server, cluster, Hadoop
[Diagram: Revolution R Enterprise components: DeployR, ConnectR, RevoScaleR, DistributedR]
Parallel External Memory Algorithms (PEMAs)
• Built on a platform (DistributedR) that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms
• Process data a chunk at a time, in parallel across cores and nodes:
1. Initialize
2. Process Chunk
3. Aggregate
4. Finalize
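The four PEMA steps listed above can be sketched in plain base R. This is only an illustration of the pattern, not RevoScaleR code: the function names (initialize, process_chunk, aggregate_results, finalize), the simulated data and the chunk size are all assumptions made for the example, and the statistic computed (mean and variance) was chosen because it needs only one pass.

```r
# Hypothetical illustration in base R; not RevoScaleR code.

# 1. Initialize: an empty accumulator of sufficient statistics.
initialize <- function() list(n = 0, sum = 0, sumsq = 0)

# 2. Process Chunk: summarize one chunk independently of all others.
process_chunk <- function(chunk) {
  list(n = length(chunk), sum = sum(chunk), sumsq = sum(chunk^2))
}

# 3. Aggregate: combine two intermediate results (order does not matter).
aggregate_results <- function(a, b) {
  list(n = a$n + b$n, sum = a$sum + b$sum, sumsq = a$sumsq + b$sumsq)
}

# 4. Finalize: turn the accumulated statistics into the final answer.
finalize <- function(acc) {
  m <- acc$sum / acc$n
  list(mean = m, var = (acc$sumsq - acc$n * m^2) / (acc$n - 1))
}

set.seed(42)
x <- rnorm(10000)                                  # simulated data
chunks <- split(x, ceiling(seq_along(x) / 1000))   # ten chunks of 1,000 values

partials <- lapply(chunks, process_chunk)          # independent per-chunk work
result <- finalize(Reduce(aggregate_results, partials, initialize()))
print(result)
```

Because each chunk is summarized independently, the lapply() call is the part that could be handed to parallel workers; only the small intermediate lists need to be combined. The sum-of-squares formula is used here for brevity and can lose precision when the mean is large relative to the spread.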
RevoScaleR PEMAs

Statistical Modeling

Predictive Models
• Covariance, Correlation, Sum of Squares
• Multiple Linear Regression
• Generalized Linear Models:
– All exponential family distributions, Tweedie distribution
– Standard link functions
– User-defined distributions & link functions
• Classification & Regression Trees
• Decision Forests
• Predictions/scoring for models
• Residuals for all models

Data Visualization
• Histogram
• Line Plot
• Lorenz Curve
• ROC Curves

Machine Learning

Variable Selection
• Stepwise Regression
• PCA

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests

Simulation
• Parallel random number generators for Monte Carlo
GLM comparison using in-memory data: glm() and ScaleR’s rxGlm()
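As a rough point of reference for the in-memory side of this comparison, a base R glm() fit looks like the sketch below. The data are synthetic and the variable names (x1, x2, y) and the Poisson family are assumptions chosen for illustration; nothing here reproduces the comparison shown on the slide. The rxGlm() line is left as a comment because it requires Revolution R Enterprise.

```r
# Synthetic in-memory data; variable names are made up for illustration.
set.seed(1)
n <- 5000
df <- data.frame(x1 = rnorm(n), x2 = runif(n))
df$y <- rpois(n, lambda = exp(0.5 + 0.3 * df$x1 - 0.2 * df$x2))

# Base R fit of a Poisson GLM with the default log link.
fit <- glm(y ~ x1 + x2, data = df, family = poisson())
print(coef(fit))

# With Revolution R Enterprise installed, the ScaleR analogue would be
# along these lines (an assumption, not run here):
#   rxGlm(y ~ x1 + x2, data = df, family = poisson())
```

The point of the sketch is that both functions take a model formula, a data argument and a family, so an in-memory comparison can use the same model specification for each.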
PEMAs: Optimized for Performance
• Arbitrarily large number of rows in a fixed amount of memory
• Scales linearly
– with the number of rows
– with the number of nodes
• Scales well
– with the number of cores per node
– with the number of parameters
• Efficient
– Computational algorithms
– Memory management: minimize copying
– File format: fast access by row and column
• Heavy use of C++
• Models
– Pre-analyzed to detect and remove duplicate computations and points of failure (singularities)
– Handle categorical variables efficiently
Write Once. Deploy Anywhere.
• Hadoop: Hortonworks, Cloudera
• EDW: IBM, Teradata
• Clustered Systems: Platform LSF, Microsoft HPC
• Workstations & Servers: Desktop, Server, Linux
• In the Cloud: Microsoft Azure Burst, Amazon AWS
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
RRE in Hadoop
beside or inside
Revolution R Enterprise Architecture
• Use Hadoop for data storage and data preparation
• Use RevoScaleR on a connected server for predictive modeling
• Use Hadoop for model deployment
A Simple Goal: Hadoop As An R Engine.
Run Revolution R Enterprise code in Hadoop without change
Provide RevoScaleR pre-parallelized algorithms
Eliminate:
• The need to “think in MapReduce”
• Data movement
Revolution R Enterprise Architecture
Use RevoScaleR inside Hadoop for:
• Data preparation
• Model building
• Custom small-data parallel programming
• Model deployment
• Late 2013: Big-data predictive models with ScaleR
[Diagram: HDFS (Name Node, five Data Nodes) and MapReduce (Job Tracker, five Task Trackers)]
RRE in Hadoop
[Diagram, shown on two consecutive slides: HDFS (Name Node, five Data Nodes) and MapReduce (Job Tracker, five Task Trackers)]
RevoScaleR on Hadoop
• Each pass through the data is one MapReduce job
• Prediction (scoring), transformation, simulation:
– Map tasks store results in HDFS or return them to the client
• Statistics, model building, visualization:
– Map tasks produce “intermediate result objects” that are aggregated by a Reduce task
– Master process decides if another pass through the data is required
• Data can be cached or stored in XDF binary format for increased speed, especially on iterative algorithms
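The pass structure described above (map tasks emit intermediate result objects, a reduce step aggregates them, and a master decides whether another pass is needed) can be imitated in base R. The sketch below is an assumption-laden illustration, not how RevoScaleR is implemented: in-memory lists stand in for HDFS blocks, the names map_task and reduce_task are hypothetical, and a Newton-Raphson fit of a logistic regression is used as a stand-in for an iterative algorithm.

```r
# Hypothetical base R sketch; plain lists stand in for HDFS blocks.
set.seed(7)
n <- 4000
X <- cbind(1, rnorm(n), rnorm(n))                  # intercept + 2 predictors
beta_true <- c(-0.5, 1.0, -0.75)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))
row_chunk <- ceiling(seq_len(n) / 500)             # eight chunks of 500 rows
chunks <- lapply(split(seq_len(n), row_chunk),
                 function(i) list(X = X[i, , drop = FALSE], y = y[i]))

# "Map": each chunk yields an intermediate result object for the current beta.
map_task <- function(chunk, beta) {
  p <- plogis(drop(chunk$X %*% beta))
  w <- p * (1 - p)
  list(grad = crossprod(chunk$X, chunk$y - p),     # X'(y - p)
       hess = crossprod(chunk$X, chunk$X * w))     # X'WX
}

# "Reduce": sum the intermediate objects from all map tasks.
reduce_task <- function(a, b) {
  list(grad = a$grad + b$grad, hess = a$hess + b$hess)
}

# "Master": one pass per iteration; decide whether another pass is needed.
beta <- rep(0, ncol(X))
passes <- 0
repeat {
  passes <- passes + 1
  total <- Reduce(reduce_task, lapply(chunks, map_task, beta = beta))
  step <- drop(solve(total$hess, total$grad))      # Newton-Raphson update
  beta <- beta + step
  if (max(abs(step)) < 1e-8 || passes >= 25) break
}
print(list(coefficients = beta, passes = passes))
```

Each trip through the repeat loop corresponds to one pass over all chunks, which in the setting described above would be one MapReduce job. Only the small gradient and Hessian objects cross chunk boundaries, which is why iterative algorithms benefit from keeping the data cached between passes.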
Let’s run some code.
Backup slides
Sample code: logit on workstation
# Specify local data source
# (myLocalDataSource is a placeholder for a data source defined elsewhere)
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
          DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay > 15 ~ Origin + Year + Month +
          DayOfWeek + UniqueCarrier + F(CRSDepTime),
        data = airData)
Demo rxLinMod in Hadoop - Launching
Demo rxLinMod in Hadoop - In Progress
Demo rxLinMod in Hadoop - Completed

Editor's notes

  1. Coming soon: a “Beside vs. Inside” architecture slide to precede this one.