SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
High Performance
Predictive Analytics
in R and Hadoop
Presented by:
Mario E. Inchiosa, Ph.D.
US Chief Scientist
Hadoop Summit 2013
June 27, 2013 1
Innovate with R
2
 Most widely used data analysis software
 Used by 2M+ data scientists, statisticians and analysts
 Most powerful statistical programming language
 Flexible, extensible and comprehensive for productivity
 Create beautiful and unique data visualizations
 As seen in New York Times, Twitter and Flowing Data
 Thriving open-source community
 Leading edge of analytics research
 Fills the talent gap
 New graduates prefer R
Download the White Paper
R is Hot
bit.ly/r-is-hot
R is open source and drives analytic innovation but has
some limitations
Bigger 
data sizes 
Speed of 
analysis 
Production 
support
Memory Bound Big Data
Single Threaded Scale out, parallel
processing, high speed
Community Support Commercial
production support
Innovation 
and scale
Innovative –
5000+ packages,
exponential growth
Combines with open
source R packages
where needed
4
Revolution R Enterprise
High Performance, Multi-Platform Analytics Platform
Revolution R EnterpriseRevolution R Enterprise
DeployR
Web Services
Software Development Kit
DeployR
Web Services
Software Development Kit
DevelopR
Integrated Development
Environment
DevelopR
Integrated Development
Environment
ConnectR
High Speed & Direct Connectors
Teradata, Hadooop (HDFS, Hbase), SAS, SPSS, CSV, OBDC
ScaleR
High Performance Big Data Analytics
Platform LSF, MS HPC Server,
MS Azure Burst, SMP Servers
DistributedR
Distributed Computing Framework
Platform LSF, MS HPC Server, MS
Azure Burst
RevoR
Performance Enhanced Open Source R+ CRAN packages
IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst,
Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers
Open Source R
Plus
Revolution Analytics
performance
enhancements
Revolution
Analytics
Value-Add
Components
Providing Power
and Scale to Open
Source R
Our objectives with respect to Hadoop
 Allow our customers to do predictive
analytics as easily in Hadoop as they can
using R on their laptops today
 Scalable and High Performance
 Provide the first enterprise-ready,
commercially supported, full-featured, out-of-
the-box Predictive Analytics suite running in
Hadoop
5
Overview of Revolution and Hadoop
 Revolution currently provides capabilities for
using R and Hadoop together
 RHadoop packages: rmr, rhdfs, rhbase
 RevoScaleR package: Rapidly read and process
data from HDFS
 We are in the process of porting
RevoScaleR to run fully “in Hadoop.” It
currently runs on laptops, workstations,
servers, and MPI-based clusters.
6
RHadoop: Map-Reduce with R
 Unlock data in Hadoop using only the R language
 No need to learn Java, Pig, Python or any other language
7
RevoScaleR Overview
8
 An R package that adds capabilities to R:
 Data Import/Clean/Explore/Transform
 Analytics – Descriptive and Predictive
 Parallel and distributed computing
 Visualization
 Scales from small local data to huge distributed
data
 Scales from laptop to server to cluster to cloud
 Portable – the same code works on small and
big data, and on laptop, server, cluster, Hadoop
Key ways RevoScaleR enhances R
 High Performance Computing (HPC)
functions: Parallel/Distributed computing
 High Performance Analytics (HPA) functions:
Big Data + Parallel/Distributed computing
 XDF file format; rapidly store and extract
data
 Use results from HPA and HPC functions in
other R packages
 Visualization functions
Revolution R Enterprise 9
High Performance Big Data Analytics with
ScaleR
10
Statistical
Tests
Machine
Learning
Simulation
Descriptive
Statistics
Data
Visualization
R Data Step
Predictive
Models
Sampling
ScaleR: High Performance Scalable
Parallel External Memory Algorithms
11
 Data import – Delimited,
Fixed, SAS, SPSS, OBDC
 Variable creation &
transformation
 Recode variables
 Factor variables
 Missing value handling
 Sort
 Merge
 Split
 Aggregate by category
(means, sums)
 Data import – Delimited,
Fixed, SAS, SPSS, OBDC
 Variable creation &
transformation
 Recode variables
 Factor variables
 Missing value handling
 Sort
 Merge
 Split
 Aggregate by category
(means, sums)
 Min / Max
 Mean
 Median (approx.)
 Quantiles (approx.)
 Standard Deviation
 Variance
 Correlation
 Covariance
 Sum of Squares (cross product
matrix for set variables)
 Pairwise Cross tabs
 Risk Ratio & Odds Ratio
 Cross-Tabulation of Data
(standard tables & long form)
 Marginal Summaries of Cross
Tabulations
 Min / Max
 Mean
 Median (approx.)
 Quantiles (approx.)
 Standard Deviation
 Variance
 Correlation
 Covariance
 Sum of Squares (cross product
matrix for set variables)
 Pairwise Cross tabs
 Risk Ratio & Odds Ratio
 Cross-Tabulation of Data
(standard tables & long form)
 Marginal Summaries of Cross
Tabulations
 Chi Square Test
 Kendall Rank Correlation
 Fisher’s Exact Test
 Student’s t-Test
 Chi Square Test
 Kendall Rank Correlation
 Fisher’s Exact Test
 Student’s t-Test
Data Prep, Distillation & Descriptive AnalyticsData Prep, Distillation & Descriptive Analytics
 Subsample (observations &
variables)
 Random Sampling
 Subsample (observations &
variables)
 Random Sampling
R Data Step Statistical Tests
Sampling
Descriptive Statistics
ScaleR: High Performance Scalable
Parallel External Memory Algorithms
12
 Sum of Squares (cross product
matrix for set variables)
 Multiple Linear Regression
 Generalized Linear Models (GLM)
- All exponential family
distributions: binomial, Gaussian,
inverse Gaussian, Poisson,
Tweedie. Standard link functions
including: cauchit, identity, log,
logit, probit. User defined
distributions & link functions.
 Covariance & Correlation
Matrices
 Logistic Regression
 Classification & Regression Trees
 Predictions/scoring for models
 Residuals for all models
 Sum of Squares (cross product
matrix for set variables)
 Multiple Linear Regression
 Generalized Linear Models (GLM)
- All exponential family
distributions: binomial, Gaussian,
inverse Gaussian, Poisson,
Tweedie. Standard link functions
including: cauchit, identity, log,
logit, probit. User defined
distributions & link functions.
 Covariance & Correlation
Matrices
 Logistic Regression
 Classification & Regression Trees
 Predictions/scoring for models
 Residuals for all models
 Histogram
 Line Plot
 Scatter Plot
 Lorenz Curve
 ROC Curves (actual data and
predicted values)
 Histogram
 Line Plot
 Scatter Plot
 Lorenz Curve
 ROC Curves (actual data and
predicted values)
 K-Means K-Means
Statistical ModelingStatistical Modeling
 Decision Trees Decision Trees
Predictive Models Cluster AnalysisData Visualization
Classification
Machine LearningMachine Learning
SimulationSimulation
 Monte Carlo Monte Carlo
HPC Capabilities in RevoScaleR
 Execute (essentially) any R function in
parallel on nodes and cores
 Results from all runs are returned in a list to
the user’s laptop
 Extensive control over parameters
 Extensive control over nodes, cores, and
times to run
 Ideal for simulations and for running R
functions on small amounts of data
Revolution R Enterprise 13
Key ScaleR HPA features
 Handles an arbitrarily large number of rows
in a fixed amount of memory
 Scales linearly with the number of rows
 Scales linearly with the number of nodes
 Scales well with the number of cores per
node
 Scales well with the number of parameters
 Extremely high performance
Revolution R Enterprise 14
Regression comparison using in-memory
data: lm() vs rxLinMod()
Revolution R Enterprise 17
lm() in R
lm() in Revolution R
rxLinMod() in Revolution R
GLM comparison using in-memory data:
glm() vs rxGlm()
Revolution R Enterprise 18
SAS HPA Benchmarking comparison*
Logistic Regression
Rows of data 1 billion 1 billion
Parameters “just a few” 7
Time 80 seconds 44 seconds
Data location In memory On disk
Nodes 32 5
Cores 384 20
RAM 1,536 GB 80 GB
19
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a
20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
*As published by SAS in HPC Wire, April 21, 2011
Double
45%
1/6th
5%
5%
Revolution R Enterprise Delivers Performance at 2% of the CostRevolution R Enterprise Delivers Performance at 2% of the Cost
Allstate compares SAS, Hadoop and R
for Big-Data Insurance Models
Approach Platform Time to fit
SAS 16-core Sun Server 5 hours
Rmr/map reduce 10 node 8 cores-per-node
HadoopCluster
> 10 hours
Open source R 250 GB Server Impossible (> 3 days)
RevoScaleR 5-node (4 cores / node)
LSF cluster
5.7 minutes
Revolution R Enterprise 20
Generalized linear model, 150 million observations, 70 degrees of freedom
http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
What makes ScaleR so fast?
 Lot’s of things!
 This kind of performance requires
 Careful architecting that supports performance
 Constant, intense focus on all of the details
 A review of every line of code with an eye to
performance (in addition to giving the correct
answers, of course)
 Extensive profiling and continuous
benchmarking to detect problems and improve
code
Revolution R Enterprise 21
Specific speed-related factors -- 1
 Efficient computational algorithms
 Efficient memory management – minimize
data copying and data conversion
 Heavy use of C++ templates; optimal code
 Efficient data file format; fast access by row
and column
Revolution R Enterprise 22
Specific speed-related issues -- 2
 Models are pre-analyzed to detect and
remove duplicate computations and points of
failure (singularities)
 Handle categorical variables efficiently
Revolution R Enterprise 23
Parallel External Memory Algorithms
(PEMA’s)
 The HPA analytics algorithms are all built on
a platform that efficiently parallelizes a broad
class of statistical, data mining and machine
learning algorithms
 These Parallel External Memory Algorithms
(PEMA’s) process data a chunk at a time in
parallel across cores and nodes
Revolution R Enterprise 24
Scalability and portability of Revo’s
implementation of PEMA’s
 These PEMA algorithms can process an unlimited
number of rows of data in a fixed amount of RAM.
They process a chunk of data at a time, giving
linear scalability
 They are independent of the “compute context”
(number of cores, computers, distributed
computing platform), giving portability across these
dimensions
 They are independent of where the data is coming
from, giving portability with respect to data sources
Revolution R Enterprise 25
Simplified ScaleR Internal Architecture
Revolution R Enterprise 26
Analytics Engine
PEMA’s are implemented here
(Scalable, Parallelized, Threaded, Distributable)
Inter-process Communication
MPI, RPC, Sockets, Files
Data Sources
HDFS, Teradata, ODBC, SAS, SPSS,
CSV, Fixed, XDF
ScaleR on Hadoop – Base Implementation
 Each iteration (pass through the data) is a separate
MapReduce job
 The mapper produces “intermediate result objects”
for each task that are combined using a combiner
and then a reducer
 A master process decides if another pass through
the data is required
 Allow data to be stored temporarily (or longer) in
XDF binary format for increased speed especially
on iterative algorithms
Revolution R Enterprise 27
Performance enhancements
 A single Map job that handles all iterations,
to reduce overhead of starting multiple jobs
 Tree structured communication and
reduction to reduce time for those
operations, especially on large clusters
 Alternative communication schemes (e.g.
sockets) instead of files
Revolution R Enterprise 28
Functionality in common across HPA
algorithms
 Ability to do on-the-fly data transformations
in the R language
 In-formula: ArrDelay>15 ~ DayOfWeek + …
 As transform: Late = ArrDelay>15
 As transform function: Specify an R function to
process an entire chunk of data at a time
 Row selection
 Both standard weights and frequency
weights
Revolution R Enterprise 29
Portability of user code
 The same analytics code can be used on a
laptop, server, MPI cluster, or Hadoop
 The same analytics code can be used on a
small data.frame in memory and on a huge
distributed data file
Revolution R Enterprise 30
Sample code for logit on laptop
# Specify local data source
airData <- myLocalDataSource
# Specify model formula and parameters
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 31
Sample code for logit on Hadoop
# Change the “compute context”
rxSetComputeContext(myHadoopCluster)
# Change the data source if necessary
airData <- myHadoopDataSource
# Otherwise, the code is the same
rxLogit(ArrDelay>15 ~ Origin + Year +
Month + DayOfWeek + UniqueCarrier +
F(CRSDepTime), data=airData)
Revolution R Enterprise 32
Thank You!
Mario Inchiosa
Revolution R Enterprise 33

Weitere ähnliche Inhalte

Was ist angesagt?

The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
Revolution Analytics
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Willy Marroquin (WillyDevNET)
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Revolution Analytics
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
Revolution Analytics
 

Was ist angesagt? (20)

R and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with HadoopR and Big Data using Revolution R Enterprise with Hadoop
R and Big Data using Revolution R Enterprise with Hadoop
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
New Advances in High Performance Analytics with R: 'Big Data' Decision Trees ...
 
Intro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User WebinarIntro to R for SAS and SPSS User Webinar
Intro to R for SAS and SPSS User Webinar
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
Accelerating R analytics with Spark and  Microsoft R Server  for HadoopAccelerating R analytics with Spark and  Microsoft R Server  for Hadoop
Accelerating R analytics with Spark and Microsoft R Server for Hadoop
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
DeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence ApplicationsDeployR: Revolution R Enterprise with Business Intelligence Applications
DeployR: Revolution R Enterprise with Business Intelligence Applications
 
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ...
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
When Streaming Becomes Strategic
When Streaming Becomes StrategicWhen Streaming Becomes Strategic
When Streaming Becomes Strategic
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 

Andere mochten auch

05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
Revolution Analytics
 
The power of abstraction
The power of abstractionThe power of abstraction
The power of abstraction
ACMBangalore
 
Transformation or Transition
Transformation or TransitionTransformation or Transition
Transformation or Transition
Mike Pounsford
 
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
CROW
 

Andere mochten auch (19)

Revolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R CommunityRevolution Analytics Supports the Open Source R Community
Revolution Analytics Supports the Open Source R Community
 
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
Webinar: Survival Analysis for Marketing Attribution - July 17, 2013
 
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
05Nov13 Webinar: Introducing Revolution R Enterprise 7 - The Big Data Big Ana...
 
American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)
 
R2DOCX example
R2DOCX exampleR2DOCX example
R2DOCX example
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
2013 Future of Open Source - 7th Annual Survey results
2013 Future of Open Source - 7th Annual Survey results2013 Future of Open Source - 7th Annual Survey results
2013 Future of Open Source - 7th Annual Survey results
 
The power of abstraction
The power of abstractionThe power of abstraction
The power of abstraction
 
Youth Participation - learned lessons from Sensoa's history
Youth Participation - learned lessons from Sensoa's historyYouth Participation - learned lessons from Sensoa's history
Youth Participation - learned lessons from Sensoa's history
 
Transformation or Transition
Transformation or TransitionTransformation or Transition
Transformation or Transition
 
Conto+termico ordingroma 4_6+feb+2015 (2)
Conto+termico ordingroma 4_6+feb+2015 (2)Conto+termico ordingroma 4_6+feb+2015 (2)
Conto+termico ordingroma 4_6+feb+2015 (2)
 
One
OneOne
One
 
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
Duurzame asfaltontwikkelingen | Jan Voskuilen (InfraTech 2015)
 
Wie Startups unsere Welt verändern (2015)
Wie Startups unsere Welt verändern (2015) Wie Startups unsere Welt verändern (2015)
Wie Startups unsere Welt verändern (2015)
 
"Ora et Labora" la Via Francigena in provincia di Pavia @BIT2016
"Ora et Labora" la Via Francigena in provincia di Pavia @BIT2016"Ora et Labora" la Via Francigena in provincia di Pavia @BIT2016
"Ora et Labora" la Via Francigena in provincia di Pavia @BIT2016
 
Virtualni svet Second Life
Virtualni svet Second LifeVirtualni svet Second Life
Virtualni svet Second Life
 
Bolsas de Estudo para Australia
Bolsas de Estudo para AustraliaBolsas de Estudo para Australia
Bolsas de Estudo para Australia
 
Ne the god_of_religion_of_islam
Ne the god_of_religion_of_islamNe the god_of_religion_of_islam
Ne the god_of_religion_of_islam
 
最高の自分に進化する方法【コンサル起業実践講座】
最高の自分に進化する方法【コンサル起業実践講座】最高の自分に進化する方法【コンサル起業実践講座】
最高の自分に進化する方法【コンサル起業実践講座】
 

Ähnlich wie High Performance Predictive Analytics in R and Hadoop

Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
templedf
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2
Revolution Analytics
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
Revolution Analytics
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
GapData Institute
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsen
rusersla
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
Revolution Analytics
 

Ähnlich wie High Performance Predictive Analytics in R and Hadoop (20)

Decision trees in hadoop
Decision trees in hadoopDecision trees in hadoop
Decision trees in hadoop
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
 
Revolution Analytics
Revolution AnalyticsRevolution Analytics
Revolution Analytics
 
What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2What's New in Revolution R Enterprise 6.2
What's New in Revolution R Enterprise 6.2
 
Analytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using RAnalytics Beyond RAM Capacity using R
Analytics Beyond RAM Capacity using R
 
microsoft r server for distributed computing
microsoft r server for distributed computingmicrosoft r server for distributed computing
microsoft r server for distributed computing
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
20160317 - PAZUR - PowerBI & R
20160317  - PAZUR - PowerBI & R20160317  - PAZUR - PowerBI & R
20160317 - PAZUR - PowerBI & R
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
An introduction to R is a document useful
An introduction to R is a document usefulAn introduction to R is a document useful
An introduction to R is a document useful
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
useR2011 - Edlefsen
useR2011 - EdlefsenuseR2011 - Edlefsen
useR2011 - Edlefsen
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
 
useR 2014 jskim
useR 2014 jskimuseR 2014 jskim
useR 2014 jskim
 

Mehr von Revolution Analytics

Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint Package
Revolution Analytics
 

Mehr von Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solution
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R Conference
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint Package
 

Kürzlich hochgeladen

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

High Performance Predictive Analytics in R and Hadoop

  • 1. High Performance Predictive Analytics in R and Hadoop Presented by: Mario E. Inchiosa, Ph.D. US Chief Scientist Hadoop Summit 2013 June 27, 2013 1
  • 2. Innovate with R 2  Most widely used data analysis software  Used by 2M+ data scientists, statisticians and analysts  Most powerful statistical programming language  Flexible, extensible and comprehensive for productivity  Create beautiful and unique data visualizations  As seen in New York Times, Twitter and Flowing Data  Thriving open-source community  Leading edge of analytics research  Fills the talent gap  New graduates prefer R Download the White Paper R is Hot bit.ly/r-is-hot
  • 3. R is open source and drives analytic innovation but has some limitations Bigger  data sizes  Speed of  analysis  Production  support Memory Bound Big Data Single Threaded Scale out, parallel processing, high speed Community Support Commercial production support Innovation  and scale Innovative – 5000+ packages, exponential growth Combines with open source R packages where needed
  • 4. 4 Revolution R Enterprise High Performance, Multi-Platform Analytics Platform Revolution R EnterpriseRevolution R Enterprise DeployR Web Services Software Development Kit DeployR Web Services Software Development Kit DevelopR Integrated Development Environment DevelopR Integrated Development Environment ConnectR High Speed & Direct Connectors Teradata, Hadooop (HDFS, Hbase), SAS, SPSS, CSV, OBDC ScaleR High Performance Big Data Analytics Platform LSF, MS HPC Server, MS Azure Burst, SMP Servers DistributedR Distributed Computing Framework Platform LSF, MS HPC Server, MS Azure Burst RevoR Performance Enhanced Open Source R+ CRAN packages IBM PureData (Netezza), Platform LSF, MS HPC Server, MS Azure Burst, Cloudera, Hortonworks, IBM Big Insights, Intel Hadoop, SMP servers Open Source R Plus Revolution Analytics performance enhancements Revolution Analytics Value-Add Components Providing Power and Scale to Open Source R
  • 5. Our objectives with respect to Hadoop  Allow our customers to do predictive analytics as easily in Hadoop as they can using R on their laptops today  Scalable and High Performance  Provide the first enterprise-ready, commercially supported, full-featured, out-of- the-box Predictive Analytics suite running in Hadoop 5
  • 6. Overview of Revolution and Hadoop  Revolution currently provides capabilities for using R and Hadoop together  RHadoop packages: rmr, rhdfs, rhbase  RevoScaleR package: Rapidly read and process data from HDFS  We are in the process of porting RevoScaleR to run fully “in Hadoop.” It currently runs on laptops, workstations, servers, and MPI-based clusters. 6
  • 7. RHadoop: Map-Reduce with R  Unlock data in Hadoop using only the R language  No need to learn Java, Pig, Python or any other language 7
  • 8. RevoScaleR Overview 8  An R package that adds capabilities to R:  Data Import/Clean/Explore/Transform  Analytics – Descriptive and Predictive  Parallel and distributed computing  Visualization  Scales from small local data to huge distributed data  Scales from laptop to server to cluster to cloud  Portable – the same code works on small and big data, and on laptop, server, cluster, Hadoop
  • 9. Key ways RevoScaleR enhances R  High Performance Computing (HPC) functions: Parallel/Distributed computing  High Performance Analytics (HPA) functions: Big Data + Parallel/Distributed computing  XDF file format; rapidly store and extract data  Use results from HPA and HPC functions in other R packages  Visualization functions Revolution R Enterprise 9
  • 10. High Performance Big Data Analytics with ScaleR 10 Statistical Tests Machine Learning Simulation Descriptive Statistics Data Visualization R Data Step Predictive Models Sampling
  • 11. ScaleR: High Performance Scalable Parallel External Memory Algorithms 11  Data import – Delimited, Fixed, SAS, SPSS, OBDC  Variable creation & transformation  Recode variables  Factor variables  Missing value handling  Sort  Merge  Split  Aggregate by category (means, sums)  Data import – Delimited, Fixed, SAS, SPSS, OBDC  Variable creation & transformation  Recode variables  Factor variables  Missing value handling  Sort  Merge  Split  Aggregate by category (means, sums)  Min / Max  Mean  Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations  Min / Max  Mean  Median (approx.)  Quantiles (approx.)  Standard Deviation  Variance  Correlation  Covariance  Sum of Squares (cross product matrix for set variables)  Pairwise Cross tabs  Risk Ratio & Odds Ratio  Cross-Tabulation of Data (standard tables & long form)  Marginal Summaries of Cross Tabulations  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test  Chi Square Test  Kendall Rank Correlation  Fisher’s Exact Test  Student’s t-Test Data Prep, Distillation & Descriptive AnalyticsData Prep, Distillation & Descriptive Analytics  Subsample (observations & variables)  Random Sampling  Subsample (observations & variables)  Random Sampling R Data Step Statistical Tests Sampling Descriptive Statistics
  • 12. ScaleR: High Performance Scalable Parallel External Memory Algorithms 12  Sum of Squares (cross product matrix for set variables)  Multiple Linear Regression  Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.  Covariance & Correlation Matrices  Logistic Regression  Classification & Regression Trees  Predictions/scoring for models  Residuals for all models  Sum of Squares (cross product matrix for set variables)  Multiple Linear Regression  Generalized Linear Models (GLM) - All exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions including: cauchit, identity, log, logit, probit. User defined distributions & link functions.  Covariance & Correlation Matrices  Logistic Regression  Classification & Regression Trees  Predictions/scoring for models  Residuals for all models  Histogram  Line Plot  Scatter Plot  Lorenz Curve  ROC Curves (actual data and predicted values)  Histogram  Line Plot  Scatter Plot  Lorenz Curve  ROC Curves (actual data and predicted values)  K-Means K-Means Statistical ModelingStatistical Modeling  Decision Trees Decision Trees Predictive Models Cluster AnalysisData Visualization Classification Machine LearningMachine Learning SimulationSimulation  Monte Carlo Monte Carlo
  • 13. HPC Capabilities in RevoScaleR  Execute (essentially) any R function in parallel on nodes and cores  Results from all runs are returned in a list to the user’s laptop  Extensive control over parameters  Extensive control over nodes, cores, and times to run  Ideal for simulations and for running R functions on small amounts of data Revolution R Enterprise 13
  • 14. Key ScaleR HPA features  Handles an arbitrarily large number of rows in a fixed amount of memory  Scales linearly with the number of rows  Scales linearly with the number of nodes  Scales well with the number of cores per node  Scales well with the number of parameters  Extremely high performance Revolution R Enterprise 14
  • 15.
  • 16.
  • 17. Regression comparison using in-memory data: lm() vs rxLinMod() Revolution R Enterprise 17 lm() in R lm() in Revolution R rxLinMod() in Revolution R
  • 18. GLM comparison using in-memory data: glm() vs rxGlm() Revolution R Enterprise 18
  • 19. SAS HPA Benchmarking comparison* Logistic Regression Rows of data 1 billion 1 billion Parameters “just a few” 7 Time 80 seconds 44 seconds Data location In memory On disk Nodes 32 5 Cores 384 20 RAM 1,536 GB 80 GB 19 Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. *As published by SAS in HPC Wire, April 21, 2011 Double 45% 1/6th 5% 5% Revolution R Enterprise Delivers Performance at 2% of the CostRevolution R Enterprise Delivers Performance at 2% of the Cost
  • 20. Allstate compares SAS, Hadoop and R for Big-Data Insurance Models Approach Platform Time to fit SAS 16-core Sun Server 5 hours Rmr/map reduce 10 node 8 cores-per-node HadoopCluster > 10 hours Open source R 250 GB Server Impossible (> 3 days) RevoScaleR 5-node (4 cores / node) LSF cluster 5.7 minutes Revolution R Enterprise 20 Generalized linear model, 150 million observations, 70 degrees of freedom http://blog.revolutionanalytics.com/2012/10/allstate-big-data-glm.html
  • 21. What makes ScaleR so fast?  Lot’s of things!  This kind of performance requires  Careful architecting that supports performance  Constant, intense focus on all of the details  A review of every line of code with an eye to performance (in addition to giving the correct answers, of course)  Extensive profiling and continuous benchmarking to detect problems and improve code Revolution R Enterprise 21
  • 22. Specific speed-related factors -- 1  Efficient computational algorithms  Efficient memory management – minimize data copying and data conversion  Heavy use of C++ templates; optimal code  Efficient data file format; fast access by row and column Revolution R Enterprise 22
  • 23. Specific speed-related issues -- 2  Models are pre-analyzed to detect and remove duplicate computations and points of failure (singularities)  Handle categorical variables efficiently Revolution R Enterprise 23
  • 24. Parallel External Memory Algorithms (PEMA’s)  The HPA analytics algorithms are all built on a platform that efficiently parallelizes a broad class of statistical, data mining and machine learning algorithms  These Parallel External Memory Algorithms (PEMA’s) process data a chunk at a time in parallel across cores and nodes Revolution R Enterprise 24
  • 25. Scalability and portability of Revo’s implementation of PEMA’s  These PEMA algorithms can process an unlimited number of rows of data in a fixed amount of RAM. They process a chunk of data at a time, giving linear scalability  They are independent of the “compute context” (number of cores, computers, distributed computing platform), giving portability across these dimensions  They are independent of where the data is coming from, giving portability with respect to data sources Revolution R Enterprise 25
  • 26. Simplified ScaleR Internal Architecture Revolution R Enterprise 26 Analytics Engine PEMA’s are implemented here (Scalable, Parallelized, Threaded, Distributable) Inter-process Communication MPI, RPC, Sockets, Files Data Sources HDFS, Teradata, ODBC, SAS, SPSS, CSV, Fixed, XDF
  • 27. ScaleR on Hadoop – Base Implementation  Each iteration (pass through the data) is a separate MapReduce job  The mapper produces “intermediate result objects” for each task that are combined using a combiner and then a reducer  A master process decides if another pass through the data is required  Allow data to be stored temporarily (or longer) in XDF binary format for increased speed especially on iterative algorithms Revolution R Enterprise 27
  • 28. Performance enhancements  A single Map job that handles all iterations, to reduce overhead of starting multiple jobs  Tree structured communication and reduction to reduce time for those operations, especially on large clusters  Alternative communication schemes (e.g. sockets) instead of files Revolution R Enterprise 28
  • 29. Functionality in common across HPA algorithms  Ability to do on-the-fly data transformations in the R language  In-formula: ArrDelay>15 ~ DayOfWeek + …  As transform: Late = ArrDelay>15  As transform function: Specify an R function to process an entire chunk of data at a time  Row selection  Both standard weights and frequency weights Revolution R Enterprise 29
  • 30. Portability of user code  The same analytics code can be used on a laptop, server, MPI cluster, or Hadoop  The same analytics code can be used on a small data.frame in memory and on a huge distributed data file Revolution R Enterprise 30
  • 31. Sample code for logit on laptop # Specify local data source airData <- myLocalDataSource # Specify model formula and parameters rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 31
  • 32. Sample code for logit on Hadoop # Change the “compute context” rxSetComputeContext(myHadoopCluster) # Change the data source if necessary airData <- myHadoopDataSource # Otherwise, the code is the same rxLogit(ArrDelay>15 ~ Origin + Year + Month + DayOfWeek + UniqueCarrier + F(CRSDepTime), data=airData) Revolution R Enterprise 32