1
Distributed R
The Next Generation Platform for Predictive Analytics
Jorge Martinez
Vishrut Gupta
Ed Ma
April 10th, 2015
2
About me
• FPGAs, Barcelona, 2009
• Embedded software, GPUs, Barcelona, 2011
• Distributed systems and ML, SF, 2013
@jorgemarsal
http://github.com/jorgemarsal
3
The data explosion
4
Horizontal scaling
The shift from BI to Data Science
Happens!
https://www.youtube.com/watch?v=vbb-AjiXyh0
5
Predictive analytics workflow
1. Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB).
2. Build and evaluate predictive models on large datasets using Distributed R.
3. Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.
[Diagram labels: Build Models; Evaluate Models; Deploy Models (In-database scoring); BI Integration]
6
Data Scientists' Preferred Languages: R & SQL
Adoption of R increased across industries
1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html
2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
7
R is …
“The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.”
- Bo Cowgill, Google
8
R is …
• Popular
• Open source
• Flexible
• Extensible
• Not scalable
• No parallel algorithms
• Limited pre/post processing
9
Horizontal scaling
Functional programming and big data
Scale-out
10
Horizontal scaling
“The future has arrived, it’s just
not evenly distributed yet”
- William Gibson
Ship code to data, functional programming
11
Distributed R
The Next Generation Platform for Predictive Analytics
12
Distributed R
A new enterprise-class predictive analytics platform
A scalable, high-performance platform for the R language
• Implemented as an R package
• Open source
• Use familiar GUIs and packages
• Analyze data too large for vanilla R
• Leverage multiple nodes for distributed processing
• Vastly improved performance
13
Distributed R: architecture
Master
• Schedules tasks across the cluster
• Sends commands/code to workers
Workers
• Do the actual work
• Own the data
• Work on independent data partitions in parallel
[Diagram: DistR Master with Worker 1, Worker 2, Worker 3, Worker 4]
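The master/worker split can be sketched in a few lines. This is a minimal single-process illustration in Python, not Distributed R's implementation: the Master and Worker classes, the round-robin placement, and the method names are all hypothetical, chosen only to show workers owning partitions while the master ships a function to them.

```python
class Worker:
    """Hypothetical worker: owns some partitions and runs code sent to it."""

    def __init__(self, name):
        self.name = name
        self.partitions = {}  # partition id -> data owned by this worker

    def run(self, fn):
        # Apply the shipped function to each locally owned partition.
        return {pid: fn(data) for pid, data in self.partitions.items()}


class Master:
    """Hypothetical master: places partitions and sends code to workers."""

    def __init__(self, workers):
        self.workers = workers

    def distribute(self, partitions):
        # Round-robin placement (an assumption; real schedulers may differ).
        for pid, data in enumerate(partitions):
            owner = self.workers[pid % len(self.workers)]
            owner.partitions[pid] = data

    def submit(self, fn):
        # Ship the function to every worker; the data itself never moves.
        # Workers are visited one by one here; a cluster would run them in parallel.
        results = {}
        for worker in self.workers:
            results.update(worker.run(fn))
        return [results[pid] for pid in sorted(results)]


workers = [Worker(f"worker-{i}") for i in range(1, 5)]
master = Master(workers)
master.distribute([[1, 2], [3, 4], [5, 6], [7, 8], [9], [10]])
partition_sums = master.submit(sum)  # one result per partition, in partition order
```

Only the small per-partition results travel back to the master, which is the point of shipping code to data.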
14
Distributed R: Distributed data structures
darray
• Relies on user-defined partitioning
• Also supports distributed data frames and lists
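User-defined partitioning can be illustrated with a small Python sketch. The helpers partition_bounds and split_matrix are hypothetical names, and plain nested lists stand in for a distributed array; the sketch only shows how a user-chosen block size determines which rows and columns land in each partition.

```python
def partition_bounds(nrow, ncol, block_rows, block_cols):
    """Return (row_start, row_end, col_start, col_end) per block, row-major.

    End indices are exclusive; edge blocks are smaller when the block size
    does not divide the array size evenly.
    """
    bounds = []
    for r in range(0, nrow, block_rows):
        for c in range(0, ncol, block_cols):
            bounds.append((r, min(r + block_rows, nrow),
                           c, min(c + block_cols, ncol)))
    return bounds


def split_matrix(matrix, block_rows, block_cols):
    """Cut a list-of-lists matrix into blocks using partition_bounds."""
    nrow, ncol = len(matrix), len(matrix[0])
    return [
        [row[c0:c1] for row in matrix[r0:r1]]
        for r0, r1, c0, c1 in partition_bounds(nrow, ncol, block_rows, block_cols)
    ]


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# Two row-wise partitions of 2 x 3 each; every partition could live on a different worker.
blocks = split_matrix(matrix, block_rows=2, block_cols=3)
```

Choosing row-wise blocks keeps every observation whole inside one partition, which suits algorithms that process rows independently.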
15
Distributed R: Distributed code
foreach
• Express computations over partitions
• Execute across the cluster
f(x)
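The idea of applying f(x) to each partition and combining the results can be sketched in Python. The foreach helper below is a hypothetical stand-in named after the slide's construct, and a local thread pool plays the role of the cluster; it is an illustration of the pattern, not Distributed R's API.

```python
from concurrent.futures import ThreadPoolExecutor


def foreach(partitions, fn, max_workers=4):
    """Apply fn to every partition in parallel, keeping partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, partitions))


def partial_sum_count(partition):
    """Per-partition work: return (sum, count) so results can be merged."""
    return sum(partition), len(partition)


partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = foreach(partitions, partial_sum_count)

# Merge the small per-partition results into one global answer.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
global_mean = total / count
```

Each partition is reduced to a tiny summary before anything is combined, so the merge step stays cheap no matter how large the partitions are.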
16
Distributed R basic demo
17
Distributed R: Built-in distributed algorithms
• Signatures and accuracy similar to the equivalent R packages
• Scalable and high performance
• E.g., regression on billions of rows in a couple of minutes
Algorithm | Use cases
Linear Regression (GLM) | Risk analysis, trend analysis, etc.
Logistic Regression (GLM) | Customer response modeling, healthcare analytics (disease analysis)
Random Forest | Customer churn, market campaign analysis
K-Means Clustering | Customer segmentation, fraud detection, anomaly detection
PageRank | Identify influencers
18
Distributed R March Madness demo
19
Parallel Random Forest Example
Random Forest: building an ensemble of deep decision trees
Need to build 100 decision trees on 4 machines
Each machine builds 25 decision trees
Can use random forest to predict March Madness Bracket
[Figure: example decision tree with split tests X7 > 5, X12 > 3.4, and X3 > 3, and leaf labels 0, 1, 1, 0]
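The split of 100 trees across 4 machines can be sketched as follows. This is a toy Python illustration under stated assumptions: one-split stumps stand in for deep decision trees, the workers run one after another in a loop rather than on separate machines, and every function name is hypothetical.

```python
import random
from collections import Counter


def split_trees(n_trees, n_workers):
    """Divide n_trees as evenly as possible across n_workers."""
    base, extra = divmod(n_trees, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def majority(values, default):
    """Most common value, or default when the list is empty."""
    return Counter(values).most_common(1)[0][0] if values else default


def train_stump(rows, labels, rng):
    """Fit a one-split tree on a bootstrap sample (stand-in for a deep tree)."""
    n = len(rows)
    sample = [rng.randrange(n) for _ in range(n)]   # bootstrap indices
    feature = rng.randrange(len(rows[0]))           # random feature
    threshold = rows[rng.choice(sample)][feature]   # random split point
    left = [labels[i] for i in sample if rows[i][feature] <= threshold]
    right = [labels[i] for i in sample if rows[i][feature] > threshold]
    fallback = majority(labels, None)
    return feature, threshold, majority(left, fallback), majority(right, fallback)


def build_on_worker(worker_id, n_trees, rows, labels):
    """Each worker trains its share of trees with its own random seed."""
    rng = random.Random(worker_id)
    return [train_stump(rows, labels, rng) for _ in range(n_trees)]


def predict(forest, row):
    """Majority vote over every tree in the combined forest."""
    votes = [lo if row[f] <= t else hi for f, t, lo, hi in forest]
    return majority(votes, None)


rows = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

shares = split_trees(100, 4)  # 25 trees per machine
forest = []
for worker_id, share in enumerate(shares):
    # On a cluster, each of these calls would run on a different machine.
    forest.extend(build_on_worker(worker_id, share, rows, labels))
```

Because the trees are trained independently, the only coordination needed is concatenating the per-machine lists into one forest before voting.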
21
March Madness Bracket
Train model to predict individual games
• Use team and opponent features to train a model
• Features: blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy
Calculate the summary statistics of each team
• Group by team and get the mean of each team's features
Predict the result of the game
• Concatenate the summary statistics of the two teams and feed them to the model that predicts individual games
Fill out the bracket by predicting one game at a time
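The bracket steps can be sketched end to end in Python. Everything here is illustrative: the function names, the two-feature toy data, and the toy_model rule (the team with the higher first feature wins) are assumptions standing in for the trained random forest, and the bracket is assumed to hold a power-of-two number of teams.

```python
def team_means(game_rows):
    """Group per-game feature rows by team and average each feature."""
    sums, counts = {}, {}
    for team, feats in game_rows:
        if team not in sums:
            sums[team] = [0.0] * len(feats)
            counts[team] = 0
        sums[team] = [s + f for s, f in zip(sums[team], feats)]
        counts[team] += 1
    return {team: [s / counts[team] for s in sums[team]] for team in sums}


def matchup_features(summary, team, opponent):
    """Concatenate the team's and the opponent's summary statistics."""
    return summary[team] + summary[opponent]


def fill_bracket(teams, summary, predict_win):
    """Predict one game at a time; winners advance until one team remains."""
    current, rounds = list(teams), []
    while len(current) > 1:
        winners = []
        for team, opponent in zip(current[0::2], current[1::2]):
            feats = matchup_features(summary, team, opponent)
            winners.append(team if predict_win(feats) else opponent)
        rounds.append(winners)
        current = winners
    return rounds


def toy_model(feats):
    """Hypothetical stand-in for the trained model: compare first features."""
    half = len(feats) // 2
    return feats[0] >= feats[half]


games = [("A", [10.0, 4.0]), ("A", [20.0, 6.0]), ("B", [5.0, 2.0]),
         ("C", [30.0, 8.0]), ("D", [1.0, 1.0])]
summary = team_means(games)
rounds = fill_bracket(["A", "B", "C", "D"], summary, toy_model)
```

Swapping toy_model for a real classifier's prediction function leaves the rest of the pipeline unchanged.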
22
23
Distributed R Census demo using Shiny
http://15.126.194.41/public/index.html
24
Distributed R rocks!
• Regression on billions of rows in minutes
• Graph algorithms on 10B edges
• Load 400GB+ data from database to R in < 10 minutes
• Open source!
25
That’s cool… what can I do with it?
• Collaborate
• Github (report issues, send PRs) https://github.com/vertica/DistributedR
• Standardization with R-core http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
• Get the SW + docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
• Buy commercial support
26
“The future has already arrived,
it’s just not evenly distributed yet”
- William Gibson
Thank you
http://github.com/vertica/distributedr

More Related Content

What's hot

Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...TigerGraph
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R ServicesGregg Barrett
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project PresentationShubham Shrivastava
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and OntarioBigData_Europe
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisYuanyuan Tian
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormRevolution Analytics
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in RRory Winston
 
BDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBigData_Europe
 
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AIGraph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AITigerGraph
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJADataWorks Summit
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedRevolution Analytics
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingDatabricks
 
How To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARHow To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARAlexander Falk
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph AnalyticsLinkurious
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & SummaryBigData_Europe
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 

What's hot (20)

Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...Better Together: How Graph database enables easy data integration with Spark ...
Better Together: How Graph database enables easy data integration with Spark ...
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Data Visualization Project Presentation
Data Visualization Project PresentationData Visualization Project Presentation
Data Visualization Project Presentation
 
Release webinar: Sansa and Ontario
Release webinar: Sansa and OntarioRelease webinar: Sansa and Ontario
Release webinar: Sansa and Ontario
 
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph AnalysisBig Data Analytics: From SQL to Machine Learning and Graph Analysis
Big Data Analytics: From SQL to Machine Learning and Graph Analysis
 
Batter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and StormBatter Up! Advanced Sports Analytics with R and Storm
Batter Up! Advanced Sports Analytics with R and Storm
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Streaming Data in R
Streaming Data in RStreaming Data in R
Streaming Data in R
 
BDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical OverviewBDE-BDVA Webinar: BDE Technical Overview
BDE-BDVA Webinar: BDE Technical Overview
 
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AIGraph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
Graph Gurus 21: Integrating Real-Time Deep-Link Graph Analytics with Spark AI
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 
Applied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce SettingApplied Machine Learning for Ranking Products in an Ecommerce Setting
Applied Machine Learning for Ranking Products in an Ecommerce Setting
 
How To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGARHow To Download and Process SEC XBRL Data Directly from EDGAR
How To Download and Process SEC XBRL Data Directly from EDGAR
 
Altova NIEM keynote
Altova NIEM keynoteAltova NIEM keynote
Altova NIEM keynote
 
GraphTech Ecosystem - part 2: Graph Analytics
 GraphTech Ecosystem - part 2: Graph Analytics GraphTech Ecosystem - part 2: Graph Analytics
GraphTech Ecosystem - part 2: Graph Analytics
 
Platform introduction & Summary
Platform introduction & SummaryPlatform introduction & Summary
Platform introduction & Summary
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 

Viewers also liked

Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointBuilding a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointTristan Handy
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User GuideAndrey Karpov
 
OSGeo와 Open Data
OSGeo와 Open DataOSGeo와 Open Data
OSGeo와 Open Datar-kor
 
황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결r-kor
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsR Systems International
 
Optimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with ROptimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with RDomino Data Lab
 
The Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldThe Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldGE
 
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...William Markito Oliveira
 
Cloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCorley S.r.l.
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...In-Memory Computing Summit
 
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화r-kor
 
Trading System Design
Trading System DesignTrading System Design
Trading System DesignMarketcalls
 
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LDr-kor
 
Trading sentimental analysis
Trading sentimental analysisTrading sentimental analysis
Trading sentimental analysisMarketcalls
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirDr. Haxel Consult
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangSri Ambati
 

Viewers also liked (20)

Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective ViewpointBuilding a Mature Analytics Workflow: The Analyst Collective Viewpoint
Building a Mature Analytics Workflow: The Analyst Collective Viewpoint
 
resume
resumeresume
resume
 
Hp distributed R User Guide
Hp distributed R User GuideHp distributed R User Guide
Hp distributed R User Guide
 
OSGeo와 Open Data
OSGeo와 Open DataOSGeo와 Open Data
OSGeo와 Open Data
 
황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결황성수 공공데이터 개방과 공공이슈 해결
황성수 공공데이터 개방과 공공이슈 해결
 
Deciphering voice of customer through speech analytics
Deciphering voice of customer through speech analyticsDeciphering voice of customer through speech analytics
Deciphering voice of customer through speech analytics
 
Optimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with ROptimizing Facebook Campaigns with R
Optimizing Facebook Campaigns with R
 
R lecture oga
R lecture ogaR lecture oga
R lecture oga
 
The Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the WorldThe Next List: R&D Breakthroughs that are Changing the World
The Next List: R&D Breakthroughs that are Changing the World
 
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...Implementing a highly scalable stock prediction system with R, Geode, SpringX...
Implementing a highly scalable stock prediction system with R, Geode, SpringX...
 
Cloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT ApplicationsCloud Conf 2015 - Develop and Deploy IOT Applications
Cloud Conf 2015 - Develop and Deploy IOT Applications
 
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
IMCSummit 2015 - Day 2 Developer Track - Implementing a Highly Scalable In-Me...
 
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
오픈데이터와 오픈소스 소프트웨어를 이용한 의료이용정보의 시각화
 
Trading System Design
Trading System DesignTrading System Design
Trading System Design
 
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
구조화된 데이터: Schema.org와 Microdata, RDFa, JSON-LD
 
Trading sentimental analysis
Trading sentimental analysisTrading sentimental analysis
Trading sentimental analysis
 
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla AirII-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
II-SDV 2016 Patrick Beaucamp - Data Science with R and Vanilla Air
 
Language R
Language RLanguage R
Language R
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
 

Similar to Distributed R: The Next Generation Platform for Predictive Analytics

End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed RJorge Martinez de Salinas
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesRevolution Analytics
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software EngineeringMiroslaw Staron
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar Revolution Analytics
 
Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Naveen Kumar
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDatabricks
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data SciencePouria Amirian
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15MLconf
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Precisely
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfCarlos Paredes
 
Analytics what to look for sustaining your growing business-
Analytics   what to look for sustaining your growing business-Analytics   what to look for sustaining your growing business-
Analytics what to look for sustaining your growing business-Ajay Ohri
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowJan Kirenz
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedRobert Grossman
 

Similar to Distributed R: The Next Generation Platform for Predictive Analytics (20)

End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed REnd-to-end Machine Learning Pipelines with HP Vertica and Distributed R
End-to-end Machine Learning Pipelines with HP Vertica and Distributed R
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success Rates
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar 18Mar14 Find the Hidden Signal in Market Data Noise Webinar
18Mar14 Find the Hidden Signal in Market Data Noise Webinar
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...Hedge Fund case study solution - Credit default swaps execution system and Gr...
Hedge Fund case study solution - Credit default swaps execution system and Gr...
 
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at NationwideDeploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
Deploying Enterprise Scale Deep Learning in Actuarial Modeling at Nationwide
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Data Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data ScienceData Science as a Service: Intersection of Cloud Computing and Data Science
Data Science as a Service: Intersection of Cloud Computing and Data Science
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
 
Pan Dhoni - Modernizing Data And Analytics using AI.pdf
Pan Dhoni - Modernizing Data And Analytics using AI.pdfPan Dhoni - Modernizing Data And Analytics using AI.pdf
Pan Dhoni - Modernizing Data And Analytics using AI.pdf
 
Dinkar mishra101206
Dinkar mishra101206Dinkar mishra101206
Dinkar mishra101206
 
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
Machine Learning & IT Service Intelligence for the Enterprise: The Future is ...
 
CV
CVCV
CV
 
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdfMachine_Learning_with_MATLAB_Seminar_Latest.pdf
Machine_Learning_with_MATLAB_Seminar_Latest.pdf
 
Analytics what to look for sustaining your growing business-
Analytics   what to look for sustaining your growing business-Analytics   what to look for sustaining your growing business-
Analytics what to look for sustaining your growing business-
 
Evaluation guide to Streaming Analytics
Evaluation guide to Streaming AnalyticsEvaluation guide to Streaming Analytics
Evaluation guide to Streaming Analytics
 
Resume kartikeya sharma
Resume kartikeya sharmaResume kartikeya sharma
Resume kartikeya sharma
 
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & KubeflowMLOps - Build pipelines with Tensor Flow Extended & Kubeflow
MLOps - Build pipelines with Tensor Flow Extended & Kubeflow
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 

Recently uploaded

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Distributed R: The Next Generation Platform for Predictive Analytics

  • 1. Distributed R: The Next Generation Platform for Predictive Analytics. Jorge Martinez, Vishrut Gupta, Ed Ma. April 10th, 2015
  • 4. Horizontal scaling. The shift from BI to data science happens! https://www.youtube.com/watch?v=vbb-AjiXyh0
  • 5. Predictive analytics workflow: build models, evaluate models, deploy models (in-database scoring), BI integration.
      • Step 1: Ingest and prepare data by leveraging HP Vertica Analytics Platform (SQL DB).
      • Step 2: Build and evaluate predictive models on large datasets using Distributed R.
      • Step 3: Deploy models to Vertica and use in-database scoring to produce prediction results for BI and applications.
  • 6. Data scientists' preferred languages: R & SQL. Adoption of R has increased across industries. 1) http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html 2) http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html
  • 7. R is… “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” - Bo Cowgill, Google
  • 8. R is… Strengths: popular, open source, flexible, extensible. Weaknesses: not scalable, no parallel algorithms, limited pre/post processing.
  • 9. Horizontal scaling: functional programming and big data. Scale-out.
  • 10. Horizontal scaling. “The future has arrived, it’s just not evenly distributed yet” - William Gibson. Ship code to data; functional programming.
  • 11. Distributed R: The Next Generation Platform for Predictive Analytics
  • 12. Distributed R: a new enterprise-class predictive analytics platform. A scalable, high-performance platform for the R language. Implemented as an R package. Open source. Use familiar GUIs and packages. Analyze data too large for vanilla R. Leverage multiple nodes for distributed processing. Vastly improved performance.
  • 13. Distributed R: architecture.
      • Master: schedules tasks across the cluster; sends commands/code to workers.
      • Workers: do the actual work; own the data; work on independent data partitions in parallel.
      • (Diagram: DistR master connected to Worker 1, Worker 2, Worker 3 and Worker 4.)
  • 14. Distributed R: distributed data structures. darray: relies on user-defined partitioning. Distributed data frames and lists are also supported.
  • 15. Distributed R: distributed code. foreach: express computations, f(x), over partitions and execute them across the cluster.
  • 17. Distributed R: built-in distributed algorithms. Similar signature and accuracy to the R packages. Scalable and high performance, e.g., regression on billions of rows in a couple of minutes. Algorithms and use cases:
      • Linear Regression (GLM): risk analysis, trend analysis, etc.
      • Logistic Regression (GLM): customer response modeling, healthcare analytics (disease analysis)
      • Random Forest: customer churn, market campaign analysis
      • K-Means Clustering: customer segmentation, fraud detection, anomaly detection
      • Page Rank: identify influencers
  • 18. Distributed R March Madness demo
  • 19. Parallel Random Forest example. Random Forest builds an ensemble of deep decision trees. To build 100 decision trees on 4 machines, each machine builds 25 decision trees. Random forest can be used to predict a March Madness bracket. (Diagram: example decision tree with threshold splits such as X7 > 5, X12 > 3.4 and X3 > 3, and 0/1 leaves.)
  • 20. March Madness bracket.
      • Train a model to predict individual games, using team and opponent features: blocks, steals, assists, rebounds, free throw accuracy, field goal accuracy, 3-point accuracy.
      • Calculate the summary statistics of each team: group by team and take the mean of each team’s features.
      • Predict the result of a game: concatenate the summary statistics of the team and its opponent and feed them to the model that predicts individual games.
      • Fill out the bracket by predicting one game at a time.
  • 22. Distributed R Census demo using Shiny: http://15.126.194.41/public/index.html
  • 23. Distributed R rocks! Regression on billions of rows in minutes. Graph algorithms on 10B edges. Load 400GB+ of data from database to R in < 10 minutes. Open source!
  • 24. That’s cool… what can I do with it?
      • Collaborate on GitHub (report issues, send PRs): https://github.com/vertica/DistributedR
      • Standardization with R-core: http://www.r-bloggers.com/enhancing-r-for-distributed-computing/
      • Get the software and docs: http://www.vertica.com/hp-vertica-products/hp-vertica-distributed-r/
      • Buy commercial support
  • 25. “The future has already arrived, it’s just not evenly distributed yet” - William Gibson
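
The architecture, darray, foreach and Parallel Random Forest slides all describe the same pattern: split the work into partitions, let each worker process its own partition in parallel, then combine the results. The sketch below illustrates that pattern in plain Python with a thread pool. It is not the Distributed R API; the helper names `split_evenly`, `partition_rows` and `foreach_partition` are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(total, workers):
    """Split `total` tasks across `workers`, spreading any remainder one task at a time."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def partition_rows(rows, n_parts):
    """Cut a list of rows into n_parts contiguous partitions (user-defined partitioning)."""
    parts, start = [], 0
    for size in split_evenly(len(rows), n_parts):
        parts.append(rows[start:start + size])
        start += size
    return parts


def foreach_partition(parts, fn):
    """Apply fn to every partition in parallel, one worker per partition, keeping order."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        return list(pool.map(fn, parts))


# The random forest slide's example: 100 trees on 4 machines.
trees_per_machine = split_evenly(100, 4)

# Partitioned sum: each worker sums its own partition,
# then the master combines the partial sums.
partitions = partition_rows(list(range(1, 11)), 4)
partial_sums = foreach_partition(partitions, sum)
total = sum(partial_sums)
```

Threads are used here only to keep the sketch self-contained; in the deck the workers are separate nodes that own their partitions, so code is shipped to the data instead of the other way round.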
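
The feature preparation described on the March Madness Bracket slide (group by team, average each feature, then concatenate the two teams' summaries) can be sketched as follows. This is a minimal plain-Python illustration, not the demo's actual code: the function names, the team labels "A" and "B", and the numbers are all made up for the example.

```python
from collections import defaultdict


def team_means(game_rows):
    """Group per-game stat rows by team and average each feature."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in game_rows:
        team = row["team"]
        counts[team] += 1
        for name, value in row.items():
            if name != "team":
                sums[team][name] += value
    return {
        team: {name: total / counts[team] for name, total in feats.items()}
        for team, feats in sums.items()
    }


def matchup_features(summary, team, opponent, feature_names):
    """Concatenate the team's and the opponent's summary statistics into one vector."""
    return ([summary[team][f] for f in feature_names]
            + [summary[opponent][f] for f in feature_names])


# Made-up per-game statistics, for illustration only.
games = [
    {"team": "A", "blocks": 4, "steals": 6},
    {"team": "A", "blocks": 6, "steals": 8},
    {"team": "B", "blocks": 2, "steals": 10},
]
summary = team_means(games)
vector = matchup_features(summary, "A", "B", ["blocks", "steals"])
```

A vector like this is what would be fed to the trained per-game model; filling out the bracket then repeats the prediction one game at a time, advancing the predicted winner.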