SlideShare a Scribd company logo
1 of 20
Speeding up R
with Parallel Programming in the Cloud
David M Smith
Developer Advocate, Microsoft
@revodavid
Hi. I’m David.
• Cloud Developer Advocate at
Microsoft
– cda.ms/7f
• Board Member, R Consortium
• Editor, Revolutions blog
– cda.ms/7g
• Twitter: @revodavid
Speeding up your R code
• Vectorize
3 @revodavid
• Vectorize
• Moar megahertz!
• Moar RAM!
• Moar cores!
• Moar computers!
• Moar cloud!
Embarassingly Parallel Problems
Easy to speed things up when:
• Calculating similar things many times
– Iterations in a loop, chunks of data, …
• Calculations are independent of each other
• Each calculation takes a decent amount of time
Just run multiple calculations at the same time
4 @revodavid
The Birthday Paradox
What is the likelihood that there are two people in this room
who share the same birthday?
5 @revodavid
Birthday Problem Simulator
6
pbirthdaysim <- function(n) {
ntests <- 100000
pop <- 1:365
anydup <- function(i)
any(duplicated(
sample(pop, n, replace=TRUE)))
sum(sapply(seq(ntests), anydup)) / ntests
}
bdayp <- sapply(1:100, pbirthdaysim)  About 3 minutes
(on this laptop)
library(foreach)
Looping with the foreach package on CRAN
– x is a list of results
– each entry calculated from RHS of %dopar%
Learn more about foreach: cda.ms/tc
7 @revodavid
x <- foreach (n=1:100) %dopar% pbirthdaysim(n)
Parallel processing with foreach
• Change how processing is done by registering a backend
– registerDoSEQ() sequential processing (default)
– registerdoMC() local cores via multicore (Mac / Linux)
– registerdoParallel() local cluster via library(parallel)
– registerdoFuture() HPC schedulers incl LSF, OpenLava & Slurm
– registerAzureParallel() remote cluster in Azure Batch
• Whatever you use, call to foreach does not change
– Also: no need to worry about data, packages etc. (mostly)
8 @revodavid
Local multi-process: foreach + doParallel
library(doParallel)
cl <- makeCluster(2) # local cluster, 2 workers
registerDoParallel(cl)
bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
Launches a separate R process for each task
Works on all platforms (Windows / Mac / Unix)
9 @revodavid
Single node multicore: doMC on 16-CPU VM
10 @revodavid
> library(doMC)
> registerdoMC(cluster)
> system.time( bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n) )
user system elapsed
291.617 0.760 20.743
About 11x times faster than sequential
• 20s vs 220s (using Azure DS5v2 instance)
Note: disable multithreaded BLAS to avoid core contention
• In Microsoft R Open: setMKLthreads(1)
Cloud cluster: foreach + doAzureParallel
doAzureParallel: A simple, open-source R package that uses the
Azure Batch cluster service as a parallel-backend for foreach
github.com/Azure/doAzureParallel
Birthday simulation: cluster
8-node cluster (standard D2v2: 2 vCPU, 7 Gb)
• specify VM class in cluster.json
• specify credentials for Azure Batch and Azure Storage in credentials.json
12 @revodavid
library(doAzureParallel)
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)
bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n)
bdayp <- unlist(bdayp)
cluster.json (excerpt):
"name": "davidsmi8caret",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 8,
"poolSize": {
"dedicatedNodes": {
"min": 8,
"max": 8
}
45 seconds (more than 5 times faster) on a warm start
Cross-validation with caret
• Most predictive modeling algorithms have
“tuning parameters”
• Example: Boosted Trees
– Boosting iterations
– Max Tree Depth
– Shrinkage
• Parameters affect model performance
• Try ‘em out: cross-validate
14 @revodavid
grid <-
data.frame(
nrounds = …,
max_depth = …,
gamma = …,
colsample_bytree = …,
min_child_weight = …,
subsample = …)
)
Cross-validation in parallel
• Caret’s train function will automatically
use the registered foreach backend
• Just register your cluster first:
registerDoAzureParallel(cluster)
• Handles sending objects, packages to
nodes
15 @revodavid
mod <- train(
Class ~ .,
data = dat,
method = "xgbTree",
trControl = ctrl,
tuneGrid = grid,
nthread = 1
)
Low Priority Nodes
• Low-Priority = (very) Low Costs VMs from surplus capacity
– up to 80% discount
• Clusters can mix dedicated VMs and low-priority VMs
My Local R
Session
Azure Batch
Low Priority VMs
at up to 80% discount
Dedicated VMs "poolSize": {
"dedicatedNodes": {
"min": 3,
"max": 3
},
"lowPriorityNodes": {
"min": 9,
"max": 9
},
caret speed-ups
• Max Kuhn
benchmarked various
hardware and OS for
local parallel
– cda.ms/6V
• Let’s see how it works
with doAzureParallel
17 @revodavid
Source: cda.ms/6V
Packages and Containers
• Docker images used to spawn nodes
– Default: rocker/tidyverse:latest
– Lots of R packages pre-installed
• But this cross-validation also needs:
– xgboost, e1071
• Easy fix: add to cluster.json
18 @revodavid
{
"name": "davidsmi8caret",
"vmSize": "Standard_D2_v2",
"maxTasksPerNode": 8,
"poolSize": {
"dedicatedNodes": {
"min": 4,
"max": 4
},
"lowPriorityNodes": {
"min": 4,
"max": 4
},
"autoscaleFormula": "QUEUE"
},
"containerImage":
"rocker/tidyverse:latest",
"rPackages": {
"cran": ["xgboost","e1071"],
"github": [],
"bioconductor": []
},
"commandLine": []
}
19 @revodavid
===================================================================================
Id: job20180126022301
chunkSize: 1
enableCloudCombine: TRUE
packages:
caret;
errorHandling: stop
wait: TRUE
autoDeleteJob: TRUE
===================================================================================
Submitting tasks (1250/1250)
Submitting merge task. . .
Job Preparation Status: Package(s) being installed.......
Waiting for tasks to complete. . .
| Progress: 13.84% (173/1250) | Running: 59 | Queued: 1018 | Completed: 173 | Failed: 0 |
MY LAPTOP: 78 minutes
THIS CLUSTER: 16 minutes
(almost 5x faster)
Compute costs
• Pay by the minute for VMs used in cluster
• Using D2v2 Virtual Machines
– Ubuntu 16, 7Mb RAM, 2-core “compute optimized”
• 17 minutes × 8 VMs @ $0.14 / hour
– about 32 cents (not counting startup or storage)
20 @revodavid
Thank you!
Find foreach on CRAN: CRAN.R-project.org/package=foreach
Code for birthday problem simulation: cda.ms/7d
Docs for the foreach package: cda.ms/tc
Get doAzureParallel: cda.ms/7w
Get Azure subscription with $200 free credits: cda.ms/td
David Smith
davidsmi@microsoft.com
22 @revodavid

More Related Content

What's hot

Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...DataStax
 
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...DataStax
 
1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBSJim Plush
 
Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules RestructuredDoiT International
 
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per SecondAmazon Web Services
 
Mesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run CassandraMesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run CassandraDataStax Academy
 
pgday.seoul 2019: TimescaleDB
pgday.seoul 2019: TimescaleDBpgday.seoul 2019: TimescaleDB
pgday.seoul 2019: TimescaleDBChan Shik Lim
 
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...DataStax
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Cosmin Lehene
 
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...DataStax
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writesInstaclustr
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014Amazon Web Services
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemAvleen Vig
 
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014Amazon Web Services
 
Understanding DSE Search by Matt Stump
Understanding DSE Search by Matt StumpUnderstanding DSE Search by Matt Stump
Understanding DSE Search by Matt StumpDataStax
 
Multi-Region Cassandra Clusters
Multi-Region Cassandra ClustersMulti-Region Cassandra Clusters
Multi-Region Cassandra ClustersInstaclustr
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... CassandraInstaclustr
 

What's hot (20)

Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
Leveraging Docker and CoreOS to provide always available Cassandra at Instacl...
 
Move Over, Rsync
Move Over, RsyncMove Over, Rsync
Move Over, Rsync
 
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
Cassandra Backups and Restorations Using Ansible (Joshua Wickman, Knewton) | ...
 
1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS1 Million Writes per second on 60 nodes with Cassandra and EBS
1 Million Writes per second on 60 nodes with Cassandra and EBS
 
Terraform Modules Restructured
Terraform Modules RestructuredTerraform Modules Restructured
Terraform Modules Restructured
 
Cassandra at teads
Cassandra at teadsCassandra at teads
Cassandra at teads
 
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second
(BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second
 
Mesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run CassandraMesosphere and Contentteam: A New Way to Run Cassandra
Mesosphere and Contentteam: A New Way to Run Cassandra
 
pgday.seoul 2019: TimescaleDB
pgday.seoul 2019: TimescaleDBpgday.seoul 2019: TimescaleDB
pgday.seoul 2019: TimescaleDB
 
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
Cassandra on Mesos Across Multiple Datacenters at Uber (Abhishek Verma) | C* ...
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
 
Deep Dive on Amazon EC2
Deep Dive on Amazon EC2Deep Dive on Amazon EC2
Deep Dive on Amazon EC2
 
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra ...
 
Micro-batching: High-performance writes
Micro-batching: High-performance writesMicro-batching: High-performance writes
Micro-batching: High-performance writes
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log system
 
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014
(APP310) Scheduling Using Apache Mesos in the Cloud | AWS re:Invent 2014
 
Understanding DSE Search by Matt Stump
Understanding DSE Search by Matt StumpUnderstanding DSE Search by Matt Stump
Understanding DSE Search by Matt Stump
 
Multi-Region Cassandra Clusters
Multi-Region Cassandra ClustersMulti-Region Cassandra Clusters
Multi-Region Cassandra Clusters
 
Everyday I’m scaling... Cassandra
Everyday I’m scaling... CassandraEveryday I’m scaling... Cassandra
Everyday I’m scaling... Cassandra
 

Similar to Speeding up R with Parallel Programming in the Cloud

Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...DataStax Academy
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsJulien Anguenot
 
(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep DiveAmazon Web Services
 
Cassandra and Docker Lessons Learned
Cassandra and Docker Lessons LearnedCassandra and Docker Lessons Learned
Cassandra and Docker Lessons LearnedDataStax Academy
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudBrendan Gregg
 
Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLOlivier Doucet
 
Tips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsTips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsAshish Thapliyal
 
Nodejs - Should Ruby Developers Care?
Nodejs - Should Ruby Developers Care?Nodejs - Should Ruby Developers Care?
Nodejs - Should Ruby Developers Care?Felix Geisendörfer
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13Dave Gardner
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13Dave Gardner
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.jsorkaplan
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Michael Renner
 
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...Amazon Web Services
 
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...NETWAYS
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffTimescale
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Community
 

Similar to Speeding up R with Parallel Programming in the Cloud (20)

Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
iland Internet Solutions: Leveraging Cassandra for real-time multi-datacenter...
 
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analyticsLeveraging Cassandra for real-time multi-datacenter public cloud analytics
Leveraging Cassandra for real-time multi-datacenter public cloud analytics
 
(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive(DAT407) Amazon ElastiCache: Deep Dive
(DAT407) Amazon ElastiCache: Deep Dive
 
Cassandra and Docker Lessons Learned
Cassandra and Docker Lessons LearnedCassandra and Docker Lessons Learned
Cassandra and Docker Lessons Learned
 
Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
 
Performance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloudPerformance Analysis: new tools and concepts from the cloud
Performance Analysis: new tools and concepts from the cloud
 
Finding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQLFinding an unusual cause of max_user_connections in MySQL
Finding an unusual cause of max_user_connections in MySQL
 
Tips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight DeploymentsTips, Tricks & Best Practices for large scale HDInsight Deployments
Tips, Tricks & Best Practices for large scale HDInsight Deployments
 
Devops kc
Devops kcDevops kc
Devops kc
 
Nodejs - Should Ruby Developers Care?
Nodejs - Should Ruby Developers Care?Nodejs - Should Ruby Developers Care?
Nodejs - Should Ruby Developers Care?
 
Planning to Fail #phpuk13
Planning to Fail #phpuk13Planning to Fail #phpuk13
Planning to Fail #phpuk13
 
Planning to Fail #phpne13
Planning to Fail #phpne13Planning to Fail #phpne13
Planning to Fail #phpne13
 
BigData Developers MeetUp
BigData Developers MeetUpBigData Developers MeetUp
BigData Developers MeetUp
 
introduction to node.js
introduction to node.jsintroduction to node.js
introduction to node.js
 
Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014Postgres Vienna DB Meetup 2014
Postgres Vienna DB Meetup 2014
 
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
AWS APAC Webinar Week - AWS MySQL Relational Database Services Best Practices...
 
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...
OSDC 2015: Mitchell Hashimoto | Automating the Modern Datacenter, Development...
 
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade OffDatabases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off
 
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan XuCeph Goes on Online at Qihoo 360 - Xuehan Xu
Ceph Goes on Online at Qihoo 360 - Xuehan Xu
 

More from Revolution Analytics

Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution Analytics
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solutionRevolution Analytics
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceRevolution Analytics
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageRevolution Analytics
 

More from Revolution Analytics (20)

The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
Revolution R Enterprise 7.4 - Presentation by Bill Jacobs 11Jun15
 
Warranty Predictive Analytics solution
Warranty Predictive Analytics solutionWarranty Predictive Analytics solution
Warranty Predictive Analytics solution
 
Reproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R ConferenceReproducibility with Checkpoint & RRO - NYC R Conference
Reproducibility with Checkpoint & RRO - NYC R Conference
 
Reproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint PackageReproducibility with Revolution R Open and the Checkpoint Package
Reproducibility with Revolution R Open and the Checkpoint Package
 

Recently uploaded

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 

Recently uploaded (20)

The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Speeding up R with Parallel Programming in the Cloud

  • 1. Speeding up R with Parallel Programming in the Cloud David M Smith Developer Advocate, Microsoft @revodavid
  • 2. Hi. I’m David. • Cloud Developer Advocate at Microsoft – cda.ms/7f • Board Member, R Consortium • Editor, Revolutions blog – cda.ms/7g • Twitter: @revodavid
  • 3. Speeding up your R code • Vectorize 3 @revodavid • Vectorize • Moar megahertz! • Moar RAM! • Moar cores! • Moar computers! • Moar cloud!
  • 4. Embarassingly Parallel Problems Easy to speed things up when: • Calculating similar things many times – Iterations in a loop, chunks of data, … • Calculations are independent of each other • Each calculation takes a decent amount of time Just run multiple calculations at the same time 4 @revodavid
  • 5. The Birthday Paradox What is the likelihood that there are two people in this room who share the same birthday? 5 @revodavid
  • 6. Birthday Problem Simulator 6 pbirthdaysim <- function(n) { ntests <- 100000 pop <- 1:365 anydup <- function(i) any(duplicated( sample(pop, n, replace=TRUE))) sum(sapply(seq(ntests), anydup)) / ntests } bdayp <- sapply(1:100, pbirthdaysim)  About 3 minutes (on this laptop)
  • 7. library(foreach) Looping with the foreach package on CRAN – x is a list of results – each entry calculated from RHS of %dopar% Learn more about foreach: cda.ms/tc 7 @revodavid x <- foreach (n=1:100) %dopar% pbirthdaysim(n)
  • 8. Parallel processing with foreach • Change how processing is done by registering a backend – registerDoSEQ() sequential processing (default) – registerdoMC() local cores via multicore (Mac / Linux) – registerdoParallel() local cluster via library(parallel) – registerdoFuture() HPC schedulers incl LSF, OpenLava & Slurm – registerAzureParallel() remote cluster in Azure Batch • Whatever you use, call to foreach does not change – Also: no need to worry about data, packages etc. (mostly) 8 @revodavid
  • 9. Local multi-process: foreach + doParallel library(doParallel) cl <- makeCluster(2) # local cluster, 2 workers registerDoParallel(cl) bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n) Launches a separate R process for each task Works on all platforms (Windows / Mac / Unix) 9 @revodavid
  • 10. Single node multicore: doMC on 16-CPU VM 10 @revodavid > library(doMC) > registerdoMC(cluster) > system.time( bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n) ) user system elapsed 291.617 0.760 20.743 About 11x times faster than sequential • 20s vs 220s (using Azure DS5v2 instance) Note: disable multithreaded BLAS to avoid core contention • In Microsoft R Open: setMKLthreads(1)
  • 11. Cloud cluster: foreach + doAzureParallel doAzureParallel: A simple, open-source R package that uses the Azure Batch cluster service as a parallel-backend for foreach github.com/Azure/doAzureParallel
  • 12. Birthday simulation: cluster 8-node cluster (standard D2v2: 2 vCPU, 7 Gb) • specify VM class in cluster.json • specify credentials for Azure Batch and Azure Storage in credentials.json 12 @revodavid library(doAzureParallel) setCredentials("credentials.json") cluster <- makeCluster("cluster.json") registerDoAzureParallel(cluster) bdayp <- foreach(n=1:100) %dopar% pbirthdaysim(n) bdayp <- unlist(bdayp) cluster.json (excerpt): "name": "davidsmi8caret", "vmSize": "Standard_D2_v2", "maxTasksPerNode": 8, "poolSize": { "dedicatedNodes": { "min": 8, "max": 8 } 45 seconds (more than 5 times faster) on a warm start
  • 13. Cross-validation with caret • Most predictive modeling algorithms have “tuning parameters” • Example: Boosted Trees – Boosting iterations – Max Tree Depth – Shrinkage • Parameters affect model performance • Try ‘em out: cross-validate 14 @revodavid grid <- data.frame( nrounds = …, max_depth = …, gamma = …, colsample_bytree = …, min_child_weight = …, subsample = …) )
  • 14. Cross-validation in parallel • Caret’s train function will automatically use the registered foreach backend • Just register your cluster first: registerDoAzureParallel(cluster) • Handles sending objects, packages to nodes 15 @revodavid mod <- train( Class ~ ., data = dat, method = "xgbTree", trControl = ctrl, tuneGrid = grid, nthread = 1 )
  • 15. Low Priority Nodes • Low-Priority = (very) Low Costs VMs from surplus capacity – up to 80% discount • Clusters can mix dedicated VMs and low-priority VMs My Local R Session Azure Batch Low Priority VMs at up to 80% discount Dedicated VMs "poolSize": { "dedicatedNodes": { "min": 3, "max": 3 }, "lowPriorityNodes": { "min": 9, "max": 9 },
  • 16. caret speed-ups • Max Kuhn benchmarked various hardware and OS for local parallel – cda.ms/6V • Let’s see how it works with doAzureParallel 17 @revodavid Source: cda.ms/6V
  • 17. Packages and Containers • Docker images used to spawn nodes – Default: rocker/tidyverse:latest – Lots of R packages pre-installed • But this cross-validation also needs: – xgboost, e1071 • Easy fix: add to cluster.json 18 @revodavid { "name": "davidsmi8caret", "vmSize": "Standard_D2_v2", "maxTasksPerNode": 8, "poolSize": { "dedicatedNodes": { "min": 4, "max": 4 }, "lowPriorityNodes": { "min": 4, "max": 4 }, "autoscaleFormula": "QUEUE" }, "containerImage": "rocker/tidyverse:latest", "rPackages": { "cran": ["xgboost","e1071"], "github": [], "bioconductor": [] }, "commandLine": [] }
  • 18. 19 @revodavid =================================================================================== Id: job20180126022301 chunkSize: 1 enableCloudCombine: TRUE packages: caret; errorHandling: stop wait: TRUE autoDeleteJob: TRUE =================================================================================== Submitting tasks (1250/1250) Submitting merge task. . . Job Preparation Status: Package(s) being installed....... Waiting for tasks to complete. . . | Progress: 13.84% (173/1250) | Running: 59 | Queued: 1018 | Completed: 173 | Failed: 0 | MY LAPTOP: 78 minutes THIS CLUSTER: 16 minutes (almost 5x faster)
  • 19. Compute costs • Pay by the minute for VMs used in cluster • Using D2v2 Virtual Machines – Ubuntu 16, 7Mb RAM, 2-core “compute optimized” • 17 minutes × 8 VMs @ $0.14 / hour – about 32 cents (not counting startup or storage) 20 @revodavid
  • 20. Thank you! Find foreach on CRAN: CRAN.R-project.org/package=foreach Code for birthday problem simulation: cda.ms/7d Docs for the foreach package: cda.ms/tc Get doAzureParallel: cda.ms/7w Get Azure subscription with $200 free credits: cda.ms/td David Smith davidsmi@microsoft.com 22 @revodavid