SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Downloaden Sie, um offline zu lesen
Automating Data Science on Spark: 

A Bayesian Approach
Vu Pham, Huan Dao, Christopher Nguyen
San Francisco, June 8th 2016
@arimoinc
Feature
Engineering Model
Selection
Hyperparameter
Tuning
Data Science
is
so MANUAL!
@arimoinc
Natural
Interfaces
Machine
Learning
Collaboration
Agenda
@arimoinc
1. For Hyper-parameter Tuning
2. For “Automating” Data Science on Spark
3. Experiments
@arimoinc
Bayesian Optimization
for Hyper-parameter Tuning
@arimoinc
Feature
Engineering Model
Selection
Hyper-parameter
Tuning
We can
automate
this?
What if…
Hyper-parameters
@arimoinc
K-means K
Neural Network # of layers, dropout, momentum…
Random Forest Feature set, # of trees, max depth…
SVM Regularization term (C)
Gradient Descent Learning rate, number of iterations…
Manual Search
@arimoinc
Grid Search
@arimoinc
Random Search
@arimoinc
Hyper-parameters tuning
@arimoinc
U: S R
c U(c)
Hyper-parameter Space Performance onValidation Set
How to intelligently select the next configuration?
(Given the observations in the past)
Maximize a utility functions over hyper-parameter space:
@arimoinc
Bayesian Inference
Courtesy: http://www2.stat.duke.edu/~mw/fineart.html
Bayesian Optimization Explained
@arimoinc
1. Incorporate a prior over the space of possible objective functions (GP)
2. Combine the prior with the likelihood to obtain a posterior over
function values given observations
3. Select next configuration to evaluate based on the posterior
• According to an acquisition function
Bayesian Optimization Explained
@arimoinc
@arimoinc
expensive
multi-modal
noisy
blackbox
Bayesian Optimization is to
globally optimize functions that are:
@arimoinc
Bayesian Optimization for
“Automating” Data Science
on Spark
Automatic Data Science Workflow
@arimoinc
Feature
Engineering
Model
Selection
Hyperparameter
Tuning
Machines doing part of Data Science
@arimoinc
Dataset
Feature preprocessing
Data Cleaning
Generate new features
Feature selection
Dimensionality
reduction
Predictive modeling Scoring
A generic Machine Learning pipeline
@arimoinc
Training set
Val/Test set
Feature preprocessing
Truncated SVD
n: # of dims
f-score with target variable
Keep alpha percents top features
K-means
k: # of clusters
RandomForest #1
number of trees
maximum depth
percentage of features
RandomForest #k
…
Scoring
Spark WorkersSpark Workers
Pipeline on DDF and BayesOpt
@arimoinc
Spark Driver
DDF
Arimo Pipeline Manager
…
Client
Bayesian Optimizer
Client uses Bayesian Optimizer to select the hyper-parameters
of the pipeline so that it maximizes the performance on a validation set
Spark Workers
Pipeline on DDF and BayesOpt
@arimoinc
train_ddf = session.get_ddf(…)
valid_ddf = session.get_ddf(…)
optimizer = SpearmintOptimizer(chooser_name=‘GPEIperSecChooser',
max_finished_jobs=max_iters, grid_size=5000, ..)
best_params, trace = auto_model(
optimizer, train_ddf, 'arrdelay',
classification=True,
excluded_columns=['actualelapsedtime','arrtime', 'year'],
validation_ddf=val_ddf)
@arimoinc
Experimental Results
Experiment 1: SF Crimes
@arimoinc
Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y
2015-05-13 23:53:00 WARRANTS WARRANT ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4258 37.7745
2015-05-13 23:53:00
OTHER
OFFENSES
TRAFFICVIOLATION
ARREST
Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4258 37.7745
2015-05-13 23:33:00
OTHER
OFFENSES
TRAFFICVIOLATION
ARREST
Wednesday NORTHERN ARREST, BOOKED
VANNESS AV /
GREENWICH ST
-122.4243 37.8004
Experiment 1: SF Crimes dataset
@arimoinc
Hyper-parameter Type Range
Number of hidden layers INT 1, 2, 3
Number of hidden units INT 64, 128, 256
Dropout at the input layer FLOAT [0, 0.5]
Dropout at the hidden layers FLOAT [0,0.75]
Learning rate FLOAT [0.01, 0.1]
L2 Weight decay FLOAT [0, 0.01]
Logloss on the validation set
Running time (hours) ~ 40 iterations
Experiment 1: SF Crimes dataset
@arimoinc
Hyper-parameter Type Range Spearmint
Number of hidden layers INT 1, 2, 3 2
Number of hidden units INT 64, 128, 256 256
Dropout at the input layer FLOAT [0, 0.5] 0.423678
Dropout at the hidden layers FLOAT [0,0.75] 0.091693
Learning rate FLOAT [0.01, 0.1] 0.025994
L2 Weight decay FLOAT [0, 0.01] 0.00238
Logloss on the validation set 2.1502
Running time (hours) ~ 40 iterations 15.8
Experiment 1: SF Crimes dataset
@arimoinc
2.149
2.152
2.155
2.157
2.16
1 7 17 18 39
2.1600
2.1555 2.1552 2.1551
2.1519
2.1502
Completed jobs 1 7 16 17 18 39
Elapsed time 1021 8003 12742 1623 1983 31561
Number of layers 1 3 3 2 3 2
Hidden units 64 256 256 256 256 256
Learning rate 0.01 0.1 0.01 0.01 0.021 0.026
Input dropout 0 0.5 0.5 0.463 0.5 0.424
Hidden dropout 0 0 0 0.024 0.089 0.092
Weight decay 0 0 0.002 0.003 0 0.002
Experiment #1: With SigOpt
@arimoinc
Hyper-parameter Type Range Spearmint
Number of hidden layers INT 1, 2, 3 2 3
Number of hidden units INT 64, 128, 256 256 256
Dropout at the input layer FLOAT [0, 0.5] 0.423678 0.3141
Dropout at the hidden layers FLOAT [0,0.75] 0.091693 0.0944
Learning rate FLOAT [0.01, 0.1] 0.025994 0.0979
L2 Weight decay FLOAT [0, 0.01] 0.00238 0.0039
Logloss on the validation set 2.1502 2.14892
Running time (hours) ~ 40 iterations 15.8 20.1
SF Crimes - Time to results
@arimoinc
Experiment #2: Airlines data
@arimoinc
Year 1987-2008 DepDelay departure delay, in minutes
Month 1-12 Origin origin IATA airport code
DayofMonth 1-31 Dest destination IATA airport code
DayOfWeek 1 (Monday) - 7 (Sunday) Distance in miles
DepTime actual departure time TaxiIn taxi in time, in minutes
CRSDepTime scheduled departure time TaxiOut taxi out time in minutes
ArrTime actual arrival time Cancelled was the flight cancelled?
CRSArrTime scheduled arrival time CancellationCode reason for cancellation
UniqueCarrier unique carrier code Diverted 1 = yes, 0 = no
FlightNum flight number CarrierDelay in minutes
TailNum plane tail number WeatherDelay in minutes
ActualElapsedTime in minutes NASDelay in minutes
CRSElapsedTime in minutes SecurityDelay in minutes
AirTime in minutes LateAircraftDelay in minutes
ArrDelay arrival delay, in minutes Delayed Is the flight delayed
Experiment #2
@arimoinc
Training set
Val/Test set
Feature preprocessing
~900 features
Truncated SVD
n: # of dims
f-score with target variable
Keep alpha percents top features
K-means
k: # of clusters
RandomForest #1
number of trees
maximum depth
percentage of features
RandomForest #k
…
Scoring
Experiment #2: Hyper-parameters
@arimoinc
Hyperparameter Type Range BayesOpt
Number of SVD dimensions INT [5, 100] 98
Top feature percentage FLOAT [0.1, 1] 0.8258
k (# of clusters) INT [1, 6] 2
Number of trees (RF) INT [50, 500] 327
Max. depth (RF) INT [1, 20] 12
Min. instances per node (RF) INT [1, 1000] 414
F1-score on validation set 0.8736
Summary
@arimoinc
1. Bayesian Optimization for Hyper-parameter Tuning
2. Bayesian Optimization for 

“Automating” Data Science on Spark
3. Experiments
Getting Started
@arimoinc
• Blogpost: http://goo.gl/PFyBKI
• Open-source: spearmint, hyperopt, SMAC, AutoML
• Commercial: Whetlab, SigOpt, …
http://goo.gl/PFyBKI
https://www.arimo.com
@arimoinc @pentagoniac @phvu
CHECK IT OUT!

Weitere ähnliche Inhalte

Was ist angesagt?

GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemDanny Yuan
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using FlinkDongwon Kim
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemC4Media
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING JournalPandey_G
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Flink Forward
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time togetherTed Dunning
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINEDB
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017MLconf
 
Stream Processing Frameworks
Stream Processing FrameworksStream Processing Frameworks
Stream Processing FrameworksSirKetchup
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flinkFlink Forward
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleDataWorks Summit/Hadoop Summit
 
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...Anis Nasir
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with StormMariusz Gil
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...InfluxData
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnJosef A. Habdank
 

Was ist angesagt? (20)

GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
QConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing systemQConSF 2014 talk on Netflix Mantis, a stream processing system
QConSF 2014 talk on Netflix Mantis, a stream processing system
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Real-time driving score service using Flink
Real-time driving score service using FlinkReal-time driving score service using Flink
Real-time driving score service using Flink
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
Super COMPUTING Journal
Super COMPUTING JournalSuper COMPUTING Journal
Super COMPUTING Journal
 
Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School Vasia Kalavri – Training: Gelly School
Vasia Kalavri – Training: Gelly School
 
Real-time and long-time together
Real-time and long-time togetherReal-time and long-time together
Real-time and long-time together
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
 
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
Tianqi Chen, PhD Student, University of Washington, at MLconf Seattle 2017
 
Stream Processing Frameworks
Stream Processing FrameworksStream Processing Frameworks
Stream Processing Frameworks
 
Michael Häusler – Everyday flink
Michael Häusler – Everyday flinkMichael Häusler – Everyday flink
Michael Häusler – Everyday flink
 
Improved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as exampleImproved Reliable Streaming Processing: Apache Storm as example
Improved Reliable Streaming Processing: Apache Storm as example
 
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
The Power of Both Choices: Practical Load Balancing for Distributed Stream Pr...
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Streams processing with Storm
Streams processing with StormStreams processing with Storm
Streams processing with Storm
 
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
Samantha Wang [InfluxData] | Best Practices on How to Transform Your Data Usi...
 
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearnPrediction as a service with ensemble model in SparkML and Python ScikitLearn
Prediction as a service with ensemble model in SparkML and Python ScikitLearn
 

Ähnlich wie Automatic Features Generation And Model Training On Spark: A Bayesian Approach

AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...Amazon Web Services
 
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]The First IoT JSR: Units of Measurement JSR-363 [BOF5981]
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]Leonardo De Moura Rocha Lima
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?Marlon Dumas
 
Automated Verification of an Onboard Mission Planning and Execution System fo...
Automated Verification of an Onboard Mission Planning and Execution System fo...Automated Verification of an Onboard Mission Planning and Execution System fo...
Automated Verification of an Onboard Mission Planning and Execution System fo...Florian-Michael Adolf
 
Automation Anywhere - Imagine New York 2019 - Verizon
Automation Anywhere - Imagine New York 2019 - VerizonAutomation Anywhere - Imagine New York 2019 - Verizon
Automation Anywhere - Imagine New York 2019 - VerizonAutomation Anywhere
 
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Joseph Luchette
 
Latency SLOs done right
Latency SLOs done rightLatency SLOs done right
Latency SLOs done rightFred Moyer
 
How to avoid the latency trap and lessons about software design
How to avoid the latency trap and lessons  about software designHow to avoid the latency trap and lessons  about software design
How to avoid the latency trap and lessons about software designTom Croucher
 
Solve the colocation conundrum: Performance and density at scale with Kubernetes
Solve the colocation conundrum: Performance and density at scale with KubernetesSolve the colocation conundrum: Performance and density at scale with Kubernetes
Solve the colocation conundrum: Performance and density at scale with KubernetesNiklas Quarfot Nielsen
 
Monitoring Clojure Applications with Prometheus
Monitoring Clojure Applications with PrometheusMonitoring Clojure Applications with Prometheus
Monitoring Clojure Applications with PrometheusJoachim Draeger
 
Quantifying the Value of Static Analysis
Quantifying the Value of Static AnalysisQuantifying the Value of Static Analysis
Quantifying the Value of Static AnalysisTechWell
 
Full accesspolicyconsolidation for event processing systems
Full accesspolicyconsolidation for event processing systemsFull accesspolicyconsolidation for event processing systems
Full accesspolicyconsolidation for event processing systemsviswanadhamsatish
 
Sumo Logic QuickStart Webinar
Sumo Logic QuickStart WebinarSumo Logic QuickStart Webinar
Sumo Logic QuickStart WebinarSumo Logic
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...MSAdvAnalytics
 
AEMP Connect 2021 Can AI Solve Construction Telematics Overload Problem? Ode...
AEMP Connect 2021  Can AI Solve Construction Telematics Overload Problem? Ode...AEMP Connect 2021  Can AI Solve Construction Telematics Overload Problem? Ode...
AEMP Connect 2021 Can AI Solve Construction Telematics Overload Problem? Ode...Oded Ran
 
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...Achieving Scalability in Software Testing with Machine Learning and Metaheuri...
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...Lionel Briand
 
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Shay Ginsbourg
 
Storage Consumption and Chargeback
Storage Consumption and ChargebackStorage Consumption and Chargeback
Storage Consumption and Chargebackbrettallison
 
Meetup web scale architecture quantum computing (Part 1 16-10-2018)
Meetup web scale architecture quantum computing (Part 1 16-10-2018)Meetup web scale architecture quantum computing (Part 1 16-10-2018)
Meetup web scale architecture quantum computing (Part 1 16-10-2018)Rolf Huisman
 

Ähnlich wie Automatic Features Generation And Model Training On Spark: A Bayesian Approach (20)

AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
AWS re:Invent 2016: IoT Blueprints: Optimizing Supply for Smart Agriculture f...
 
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]The First IoT JSR: Units of Measurement JSR-363 [BOF5981]
The First IoT JSR: Units of Measurement JSR-363 [BOF5981]
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?
Accurate and Reliable What-If Analysis of Business Processes: Is it Achievable?
 
Automated Verification of an Onboard Mission Planning and Execution System fo...
Automated Verification of an Onboard Mission Planning and Execution System fo...Automated Verification of an Onboard Mission Planning and Execution System fo...
Automated Verification of an Onboard Mission Planning and Execution System fo...
 
Automation Anywhere - Imagine New York 2019 - Verizon
Automation Anywhere - Imagine New York 2019 - VerizonAutomation Anywhere - Imagine New York 2019 - Verizon
Automation Anywhere - Imagine New York 2019 - Verizon
 
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
Unlimited Virtual Computing Capacity using the Cloud for Automated Parameter ...
 
Latency SLOs done right
Latency SLOs done rightLatency SLOs done right
Latency SLOs done right
 
How to avoid the latency trap and lessons about software design
How to avoid the latency trap and lessons  about software designHow to avoid the latency trap and lessons  about software design
How to avoid the latency trap and lessons about software design
 
Solve the colocation conundrum: Performance and density at scale with Kubernetes
Solve the colocation conundrum: Performance and density at scale with KubernetesSolve the colocation conundrum: Performance and density at scale with Kubernetes
Solve the colocation conundrum: Performance and density at scale with Kubernetes
 
Monitoring Clojure Applications with Prometheus
Monitoring Clojure Applications with PrometheusMonitoring Clojure Applications with Prometheus
Monitoring Clojure Applications with Prometheus
 
Quantifying the Value of Static Analysis
Quantifying the Value of Static AnalysisQuantifying the Value of Static Analysis
Quantifying the Value of Static Analysis
 
Full accesspolicyconsolidation for event processing systems
Full accesspolicyconsolidation for event processing systemsFull accesspolicyconsolidation for event processing systems
Full accesspolicyconsolidation for event processing systems
 
Sumo Logic QuickStart Webinar
Sumo Logic QuickStart WebinarSumo Logic QuickStart Webinar
Sumo Logic QuickStart Webinar
 
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
Cortana Analytics Workshop: Real-Time Data Processing -- How Do I Choose the ...
 
AEMP Connect 2021 Can AI Solve Construction Telematics Overload Problem? Ode...
AEMP Connect 2021  Can AI Solve Construction Telematics Overload Problem? Ode...AEMP Connect 2021  Can AI Solve Construction Telematics Overload Problem? Ode...
AEMP Connect 2021 Can AI Solve Construction Telematics Overload Problem? Ode...
 
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...Achieving Scalability in Software Testing with Machine Learning and Metaheuri...
Achieving Scalability in Software Testing with Machine Learning and Metaheuri...
 
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
Ginsbourg.com - Performance and Load Test Report Template LTR 1.2
 
Storage Consumption and Chargeback
Storage Consumption and ChargebackStorage Consumption and Chargeback
Storage Consumption and Chargeback
 
Meetup web scale architecture quantum computing (Part 1 16-10-2018)
Meetup web scale architecture quantum computing (Part 1 16-10-2018)Meetup web scale architecture quantum computing (Part 1 16-10-2018)
Meetup web scale architecture quantum computing (Part 1 16-10-2018)
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制vexqp
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.pptibrahimabdi22
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss ConfederationEfruzAsilolu
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1ranjankumarbehera14
 

Kürzlich hochgeladen (20)

Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
怎样办理圣路易斯大学毕业证(SLU毕业证书)成绩单学校原版复制
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 

Automatic Features Generation And Model Training On Spark: A Bayesian Approach

  • 1. Automating Data Science on Spark: 
 A Bayesian Approach Vu Pham, Huan Dao, Christopher Nguyen San Francisco, June 8th 2016
  • 4. Agenda @arimoinc 1. For Hyper-parameter Tuning 2. For “Automating” Data Science on Spark 3. Experiments
  • 7. Hyper-parameters @arimoinc K-means K Neural Network # of layers, dropout, momentum… Random Forest Feature set, # of trees, max depth… SVM Regularization term (C) Gradient Descent Learning rate, number of iterations…
  • 11. Hyper-parameters tuning @arimoinc U: S R c U(c) Hyper-parameter Space Performance onValidation Set How to intelligently select the next configuration? (Given the observations in the past) Maximize a utility functions over hyper-parameter space:
  • 13. Bayesian Optimization Explained @arimoinc 1. Incorporate a prior over the space of possible objective functions (GP) 2. Combine the prior with the likelihood to obtain a posterior over function values given observations 3. Select next configuration to evaluate based on the posterior • According to an acquisition function
  • 17. Automatic Data Science Workflow @arimoinc Feature Engineering Model Selection Hyperparameter Tuning
  • 18. Machines doing part of Data Science @arimoinc Dataset Feature preprocessing Data Cleaning Generate new features Feature selection Dimensionality reduction Predictive modeling Scoring
  • 19. A generic Machine Learning pipeline @arimoinc Training set Val/Test set Feature preprocessing Truncated SVD n: # of dims f-score with target variable Keep alpha percents top features K-means k: # of clusters RandomForest #1 number of trees maximum depth percentage of features RandomForest #k … Scoring
  • 20. Spark WorkersSpark Workers Pipeline on DDF and BayesOpt @arimoinc Spark Driver DDF Arimo Pipeline Manager … Client Bayesian Optimizer Client uses Bayesian Optimizer to select the hyper-parameters of the pipeline so that it maximizes the performance on a validation set Spark Workers
  • 21. Pipeline on DDF and BayesOpt @arimoinc train_ddf = session.get_ddf(…) valid_ddf = session.get_ddf(…) optimizer = SpearmintOptimizer(chooser_name=‘GPEIperSecChooser', max_finished_jobs=max_iters, grid_size=5000, ..) best_params, trace = auto_model( optimizer, train_ddf, 'arrdelay', classification=True, excluded_columns=['actualelapsedtime','arrtime', 'year'], validation_ddf=val_ddf)
  • 23. Experiment 1: SF Crimes @arimoinc Dates Category Descript DayOfWeek PdDistrict Resolution Address X Y 2015-05-13 23:53:00 WARRANTS WARRANT ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4258 37.7745 2015-05-13 23:53:00 OTHER OFFENSES TRAFFICVIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED OAK ST / LAGUNA ST -122.4258 37.7745 2015-05-13 23:33:00 OTHER OFFENSES TRAFFICVIOLATION ARREST Wednesday NORTHERN ARREST, BOOKED VANNESS AV / GREENWICH ST -122.4243 37.8004
  • 24. Experiment 1: SF Crimes dataset @arimoinc Hyper-parameter Type Range Number of hidden layers INT 1, 2, 3 Number of hidden units INT 64, 128, 256 Dropout at the input layer FLOAT [0, 0.5] Dropout at the hidden layers FLOAT [0,0.75] Learning rate FLOAT [0.01, 0.1] L2 Weight decay FLOAT [0, 0.01] Logloss on the validation set Running time (hours) ~ 40 iterations
  • 25. Experiment 1: SF Crimes dataset @arimoinc Hyper-parameter Type Range Spearmint Number of hidden layers INT 1, 2, 3 2 Number of hidden units INT 64, 128, 256 256 Dropout at the input layer FLOAT [0, 0.5] 0.423678 Dropout at the hidden layers FLOAT [0,0.75] 0.091693 Learning rate FLOAT [0.01, 0.1] 0.025994 L2 Weight decay FLOAT [0, 0.01] 0.00238 Logloss on the validation set 2.1502 Running time (hours) ~ 40 iterations 15.8
  • 26. Experiment 1: SF Crimes dataset @arimoinc 2.149 2.152 2.155 2.157 2.16 1 7 17 18 39 2.1600 2.1555 2.1552 2.1551 2.1519 2.1502 Completed jobs 1 7 16 17 18 39 Elapsed time 1021 8003 12742 1623 1983 31561 Number of layers 1 3 3 2 3 2 Hidden units 64 256 256 256 256 256 Learning rate 0.01 0.1 0.01 0.01 0.021 0.026 Input dropout 0 0.5 0.5 0.463 0.5 0.424 Hidden dropout 0 0 0 0.024 0.089 0.092 Weight decay 0 0 0.002 0.003 0 0.002
  • 27. Experiment #1: With SigOpt @arimoinc Hyper-parameter Type Range Spearmint Number of hidden layers INT 1, 2, 3 2 3 Number of hidden units INT 64, 128, 256 256 256 Dropout at the input layer FLOAT [0, 0.5] 0.423678 0.3141 Dropout at the hidden layers FLOAT [0,0.75] 0.091693 0.0944 Learning rate FLOAT [0.01, 0.1] 0.025994 0.0979 L2 Weight decay FLOAT [0, 0.01] 0.00238 0.0039 Logloss on the validation set 2.1502 2.14892 Running time (hours) ~ 40 iterations 15.8 20.1
  • 28. SF Crimes - Time to results @arimoinc
  • 29. Experiment #2: Airlines data @arimoinc Year 1987-2008 DepDelay departure delay, in minutes Month 1-12 Origin origin IATA airport code DayofMonth 1-31 Dest destination IATA airport code DayOfWeek 1 (Monday) - 7 (Sunday) Distance in miles DepTime actual departure time TaxiIn taxi in time, in minutes CRSDepTime scheduled departure time TaxiOut taxi out time in minutes ArrTime actual arrival time Cancelled was the flight cancelled? CRSArrTime scheduled arrival time CancellationCode reason for cancellation UniqueCarrier unique carrier code Diverted 1 = yes, 0 = no FlightNum flight number CarrierDelay in minutes TailNum plane tail number WeatherDelay in minutes ActualElapsedTime in minutes NASDelay in minutes CRSElapsedTime in minutes SecurityDelay in minutes AirTime in minutes LateAircraftDelay in minutes ArrDelay arrival delay, in minutes Delayed Is the flight delayed
  • 30. Experiment #2 @arimoinc Training set Val/Test set Feature preprocessing ~900 features Truncated SVD n: # of dims f-score with target variable Keep alpha percents top features K-means k: # of clusters RandomForest #1 number of trees maximum depth percentage of features RandomForest #k … Scoring
  • 31. Experiment #2: Hyper-parameters @arimoinc Hyperparameter Type Range BayesOpt Number of SVD dimensions INT [5, 100] 98 Top feature percentage FLOAT [0.1, 1] 0.8258 k (# of clusters) INT [1, 6] 2 Number of trees (RF) INT [50, 500] 327 Max. depth (RF) INT [1, 20] 12 Min. instances per node (RF) INT [1, 1000] 414 F1-score on validation set 0.8736
  • 32. Summary @arimoinc 1. Bayesian Optimization for Hyper-parameter Tuning 2. Bayesian Optimization for 
 “Automating” Data Science on Spark 3. Experiments
  • 33. Getting Started @arimoinc • Blogpost: http://goo.gl/PFyBKI • Open-source: spearmint, hyperopt, SMAC, AutoML • Commercial: Whetlab, SigOpt, …