SlideShare ist ein Scribd-Unternehmen logo
1 von 17
Rise of Scalable Machine Learning
at Yahoo
A n d y F e n g
V P A r c h i t ect ur e , Ya h o o
My Talks @ Hadoop Summit
2
 Storm (2013)
 Spark (2014)
 Machine Learning (2015)
3
Use Case: Search & Advertisement
 Application needs
â€ș Content ranking
â€ș Ad click prediction
â€ș Query-Ads matching
 Machine learning algorithm
â€ș Gradient boosted decision tree
â€ș Logistic regression
â€ș Neural network
Challenge: Scale
4
1. Massive amount of examples
â€ș NaĂŻve solutions take days/weeks
2. Billions of features
â€ș Model exceeds memory limits of 1 computer
3. Variety of algorithms
â€ș Different solutions required for scale-up
Massive Hadoop at Yahoo
5
600 PB
HDFS
43K
Computers
MACHINE
LEARNING
Big-Data ML in Action
6
ML learner
ML server
Map
Reduce
7
Architecture for Scalable ML
 ML Server
â€ș Customized in-memory
stores (Hashmap, Matrix)
‱ Lockless concurrency
‱ Zero garbage created
â€ș Map/Reduce API to move
computing to servers
3 Examples of ML Algorithms
8 Yahoo Confidential & Proprietary
1. Gradient Boosted Decision Tree
â€ș Problem: Training latency
â€ș Solution: Hadoop streaming + MPI
2. Logistic Regression
â€ș Problem: Model size
â€ș Solution: Spark + ML Server
3. Ad-Query Vectors
â€ș Problem: Model size + Training latency
â€ș Solution: Spark + ML Server
Algorithm 1: Gradient Boosted Decision Tree
 Boosting is sequential  Training takes days for
1000s of features
Gradient Boosted Decision Tree: 30x Speed-up
Algorithm 2: Logistic Regression
11
When |ÎČ| > 100B,
â€ș 100 Billion * 16 Bytes = 1.6 TB
â€ș ÎČ exceeds memory limit of 1 computer
Logistic Regression: 1000x Scale-up
12
 Vector: numeric representation
of queries/ads
â€ș Vector(“san jose weather”) ≈ Vector(“weather 95113”)
≈ Vector(ad123)
 Model size
â€ș 1 Billion* 300 dimensions = 2.4TB
 Vector computation (X*Y, aX+Y)
â€ș Took weeks for small datasets
13 * Yahoo Labs: http://bit.ly/1G3f6L2
Algorithm 3: Ad-Query Vectors
Ad-Query2Vec: 100x Speed/Scale-up
14
 Computation on servers
â€ș (1) Negative sampling
â€ș (1) Compute gradient: X*Y
â€ș (3) Adjust vectors: Y=aX+Y
 Daily training enabled
â€ș weeks  hours
 Asynchronous
 Faster
 More data
 Larger model
15
Lesson Learned: Approximate Computing  Better Accuracy


Summary
16
 Scalable machine learning at Yahoo
â€ș critical business: search, advertisement
â€ș daily model training w/ billions of features
 Hadoop/YARN plays a central role
â€ș approximate computing
â€ș CPU + GPU
17
Thank You!
afeng@apache.org

Weitere Àhnliche Inhalte

Was ist angesagt?

Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16MLconf
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovSpark Summit
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark Summit
 
Snorkel: Dark Data and Machine Learning with Christopher RĂ©
Snorkel: Dark Data and Machine Learning with Christopher RĂ©Snorkel: Dark Data and Machine Learning with Christopher RĂ©
Snorkel: Dark Data and Machine Learning with Christopher RĂ©Jen Aman
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Databricks
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
How Machine Learning and AI Can Support the Fight Against COVID-19
How Machine Learning and AI Can Support the Fight Against COVID-19How Machine Learning and AI Can Support the Fight Against COVID-19
How Machine Learning and AI Can Support the Fight Against COVID-19Databricks
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsDatabricks
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersDataWorks Summit
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissSpark Summit
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1Kenta Oono
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Databricks
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016MLconf
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDatabricks
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016MLconf
 

Was ist angesagt? (20)

Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Spark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas DinsmoreSpark and the Future of Advanced Analytics by Thomas Dinsmore
Spark and the Future of Advanced Analytics by Thomas Dinsmore
 
Snorkel: Dark Data and Machine Learning with Christopher RĂ©
Snorkel: Dark Data and Machine Learning with Christopher RĂ©Snorkel: Dark Data and Machine Learning with Christopher RĂ©
Snorkel: Dark Data and Machine Learning with Christopher RĂ©
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
How Machine Learning and AI Can Support the Fight Against COVID-19
How Machine Learning and AI Can Support the Fight Against COVID-19How Machine Learning and AI Can Support the Fight Against COVID-19
How Machine Learning and AI Can Support the Fight Against COVID-19
 
How to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML modelsHow to use Apache TVM to optimize your ML models
How to use Apache TVM to optimize your ML models
 
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark ClustersTensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
Tokyo Webmining Talk1
Tokyo Webmining Talk1Tokyo Webmining Talk1
Tokyo Webmining Talk1
 
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it...
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim DowlingDistributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
Distributed Deep Learning with Apache Spark and TensorFlow with Jim Dowling
 
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
Josh Patterson, Advisor, Skymind – Deep learning for Industry at MLconf ATL 2016
 

Andere mochten auch

Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...Jian Xu
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016Badri Narayan Bhaskar
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...Universitat PolitĂšcnica de Catalunya
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...Spark Summit
 
Kerry Karl | Debunking Myths: GLUTEN
Kerry Karl | Debunking Myths: GLUTENKerry Karl | Debunking Myths: GLUTEN
Kerry Karl | Debunking Myths: GLUTENKerry Karl
 
5 Reasons Why Your Headlines Are On Life Support
5 Reasons Why Your Headlines Are On Life Support5 Reasons Why Your Headlines Are On Life Support
5 Reasons Why Your Headlines Are On Life SupportWishpond
 
Ratpack - SpringOne2GX 2015
Ratpack - SpringOne2GX 2015Ratpack - SpringOne2GX 2015
Ratpack - SpringOne2GX 2015Daniel Woods
 
Getting Open Data Used
Getting Open Data UsedGetting Open Data Used
Getting Open Data UsedAndrew Stott
 
Impacto de las tics en la educaciĂČn
Impacto de las tics en la educaciĂČnImpacto de las tics en la educaciĂČn
Impacto de las tics en la educaciĂČnDarĂŹo Miranda S.A
 
Polyglot Gradle with Node.js and Play
Polyglot Gradle with Node.js and PlayPolyglot Gradle with Node.js and Play
Polyglot Gradle with Node.js and PlayEvgeny Goldin
 
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva online
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva onlineAuktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva online
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva onlineMartin Husovec
 

Andere mochten auch (20)

Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...
Learning, Prediction and Optimization in Real-Time Bidding based Display Adve...
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
End-to-end Speech Recognition with Recurrent Neural Networks (D3L6 Deep Learn...
 
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
High Resolution Energy Modeling that Scales with Apache Spark 2.0 Spark Summi...
 
Kerry Karl | Debunking Myths: GLUTEN
Kerry Karl | Debunking Myths: GLUTENKerry Karl | Debunking Myths: GLUTEN
Kerry Karl | Debunking Myths: GLUTEN
 
5 Reasons Why Your Headlines Are On Life Support
5 Reasons Why Your Headlines Are On Life Support5 Reasons Why Your Headlines Are On Life Support
5 Reasons Why Your Headlines Are On Life Support
 
Brexit Webinar Series 3
Brexit Webinar Series 3Brexit Webinar Series 3
Brexit Webinar Series 3
 
Ratpack - SpringOne2GX 2015
Ratpack - SpringOne2GX 2015Ratpack - SpringOne2GX 2015
Ratpack - SpringOne2GX 2015
 
Getting Open Data Used
Getting Open Data UsedGetting Open Data Used
Getting Open Data Used
 
Impacto de las tics en la educaciĂČn
Impacto de las tics en la educaciĂČnImpacto de las tics en la educaciĂČn
Impacto de las tics en la educaciĂČn
 
Polyglot Gradle with Node.js and Play
Polyglot Gradle with Node.js and PlayPolyglot Gradle with Node.js and Play
Polyglot Gradle with Node.js and Play
 
1 4 vamos a jugar
1 4 vamos a jugar1 4 vamos a jugar
1 4 vamos a jugar
 
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva online
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva onlineAuktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva online
Auktuålne otåzky zodpovednosti za poruƥovanie pråv duƥevného vlastníctva online
 

Ähnlich wie Surge: Rise of Scalable Machine Learning at Yahoo!

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganSpark Summit
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the CloudMapR Technologies
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingDatabricks
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoopgluent.
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305Amazon Web Services
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Matej Misik
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cachesqlserver.co.il
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...Institute of Contemporary Sciences
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create PyData
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917Bill Liu
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningMakoto Yui
 
Machine Learning: How small businesses can enter the race
Machine Learning: How small businesses can enter the raceMachine Learning: How small businesses can enter the race
Machine Learning: How small businesses can enter the raceScaleway
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data TestingQA InfoTech
 
Relevance of time series databases & druid.io
Relevance of time series databases & druid.ioRelevance of time series databases & druid.io
Relevance of time series databases & druid.ioMuniraju V
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNOTu Pham
 
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and moreBig Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and moreAmazon Web Services
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15MLconf
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...Databricks
 
Modern recommender system in large content website
Modern recommender system in large content websiteModern recommender system in large content website
Modern recommender system in large content websiteCyrus Chien-Ching Chiu
 

Ähnlich wie Surge: Rise of Scalable Machine Learning at Yahoo! (20)

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth LoganMulti Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi Model Machine Learning by Maximo Gurmendez and Beth Logan
 
Big Data Lessons from the Cloud
Big Data Lessons from the CloudBig Data Lessons from the Cloud
Big Data Lessons from the Cloud
 
SparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time BiddingSparkML: Easy ML Productization for Real-Time Bidding
SparkML: Easy ML Productization for Real-Time Bidding
 
Gluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with HadoopGluent Extending Enterprise Applications with Hadoop
Gluent Extending Enterprise Applications with Hadoop
 
(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305(CMP305) Deep Learning on AWS Made EasyCmp305
(CMP305) Deep Learning on AWS Made EasyCmp305
 
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
Fast data in times of crisis with GPU accelerated database QikkDB | Business ...
 
Things you can find in the plan cache
Things you can find in the plan cacheThings you can find in the plan cache
Things you can find in the plan cache
 
AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...AutoML for user segmentation: how to match millions of users with hundreds of...
AutoML for user segmentation: how to match millions of users with hundreds of...
 
Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create Danny Bickson - Python based predictive analytics with GraphLab Create
Danny Bickson - Python based predictive analytics with GraphLab Create
 
Toronto meetup 20190917
Toronto meetup 20190917Toronto meetup 20190917
Toronto meetup 20190917
 
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine LearningA Database-Hadoop Hybrid Approach to Scalable Machine Learning
A Database-Hadoop Hybrid Approach to Scalable Machine Learning
 
Machine Learning: How small businesses can enter the race
Machine Learning: How small businesses can enter the raceMachine Learning: How small businesses can enter the race
Machine Learning: How small businesses can enter the race
 
Big Data Testing
Big Data TestingBig Data Testing
Big Data Testing
 
Relevance of time series databases & druid.io
Relevance of time series databases & druid.ioRelevance of time series databases & druid.io
Relevance of time series databases & druid.io
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
 
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and moreBig Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
Big Data & Analytics - Use Cases in Mobile, E-commerce, Media and more
 
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Modern recommender system in large content website
Modern recommender system in large content websiteModern recommender system in large content website
Modern recommender system in large content website
 

Mehr von DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash CourseDataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Mehr von DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

KĂŒrzlich hochgeladen

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

KĂŒrzlich hochgeladen (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Surge: Rise of Scalable Machine Learning at Yahoo!

  • 1. Rise of Scalable Machine Learning at Yahoo A n d y F e n g V P A r c h i t ect ur e , Ya h o o
  • 2. My Talks @ Hadoop Summit 2  Storm (2013)  Spark (2014)  Machine Learning (2015)
  • 3. 3 Use Case: Search & Advertisement  Application needs â€ș Content ranking â€ș Ad click prediction â€ș Query-Ads matching  Machine learning algorithm â€ș Gradient boosted decision tree â€ș Logistic regression â€ș Neural network
  • 4. Challenge: Scale 4 1. Massive amount of examples â€ș NaĂŻve solutions take days/weeks 2. Billions of features â€ș Model exceeds memory limits of 1 computer 3. Variety of algorithms â€ș Different solutions required for scale-up
  • 5. Massive Hadoop at Yahoo 5 600 PB HDFS 43K Computers MACHINE LEARNING
  • 6. Big-Data ML in Action 6 ML learner ML server Map Reduce
  • 7. 7 Architecture for Scalable ML  ML Server â€ș Customized in-memory stores (Hashmap, Matrix) ‱ Lockless concurrency ‱ Zero garbage created â€ș Map/Reduce API to move computing to servers
  • 8. 3 Examples of ML Algorithms 8 Yahoo Confidential & Proprietary 1. Gradient Boosted Decision Tree â€ș Problem: Training latency â€ș Solution: Hadoop streaming + MPI 2. Logistic Regression â€ș Problem: Model size â€ș Solution: Spark + ML Server 3. Ad-Query Vectors â€ș Problem: Model size + Training latency â€ș Solution: Spark + ML Server
  • 9. Algorithm 1: Gradient Boosted Decision Tree  Boosting is sequential  Training takes days for 1000s of features
  • 10. Gradient Boosted Decision Tree: 30x Speed-up
  • 11. Algorithm 2: Logistic Regression 11 When |ÎČ| > 100B, â€ș 100 Billion * 16 Bytes = 1.6 TB â€ș ÎČ exceeds memory limit of 1 computer
  • 13.  Vector: numeric representation of queries/ads â€ș Vector(“san jose weather”) ≈ Vector(“weather 95113”) ≈ Vector(ad123)  Model size â€ș 1 Billion* 300 dimensions = 2.4TB  Vector computation (X*Y, aX+Y) â€ș Took weeks for small datasets 13 * Yahoo Labs: http://bit.ly/1G3f6L2 Algorithm 3: Ad-Query Vectors
  • 14. Ad-Query2Vec: 100x Speed/Scale-up 14  Computation on servers â€ș (1) Negative sampling â€ș (1) Compute gradient: X*Y â€ș (3) Adjust vectors: Y=aX+Y  Daily training enabled â€ș weeks  hours
  • 15.  Asynchronous  Faster  More data  Larger model 15 Lesson Learned: Approximate Computing  Better Accuracy 

  • 16. Summary 16  Scalable machine learning at Yahoo â€ș critical business: search, advertisement â€ș daily model training w/ billions of features  Hadoop/YARN plays a central role â€ș approximate computing â€ș CPU + GPU

Hinweis der Redaktion

  1. Good afternoon. I am Andy Feng from Yahoo. In this talk, I will share our recent effort to enable large scale machine learning on hadoop clusters.
  2. In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing. Last year, I described Yahoo’s effort to bring Spark onto YARN cluster. Today, we should cover our progress on machine learning using YARN clusters. I will cover 3 areas: WHY does Yahoo apply machine learning WHAT challenges we try to address HOW we address them I will wrap the talk with key lessons learned from our experience.
  3. Let’s start with WHY machine learning. Search is one of the key applications for Yahoo. For a user’s search phrase, we construct a result page with organic contents together with ads. To generate result page, we rank contents basedtheir relevance to query terms, match ads against query, and predict the probability of ad click. Several machine learning algorithms are applied in this process: decision tree, logistic regression and neural network.
  4. Machine learning at Yahoo has an scalability challenge. 1st, the # of training examples. In order to produce an accurate machine learning model, Yahoo examines massive amount of training examples. For example, we examine several months of user search activity logs. Typically, we are looking at hundreds billions of training examples. When naïve solutions are applied, the training process could take several weeks. Models that represents what happened weeks ago have limited business value, since they don’t represent the current state of our users and contents. 2nd the # of features. We need to pick up signals from all possible signals. It’s usual for Yahoo to consider billions of features in our model. 3rd, we use variety of algorithms, and different solutions are required for scaling up these algorithms. We want our machine learning algorithms massive scalable.
  5. We believe that Hadoop is an ideal platform for scalable machine learning. Yahoo has one of the largest Hadoop deployments in the world. At the moment we store 600 PB on 43 thousands nodes. In last year, we decide to make hadoop the single best system for running large scale machine learning applications. So lets look into a bit under the hood
  6. At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily. Here is a screenshot from one Hadoop cluster. In addition to various MapReduce jobs, we have a Spark job for machine learning, and a ML server for managing data of ML models.
  7. ML Server Customized in-memory stores (Hashmap, Matrix) Lockless concurrency Zero garbage created Map/Reduce API to move computing to servers To enable approximate computing, we are build machine learning on top of Hadoop, Spark and our machine learning servers. These servers are a YARN application, specfically design for machine learning. All data are stored in memory with customized stores. These stores enables lockless concurrency, and could handle millions operations per second. Our servers were implemented in Java, but creates zero garbage. This enables us to run training consistently with high throughput, without worry about garbage collection. Our API supports asynchronous machine learning and mini-batch. This ensures very fast training by many learners. To minimize data movement, we enable clients to move computing logic to servers. For example, we enable MapReduce operations on servers. As an example, you may want to perform statistic analysis of large models using MapReduce operations. Our servers provides built-in support of Hadoop file systems. You could store your models after each training, and load previoud trained models from HDFS.
  8. Let me share 3 success stories about machine learning algorithms. Our 1st story illustrates how Hadoop and MPI could dramatically reduce training latency. Our 2nd story shows Spark and YARN to enable training of very large machine learning models. Our 3rd story will attack both model size and training latency.
  9. Let’s start with our 1st story. In gradient boosted decision trees, we represent model as a collection of decision trees. Each tree node represents a decision point using one feature. By top-down walk through of these trees, you will reach leaf nodes with numerical values. Adding those numerical values together will be our prediction for a given example. To achieve high accuracy, we tend to have many trees. In Yahoo use cases, we use thousands of trees. To construct such trees, we have to construct trees one-by-one. Within each trees, we have to tree layer-by-layer. For each node, we need to select a best feature and a best value for the split. If you use a single machine, the training process could take several days.
  10. At Yahoo, we developed a GBDT algorithm on top of Hadoop and MPI, and achieve 30X speed-up. More specifically, we use Hadoop Streaming to launch multiple GBDT workers. We partition training examples by columns, instead of rows. Each worker has subset of features for all training examples. During the training, each worker perform local computation to identical best splits for its feature set. We then apply MPI allreduce operation to decide the global best split among all features, and broadcast the best split to all workers. We repeat this process for all tree nodes. At the end, we have a collection of trees as our model. In this approach, we could tens of Hadoop mappers, and fully utilize their computation power. We achieved 30 times speedup for Yahoo use cases. For a training job previously took days, we could now produce decision trees in about 1 hour.
  11. Our 2nd story is about logistic regression. For a given vector X, logistic regression predicts the outcome via a logistic function to the dot product of parameters beta and feature vector. During training phase, we try to find the best parameter beta from training examples. If we have 2 parameters, logistic regression is find a line in 2 dimensional space to best fit our examples. The scalability challenge is around the # of parameters. We want to have produce model with over 100B parameters. Assume that each parameter uses 16 bytes, the storage of our parameters require 1.6 TB. We could not store the model in memory of a single computer.
  12. To enable large models, we decide to use multiple servers in a YARN cluster. Each server keeps a subset of parameters in memory. We launch logistic regression learners as a Spark job on YARN. Each learner will cover a subset of training examples from HDFS. For each example, we will fetch current parameter values from servers, compute gradient, and update servers with latest value. This new architecture enables us to scale up learning 1000 times. Our previous models had at thousands or millions parameters, and our new model now has billions of parameters. All learners are perform learning independently. There is no synchronous across learners at all. Therefore, we could learn from massive amount of training data very quickly. As a result, our model w/ billions of parameters is significantly more precise than our previous models. That has brought us meaningful business impact.
  13. Our 3rd story is related to search query and ads. In this case, we are learning numerical vectors of search queries and ads from user session logs. From these vectors, we will be able to know that query terms “san jose weather” and “weather 95113” are essentially identical. We learn vectors from user’s search sessions. Each search session will have a collection of query terms and ads. Ad/query vectors are learned by applying n-gram techniques. Details of our algorithm is explained in a recent conference paper from yahoo labs. In this use case, we have 2 problems. First, we have billions of query terms. If each vector has 300 dimensions, we will need 2.4 TB memory space for vector storage. That’s way beyond our typical computer today. Vector calculation is very expensive. For each search sessions, we need to perform hundreds vector operations such as multiplication and addition. For a relative small datasets, training could take weeks.
  14. For computing vectors of queries and ads, we use a set of matrix servers on YARN cluster. Each server has a subset of columns of our matrix. These servers has built-in matrix operations such as vector multiplication and addition. We use Spark job to launch multiple learners on a YARN cluster. Each learner will examine a subset of training dataset. To reduce data movement, we conduct majority of computation on servers. For each training example, we let each server to produce negative examples, and calculate gradients locally. Then, our learner calculate a global coefficence based on each server’s partial gradients. Finally, we let each server adjust vectors. This distributed solution enables us to train vectors within a few hours. Remember it took several weeks for such a task previously. 100X speedup using YARN.
  15. From these use cases, we learned one important lesson. That is, big-data approximate computing could produce more accurate models. In all use cases, we use a set of computers to learning from dataset, and produce a mathematical model. We want each learners to conduct their learning as fast as they can. We don’t want any synchronization across learners. We even let learners to overwrite each other in the shared data model. Each execution may produce slightly different result. We are performing approximate computing on YARN. At the end, we produce a mathematical model with large # of parameters. Since this model represent the signals from massive amount of data, our model is more accurate than previous model built from precise computing.
  16. In summary, Yahoo has made significant progress on scalable machine learning. We conduct daily training w/ billions of signals for our critical business such as search and advertisement. Hadoop and YARN are playing a central role for this evolution. In YARN cluster, we built a framework for approximate computing. We are currently exploring both GPU and CPU in a single cluster.