4. Challenge: Scale
1. Massive amount of examples
► Naïve solutions take days/weeks
2. Billions of features
► Model exceeds memory limits of 1 computer
3. Variety of algorithms
► Different solutions required for scale-up
7. Architecture for Scalable ML
▪ ML Server
► Customized in-memory stores (Hashmap, Matrix)
• Lockless concurrency
• Zero garbage created
► Map/Reduce API to move computing to servers
8. 3 Examples of ML Algorithms
1. Gradient Boosted Decision Tree
► Problem: Training latency
► Solution: Hadoop streaming + MPI
2. Logistic Regression
► Problem: Model size
► Solution: Spark + ML Server
3. Ad-Query Vectors
► Problem: Model size + Training latency
► Solution: Spark + ML Server
9. Algorithm 1: Gradient Boosted Decision Tree
▪ Boosting is sequential
▪ Training takes days for 1000s of features
16. Summary
▪ Scalable machine learning at Yahoo
► critical business: search, advertisement
► daily model training w/ billions of features
▪ Hadoop/YARN plays a central role
► approximate computing
► CPU + GPU
Good afternoon.
I am Andy Feng from Yahoo.
In this talk, I will share our recent effort to enable large-scale machine learning on Hadoop clusters.
In 2013, I talked about Yahoo's adoption of Storm for low-latency processing.
Last year, I described Yahoo's effort to bring Spark onto YARN clusters.
Today, I will cover our progress on machine learning using YARN clusters.
I will cover 3 areas:
WHY Yahoo applies machine learning
WHAT challenges we try to address
HOW we address them
I will wrap up the talk with key lessons learned from our experience.
Let's start with WHY machine learning.
Search is one of the key applications for Yahoo. For a user's search phrase, we construct a result page with organic content together with ads.
To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click.
Several machine learning algorithms are applied in this process: decision trees, logistic regression and neural networks.
Machine learning at Yahoo faces a scalability challenge.
1st, the # of training examples. In order to produce an accurate machine learning model, Yahoo examines a massive amount of training examples. For example, we examine several months of user search activity logs.
Typically, we are looking at hundreds of billions of training examples. When naïve solutions are applied, the training process could take several weeks.
Models that represent what happened weeks ago have limited business value, since they don't represent the current state of our users and content.
2nd, the # of features. We need to pick up signals from all possible sources.
It's usual for Yahoo to consider billions of features in our models.
3rd, we use a variety of algorithms, and different solutions are required for scaling up these algorithms.
We want our machine learning algorithms to be massively scalable.
We believe that Hadoop is an ideal platform for scalable machine learning.
Yahoo has one of the largest Hadoop deployments in the world.
At the moment we store 600 PB on 43 thousand nodes.
Last year, we decided to make Hadoop the single best system for running large-scale machine learning applications.
So let's look a bit under the hood.
At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily.
Here is a screenshot from one Hadoop cluster.
In addition to various MapReduce jobs, we have a Spark job for machine learning, and an ML server for managing the data of ML models.
To enable approximate computing, we build machine learning on top of Hadoop, Spark and our machine learning servers.
These servers are a YARN application, specifically designed for machine learning.
All data are stored in memory with customized stores. These stores enable lockless concurrency, and can handle millions of operations per second.
Our servers are implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection.
Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners.
To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on the servers.
As an example, you may want to perform statistical analysis of large models using MapReduce operations.
Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
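To make that workflow concrete, here is a minimal sketch of how a learner could interact with such a server. The class and method names (MLServerClient, fetch, push, map_reduce) are hypothetical stand-ins rather than Yahoo's actual API, and the "server" is just a local dictionary.

```python
# Hypothetical sketch of a learner talking to an ML server that keeps
# model parameters in an in-memory hashmap; names are illustrative only.

class MLServerClient:
    """Toy stand-in for a remote parameter-server connection."""

    def __init__(self):
        self.store = {}          # in the real system this lives remotely, in memory

    def fetch(self, keys):
        # Pull current parameter values for the requested feature keys.
        return {k: self.store.get(k, 0.0) for k in keys}

    def push(self, updates):
        # Apply updates without global locks (last writer wins).
        for k, delta in updates.items():
            self.store[k] = self.store.get(k, 0.0) + delta

    def map_reduce(self, map_fn, reduce_fn, init):
        # Run analysis (e.g. model statistics) on the server side,
        # so large models never move over the network.
        acc = init
        for k, v in self.store.items():
            acc = reduce_fn(acc, map_fn(k, v))
        return acc


client = MLServerClient()
client.push({"feature:42": 0.1, "feature:7": -0.05})
# e.g. count non-zero parameters without downloading the model
nonzero = client.map_reduce(lambda k, v: 1 if v != 0 else 0, lambda a, b: a + b, 0)
print(nonzero)
```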
Let me share 3 success stories about machine learning algorithms.
Our 1st story illustrates how Hadoop and MPI could dramatically reduce training latency.
Our 2nd story shows how Spark and YARN enable training of very large machine learning models.
Our 3rd story will attack both model size and training latency.
Let's start with our 1st story.
In gradient boosted decision trees, we represent the model as a collection of decision trees. Each tree node represents a decision point using one feature.
By a top-down walk through these trees, you reach leaf nodes with numerical values. Adding those numerical values together gives our prediction for a given example.
To achieve high accuracy, we tend to have many trees. In Yahoo's use cases, we use thousands of trees.
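As a tiny illustration of that prediction step (not Yahoo's implementation), each tree is walked top-down and the leaf values are summed:

```python
# Minimal GBDT prediction sketch: each tree is a nested dict with either a
# split ("feature", "threshold", "left", "right") or a leaf ("value").

def predict_one(trees, example):
    score = 0.0
    for tree in trees:
        node = tree
        while "value" not in node:                 # walk top-down to a leaf
            if example[node["feature"]] < node["threshold"]:
                node = node["left"]
            else:
                node = node["right"]
        score += node["value"]                     # add up the leaf values
    return score

# Two tiny made-up trees; a model at Yahoo's scale uses thousands of them.
trees = [
    {"feature": "age", "threshold": 30,
     "left": {"value": 0.2}, "right": {"value": -0.1}},
    {"feature": "clicks", "threshold": 5,
     "left": {"value": -0.05}, "right": {"value": 0.3}},
]
print(predict_one(trees, {"age": 25, "clicks": 7}))   # 0.2 + 0.3 = 0.5
```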
To construct such trees, we have to build them one by one. Within each tree, we have to build it layer by layer.
For each node, we need to select the best feature and the best value for the split.
If you use a single machine, the training process could take several days.
At Yahoo, we developed a GBDT algorithm on top of Hadoop and MPI, and achieved a 30X speed-up.
More specifically, we use Hadoop Streaming to launch multiple GBDT workers.
We partition training examples by columns, instead of rows.
Each worker has a subset of features for all training examples.
During the training, each worker performs local computation to identify the best splits for its feature set.
We then apply an MPI allreduce operation to decide the global best split among all features,
and broadcast that best split to all workers.
We repeat this process for all tree nodes.
At the end, we have a collection of trees as our model.
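Here is a rough sketch of that column-partitioned split search. It uses mpi4py and fabricated data as illustrative assumptions; the split statistics are heavily simplified, and the object-level allreduce with MPI.MAX simply compares the (gain, feature, threshold) tuples as Python values. It is not Yahoo's production code.

```python
# Rough sketch of column-partitioned GBDT split search with an MPI allreduce.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)
n_examples, n_local_features = 1000, 4           # each worker owns a feature subset
X_local = rng.random((n_examples, n_local_features))
residuals = rng.normal(size=n_examples)          # gradients from previous trees

def best_local_split(X, r):
    """Return (gain, feature_id, threshold) for this worker's features."""
    best = (-np.inf, -1, 0.0)
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], [0.25, 0.5, 0.75]):
            left, right = r[X[:, j] < t], r[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            # variance-reduction style gain (simplified)
            gain = len(left) * left.mean() ** 2 + len(right) * right.mean() ** 2
            if gain > best[0]:
                best = (gain, rank * n_local_features + j, float(t))
    return best

local_best = best_local_split(X_local, residuals)
# Allreduce with MAX picks the globally best split and gives it to every worker.
global_best = comm.allreduce(local_best, op=MPI.MAX)
if rank == 0:
    print("global best split:", global_best)
```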
In this approach, we can use tens of Hadoop mappers, and fully utilize their computation power.
We achieved a 30 times speedup for Yahoo use cases. For a training job that previously took days,
we can now produce decision trees in about 1 hour.
Our 2nd story is about logistic regression.
For a given feature vector X, logistic regression predicts the outcome by applying a logistic function to the dot product of the parameter vector beta and the feature vector.
During the training phase, we try to find the best parameters beta from the training examples.
If we have 2 parameters, logistic regression finds a line in 2-dimensional space that best fits our examples.
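In formula form, the prediction described above is

$$ p(y = 1 \mid x) \;=\; \frac{1}{1 + e^{-\beta \cdot x}} $$

and training searches for the beta that makes these predictions fit the observed outcomes as closely as possible.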
The scalability challenge is around the # of parameters. We want to produce models with over 100B parameters.
Assuming that each parameter uses 16 bytes, storing our parameters requires 1.6 TB.
We could not store the model in memory of a single computer.
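The arithmetic behind that estimate, using the numbers just mentioned:

$$ 10^{11} \text{ parameters} \times 16 \text{ bytes} = 1.6 \times 10^{12} \text{ bytes} = 1.6 \text{ TB} $$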
To enable large models, we decided to use multiple servers in a YARN cluster. Each server keeps a subset of parameters in memory.
We launch logistic regression learners as a Spark job on YARN. Each learner will cover a subset of training examples from HDFS.
For each example, we fetch the current parameter values from the servers, compute the gradient, and update the servers with the latest values.
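A minimal sketch of that learner loop, assuming the parameter servers are faked as a local dictionary and hypothetical fetch/push helpers stand in for the real remote calls:

```python
# Sketch of one learner's loop in the Spark-on-YARN logistic regression job.
import math

servers = {}                      # feature -> parameter value; sharded across servers in reality

def fetch(keys):
    return {k: servers.get(k, 0.0) for k in keys}

def push(updates):
    for k, delta in updates.items():
        servers[k] = servers.get(k, 0.0) + delta   # asynchronous, no locking

def train_partition(examples, learning_rate=0.1):
    """Each Spark task runs this over its partition of training examples."""
    for features, label in examples:               # features: dict of feature -> value
        beta = fetch(features.keys())              # pull only the parameters we touch
        score = sum(beta[k] * v for k, v in features.items())
        prediction = 1.0 / (1.0 + math.exp(-score))
        error = label - prediction
        push({k: learning_rate * error * v for k, v in features.items()})

# Toy partition: (sparse features, click / no-click label)
train_partition([({"query:weather": 1.0, "ad:umbrella": 1.0}, 1),
                 ({"query:weather": 1.0, "ad:car": 1.0}, 0)])
print(servers)
```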
This new architecture enables us to scale up learning 1000 times.
Our previous models had thousands or millions of parameters, and our new models now have billions of parameters.
All learners perform learning independently. There is no synchronization across learners at all.
Therefore, we can learn from a massive amount of training data very quickly.
As a result, our model w/ billions of parameters is significantly more precise than our previous models.
That has brought us meaningful business impact.
Our 3rd story is related to search query and ads.
In this case, we are learning numerical vectors of search queries and ads from user session logs.
From these vectors, we will be able to know that query terms âsan jose weatherâ and âweather 95113â are essentially identical.
We learn vectors from users' search sessions. Each search session will have a collection of query terms and ads.
Ad/query vectors are learned by applying n-gram techniques. Details of our algorithm are explained in a recent conference paper from Yahoo Labs.
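As a toy illustration of what these vectors buy us (the vectors below are made up, not learned), near-identical queries end up close under cosine similarity:

```python
# Queries that mean the same thing end up close in vector space.
import numpy as np

vectors = {
    "san jose weather": np.array([0.90, 0.10, 0.40]),
    "weather 95113":    np.array([0.85, 0.12, 0.42]),
    "cheap flights":    np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["san jose weather"], vectors["weather 95113"]))  # ~1.0: near-identical
print(cosine(vectors["san jose weather"], vectors["cheap flights"]))  # much lower
```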
In this use case, we have 2 problems.
First, we have billions of query terms. If each vector has 300 dimensions, we will need 2.4 TB of memory for vector storage. That's way beyond a typical computer today.
Second, vector calculation is very expensive. For each search session, we need to perform hundreds of vector operations such as multiplications and additions. Even for a relatively small dataset, training could take weeks.
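Assuming roughly 2 billion query/ad terms stored as 4-byte floats (both are assumptions on my part; only the 300 dimensions and the 2.4 TB total are given above), the memory estimate works out as:

$$ 2 \times 10^{9} \text{ terms} \times 300 \text{ dims} \times 4 \text{ bytes} = 2.4 \times 10^{12} \text{ bytes} = 2.4 \text{ TB} $$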
For computing vectors of queries and ads, we use a set of matrix servers on a YARN cluster.
Each server has a subset of the columns of our matrix.
These servers have built-in matrix operations such as vector multiplication and addition.
We use a Spark job to launch multiple learners on a YARN cluster.
Each learner will examine a subset of training dataset.
To reduce data movement, we conduct the majority of the computation on the servers.
For each training example,
we let each server produce negative examples, and calculate gradients locally.
Then, our learner calculates a global coefficient based on each server's partial gradients.
Finally, we let each server adjust its vectors.
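A simplified sketch of that protocol, with the remote matrix servers faked as local objects and the server-side negative sampling omitted for brevity; the class and method names are illustrative assumptions, not the real system:

```python
# Each "server" holds a slice of the vector dimensions, computes its partial
# dot product locally, the learner turns the summed result into a global
# coefficient, and every server updates its own slice.
import numpy as np

DIM, SHARDS = 300, 3

class MatrixShard:
    """Holds dimensions [lo:hi) of every query/ad vector."""
    def __init__(self, lo, hi, vocab):
        rng = np.random.default_rng(lo)
        self.rows = {w: rng.normal(scale=0.1, size=hi - lo) for w in vocab}

    def partial_dot(self, q, a):
        return float(self.rows[q] @ self.rows[a])

    def update(self, q, a, coeff, lr=0.05):
        # Gradient step on this shard's slice only.
        gq = coeff * self.rows[a]
        ga = coeff * self.rows[q]
        self.rows[q] += lr * gq
        self.rows[a] += lr * ga

vocab = ["san jose weather", "ad:umbrella"]
bounds = np.linspace(0, DIM, SHARDS + 1, dtype=int)
shards = [MatrixShard(lo, hi, vocab) for lo, hi in zip(bounds[:-1], bounds[1:])]

def train_pair(q, a, label):
    score = sum(s.partial_dot(q, a) for s in shards)    # partial results, summed by the learner
    coeff = label - 1.0 / (1.0 + np.exp(-score))        # single global coefficient
    for s in shards:
        s.update(q, a, coeff)                            # each server adjusts its slice

train_pair("san jose weather", "ad:umbrella", label=1)
```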
This distributed solution enables us to train vectors within a few hours.
Remember that such a task previously took several weeks.
That is a 100X speedup using YARN.
From these use cases, we learned one important lesson. That is, big-data approximate computing can produce more accurate models.
In all use cases, we use a set of computers to learn from a dataset, and produce a mathematical model.
We want each learner to conduct its learning as fast as it can. We don't want any synchronization across learners.
We even let learners overwrite each other's updates in the shared model.
Each execution may produce slightly different result.
We are performing approximate computing on YARN.
At the end, we produce a mathematical model with a large # of parameters.
Since this model represents the signals from a massive amount of data,
our model is more accurate than previous models built with precise computing.
In summary, Yahoo has made significant progress on scalable machine learning.
We conduct daily training w/ billions of signals for our critical business such as search and advertisement.
Hadoop and YARN are playing a central role in this evolution. On the YARN cluster, we built a framework for approximate computing.
We are currently exploring both GPU and CPU in a single cluster.