SlideShare ist ein Scribd-Unternehmen logo
1 von 25
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 1
Scalable Machine Learning
- or -
what to do with all that Big Data infrastructure
Mikio L. Braun
TU Berlin
blog.mikiobraun.de
@mikiobraun
Strata+Hadoop World
London, 2015
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 2
Complex Data Analysis at Scale
● Click-through prediction
● Personalized Spam Detection
● Image Classification
● Recommendation
● ...
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 3
Click Through Prediction
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 4
Personalized Spam Detection
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 5
Supervised Learning in a Nutshell
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 6
Large Scale Learning
● 100 millions of examples
● Millions of (potential) features
● Linear models: Learn one weight per feature
● Deep learning: Complex models with several
stages of processing
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 7
ML Pipeline
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 8
Easily parallelized
● Data preparation, exactraction, normalization
● Parallel runs of ML pipelines for evaluation
● Mass application of data sets
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 9
What about the training step?
● Optimization problems
● Sometimes closed-form solutions (→ matrix
computations)
● Most often not, or it is slow
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 10
Gradient Descent
● Very generic way to optimize functions
● Update parameters along line of steepest
descent of cost function
e.g. logistic regression
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 11
Gradient Descent in Parallel
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 12
Stochastic Gradient Descent
● #1 workhorse for large scale learning
● Gradient Descent = Error on whole data set
● Stochastic Gradient Descent =
consider one point at a time
1. predict on one point
2. compute gradient for that one point
3. update parameters
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 13
Why can we do this?
● Learning is
about prediction
performance on
future data
● Lots of freedom
to find
alternative
algorithms
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 14
Ad Click Prediction at Google
● Variant of SGD
● With tons of hacks: 16-bit fixed point, sharing memory, etc.
● No parallelization!
SIGKDD '13
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 15
SGD is hard to parallelize
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 16
Parameter Servers
Dean et al. “Large Scale Distributed Deep Networks”, NIPS 2012
Li et al. “Communication Efficient Distributed Machine Learning with the Parameter
Server”, NIPS 2014
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 17
Parallelized vs. Distributed Search
● Faster
optimization by
parallel
searches
● Randomized
updates lead
to overall
better
performance
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 18
Approximation is key
● Goal is good predictions on future data
● You can solve an easier problem instead
→ not just for the learning algorithms!
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 19
Approximating Features with
Hashing
● Millions of features, but often very sparse
● Use random projections to “compress” feature
space
Attenberg et al. “Collaborative Email-Spam Filtering with the Hashing-Trick” CEAS 2009
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 21
Counting: Count Min-Sketch
● update: hash value, increase counters
● query: hash value, take minimum
Cormode, Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and
its Applications". J. Algorithms 2004
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 22
Clustering with Count Min-Sketches
Aggarwal et al. “A Framework for Clustering Massive-Domain Data Streams”, ICDE 2009
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 24
Scalable ML is...
● NOT: learning algorithms + parallelization
● INSTEAD:
– faster learning algorithms
– approximative optimization
– feature hashing
– approximative counting
– asynchronous distributed search
– AND parallelization
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 25
So where are we with Spark and
friends
● Original MR paper: Focus on simple stuff
● Distributed Collections & Dataflow the right
abstraction?
● Distributed Matrices?
● Who tells us what the right approximation is?
● New mode of asynchronous, distributed,
approximative computation?
● OR go with something which is slower but easier
to parallelize?
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 26
Summary
● Large scale data sets: billions of examples,
millions of features
● A lot of the ML pipeline (feature extraction, data
preparation, evaluation) is parallelizable and
should be!
● Large scale learning algorithms are harder to
parallelize
● Approximation, asynchronous distributed
computing, etc.
May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 27
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
maria.grineva
 
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at ZendeskThe More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
Databricks
 

Was ist angesagt? (19)

Magellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram SriharshaMagellen: Geospatial Analytics on Spark by Ram Sriharsha
Magellen: Geospatial Analytics on Spark by Ram Sriharsha
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Data Analytics - Real Time Trending
Data Analytics -  Real Time TrendingData Analytics -  Real Time Trending
Data Analytics - Real Time Trending
 
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at ZendeskThe More the Merrier: Scaling Model Building Infrastructure at Zendesk
The More the Merrier: Scaling Model Building Infrastructure at Zendesk
 
Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Graphalytics: A big data benchmark for graph processing platforms
Graphalytics: A big data benchmark for graph processing platformsGraphalytics: A big data benchmark for graph processing platforms
Graphalytics: A big data benchmark for graph processing platforms
 
Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018Graph Databases and Machine Learning | November 2018
Graph Databases and Machine Learning | November 2018
 
MLSD18. Summary of Morning Sessions
MLSD18. Summary of Morning SessionsMLSD18. Summary of Morning Sessions
MLSD18. Summary of Morning Sessions
 
Introduction to the IBM Watson Data Platform
Introduction to the IBM Watson Data PlatformIntroduction to the IBM Watson Data Platform
Introduction to the IBM Watson Data Platform
 
PowerStream: Propelling Energy Innovation with Predictive Analytics
PowerStream: Propelling Energy Innovation with Predictive Analytics PowerStream: Propelling Energy Innovation with Predictive Analytics
PowerStream: Propelling Energy Innovation with Predictive Analytics
 
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real WorldWSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
WSO2Con USA 2015: Patterns for Deploying Analytics in the Real World
 
Building a Graph-based Analytics Platform
Building a Graph-based Analytics PlatformBuilding a Graph-based Analytics Platform
Building a Graph-based Analytics Platform
 
Big Data - part 5/7 of "7 modern trends that every IT Pro should know about"
Big Data - part 5/7 of "7 modern trends that every IT Pro should know about"Big Data - part 5/7 of "7 modern trends that every IT Pro should know about"
Big Data - part 5/7 of "7 modern trends that every IT Pro should know about"
 
MLSD18. Basic Transformations - QCRI
MLSD18. Basic Transformations - QCRIMLSD18. Basic Transformations - QCRI
MLSD18. Basic Transformations - QCRI
 
IP EXPO Europe: Data Science in the Cloud
IP EXPO Europe: Data Science in the CloudIP EXPO Europe: Data Science in the Cloud
IP EXPO Europe: Data Science in the Cloud
 
Saving Human Lives with the IoT
Saving Human Lives with the IoTSaving Human Lives with the IoT
Saving Human Lives with the IoT
 
MLSD18. Basic Transformations - BigML
MLSD18. Basic Transformations - BigMLMLSD18. Basic Transformations - BigML
MLSD18. Basic Transformations - BigML
 
WeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew ReyesWeCloudData Toronto Open311 Workshop - Matthew Reyes
WeCloudData Toronto Open311 Workshop - Matthew Reyes
 

Andere mochten auch

Fault Detection and Failure Prediction Using Vibration Analysis
Fault Detection and Failure Prediction Using Vibration AnalysisFault Detection and Failure Prediction Using Vibration Analysis
Fault Detection and Failure Prediction Using Vibration Analysis
Tristan Plante
 

Andere mochten auch (20)

How Machine Learning Works for Business
How Machine Learning Works for BusinessHow Machine Learning Works for Business
How Machine Learning Works for Business
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Performance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and CassandraPerformance is not an Option - gRPC and Cassandra
Performance is not an Option - gRPC and Cassandra
 
The adoption of machine learning techniques for software defect prediction: A...
The adoption of machine learning techniques for software defect prediction: A...The adoption of machine learning techniques for software defect prediction: A...
The adoption of machine learning techniques for software defect prediction: A...
 
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
Scaling Machine Learning to Billions of Parameters - Spark Summit 2016
 
Machine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning SystemsMachine Duping 101: Pwning Deep Learning Systems
Machine Duping 101: Pwning Deep Learning Systems
 
Survey on Software Defect Prediction
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect Prediction
 
Enabling Googley microservices with HTTP/2 and gRPC.
Enabling Googley microservices with HTTP/2 and gRPC.Enabling Googley microservices with HTTP/2 and gRPC.
Enabling Googley microservices with HTTP/2 and gRPC.
 
Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1Aeroprobing A.I. Drone with TX1
Aeroprobing A.I. Drone with TX1
 
A survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithmsA survey of fault prediction using machine learning algorithms
A survey of fault prediction using machine learning algorithms
 
Just in-time inventory system
Just in-time inventory systemJust in-time inventory system
Just in-time inventory system
 
Fault Detection and Failure Prediction Using Vibration Analysis
Fault Detection and Failure Prediction Using Vibration AnalysisFault Detection and Failure Prediction Using Vibration Analysis
Fault Detection and Failure Prediction Using Vibration Analysis
 
Robert Kubis - gRPC - boilerplate to high-performance scalable APIs - code.t...
 Robert Kubis - gRPC - boilerplate to high-performance scalable APIs - code.t... Robert Kubis - gRPC - boilerplate to high-performance scalable APIs - code.t...
Robert Kubis - gRPC - boilerplate to high-performance scalable APIs - code.t...
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
Big-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunitiesBig-data analytics: challenges and opportunities
Big-data analytics: challenges and opportunities
 
Just In Time (JIT) Systems
Just In Time (JIT) SystemsJust In Time (JIT) Systems
Just In Time (JIT) Systems
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
 

Ähnlich wie Scalable Machine Learning

Ähnlich wie Scalable Machine Learning (20)

Realtime Data Analysis Patterns
Realtime Data Analysis PatternsRealtime Data Analysis Patterns
Realtime Data Analysis Patterns
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
 
Gospel - High-performance heterogeneous architectures for graph analytics
 Gospel - High-performance heterogeneous architectures for graph analytics Gospel - High-performance heterogeneous architectures for graph analytics
Gospel - High-performance heterogeneous architectures for graph analytics
 
A few Challenges to Make Machine Learning Easy
A few Challenges to Make Machine Learning EasyA few Challenges to Make Machine Learning Easy
A few Challenges to Make Machine Learning Easy
 
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
ODSC18, London, How to build high performing weighted XGBoost ML Model for Re...
 
A few questions about large scale machine learning
A few questions about large scale machine learningA few questions about large scale machine learning
A few questions about large scale machine learning
 
Google Cloud: Next'19 Extended Hanoi
Google Cloud: Next'19 Extended HanoiGoogle Cloud: Next'19 Extended Hanoi
Google Cloud: Next'19 Extended Hanoi
 
Crowd Surfing Tweets by Kivanc Yazan
Crowd Surfing Tweets by Kivanc YazanCrowd Surfing Tweets by Kivanc Yazan
Crowd Surfing Tweets by Kivanc Yazan
 
BigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQLBigQuery ML - Machine learning at scale using SQL
BigQuery ML - Machine learning at scale using SQL
 
Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)Scaling Recommendations at Quora (RecSys talk 9/16/2016)
Scaling Recommendations at Quora (RecSys talk 9/16/2016)
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)
 
From DevOps to MLOps: practical steps for a smooth transition
From DevOps to MLOps: practical steps for a smooth transitionFrom DevOps to MLOps: practical steps for a smooth transition
From DevOps to MLOps: practical steps for a smooth transition
 
Aws autopilot
Aws autopilotAws autopilot
Aws autopilot
 
Machine Learning in Big Data
Machine Learning in Big DataMachine Learning in Big Data
Machine Learning in Big Data
 
Machine Learning for Non-Technical People - Turing Fest 2019
Machine Learning for Non-Technical People - Turing Fest 2019Machine Learning for Non-Technical People - Turing Fest 2019
Machine Learning for Non-Technical People - Turing Fest 2019
 
Big Data Analytics is not something which was just invented yesterday!
Big Data Analytics is not something which was just invented yesterday!Big Data Analytics is not something which was just invented yesterday!
Big Data Analytics is not something which was just invented yesterday!
 
Life of a data engineer
Life of a data engineerLife of a data engineer
Life of a data engineer
 
Sakshi sharma resume
Sakshi sharma resumeSakshi sharma resume
Sakshi sharma resume
 
Machine Learning with Data Science Online Course | Learn and Build
 Machine Learning with Data Science Online Course | Learn and Build  Machine Learning with Data Science Online Course | Learn and Build
Machine Learning with Data Science Online Course | Learn and Build
 

Mehr von Mikio L. Braun

Mehr von Mikio L. Braun (7)

Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020Bringing ML To Production, What Is Missing? AMLD 2020
Bringing ML To Production, What Is Missing? AMLD 2020
 
Academia to industry looking back on a decade of ml
Academia to industry looking back on a decade of mlAcademia to industry looking back on a decade of ml
Academia to industry looking back on a decade of ml
 
Architecting AI Applications
Architecting AI ApplicationsArchitecting AI Applications
Architecting AI Applications
 
Hardcore Data Science - in Practice
Hardcore Data Science - in PracticeHardcore Data Science - in Practice
Hardcore Data Science - in Practice
 
Data flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into FlinkData flow vs. procedural programming: How to put your algorithms into Flink
Data flow vs. procedural programming: How to put your algorithms into Flink
 
Cassandra - An Introduction
Cassandra - An IntroductionCassandra - An Introduction
Cassandra - An Introduction
 
Cassandra - Eine Einführung
Cassandra - Eine EinführungCassandra - Eine Einführung
Cassandra - Eine Einführung
 

Kürzlich hochgeladen

%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
VictoriaMetrics
 

Kürzlich hochgeladen (20)

Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With SimplicityWSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
WSO2Con2024 - Enabling Transactional System's Exponential Growth With Simplicity
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 

Scalable Machine Learning

  • 1. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 1 Scalable Machine Learning - or - what to do with all that Big Data infrastructure Mikio L. Braun TU Berlin blog.mikiobraun.de @mikiobraun Strata+Hadoop World London, 2015
  • 2. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 2 Complex Data Analysis at Scale ● Click-through prediction ● Personalized Spam Detection ● Image Classification ● Recommendation ● ...
  • 3. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 3 Click Through Prediction
  • 4. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 4 Personalized Spam Detection
  • 5. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 5 Supervised Learning in a Nutshell
  • 6. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 6 Large Scale Learning ● 100 millions of examples ● Millions of (potential) features ● Linear models: Learn one weight per feature ● Deep learning: Complex models with several stages of processing
  • 7. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 7 ML Pipeline
  • 8. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 8 Easily parallelized ● Data preparation, exactraction, normalization ● Parallel runs of ML pipelines for evaluation ● Mass application of data sets
  • 9. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 9 What about the training step? ● Optimization problems ● Sometimes closed-form solutions (→ matrix computations) ● Most often not, or it is slow
  • 10. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 10 Gradient Descent ● Very generic way to optimize functions ● Update parameters along line of steepest descent of cost function e.g. logistic regression
  • 11. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 11 Gradient Descent in Parallel
  • 12. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 12 Stochastic Gradient Descent ● #1 workhorse for large scale learning ● Gradient Descent = Error on whole data set ● Stochastic Gradient Descent = consider one point at a time 1. predict on one point 2. compute gradient for that one point 3. update parameters
  • 13. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 13 Why can we do this? ● Learning is about prediction performance on future data ● Lots of freedom to find alternative algorithms
  • 14. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 14 Ad Click Prediction at Google ● Variant of SGD ● With tons of hacks: 16-bit fixed point, sharing memory, etc. ● No parallelization! SIGKDD '13
  • 15. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 15 SGD is hard to parallelize
  • 16. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 16 Parameter Servers Dean et al. “Large Scale Distributed Deep Networks”, NIPS 2012 Li et al. “Communication Efficient Distributed Machine Learning with the Parameter Server”, NIPS 2014
  • 17. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 17 Parallelized vs. Distributed Search ● Faster optimization by parallel searches ● Randomized updates lead to overall better performance
  • 18. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 18 Approximation is key ● Goal is good predictions on future data ● You can solve an easier problem instead → not just for the learning algorithms!
  • 19. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 19 Approximating Features with Hashing ● Millions of features, but often very sparse ● Use random projections to “compress” feature space Attenberg et al. “Collaborative Email-Spam Filtering with the Hashing-Trick” CEAS 2009
  • 20. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 21 Counting: Count Min-Sketch ● update: hash value, increase counters ● query: hash value, take minimum Cormode, Muthukrishnan, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 2004
  • 21. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 22 Clustering with Count Min-Sketches Aggarwal et al. “A Framework for Clustering Massive-Domain Data Streams”, ICDE 2009
  • 22. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 24 Scalable ML is... ● NOT: learning algorithms + parallelization ● INSTEAD: – faster learning algorithms – approximative optimization – feature hashing – approximative counting – asynchronous distributed search – AND parallelization
  • 23. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 25 So where are we with Spark and friends ● Original MR paper: Focus on simple stuff ● Distributed Collections & Dataflow the right abstraction? ● Distributed Matrices? ● Who tells us what the right approximation is? ● New mode of asynchronous, distributed, approximative computation? ● OR go with something which is slower but easier to parallelize?
  • 24. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 26 Summary ● Large scale data sets: billions of examples, millions of features ● A lot of the ML pipeline (feature extraction, data preparation, evaluation) is parallelizable and should be! ● Large scale learning algorithms are harder to parallelize ● Approximation, asynchronous distributed computing, etc.
  • 25. May 07, 2015Scalable Machine Learning Mikio L. Braun @mikiobraun 27 Thank you!