SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
Netflix’s Recommendation
ML Pipeline using
Apache Spark
DB Tsai
Spark Summit East - Feb 8, 2017
At Netflix, we use ML everywhere
Everything is
a Recommendation
Over 80% of what members watch comes
from our recommendations
Recommendations are driven by Machine
Learning Algorithms
Jan 6th, 2016
#NetflixEverywhere
● 93+ Million Members
● 190+ Countries
● 125+ Million streaming hours / day
● 1000 hours of Original content in 2017
● ⅓ of US internet traffic during evenings
Constantly Innovating
through A/B tests
Try an idea offline using historical data to see
if they would have made better
recommendations
If it does, deploy a live A/B test to see if it
performs well in Production
Data Driven
Running an Experiment
Design Experiment
Collect Label Dataset
DeLorean: Offline
Feature Generation
Distributed
Model Training
Parallel training
of individual
models using
different
executors
Compute
Validation Metrics
Model Testing
Choose
best model
Design a New Experiment to Test Out Different Ideas
Good
Metrics
Offline
Experiment
Online
A/B Test
Online
AB Testing
Bad Metrics
Selected
Contexts
We use a standardized data format across
multiple ranking pipelines
This standardized data format is used by
common tooling, libraries, and algorithms
Contexts: The setting for evaluating a set of items (e.g.
tuples of member profiles, country, time, device, etc.)
Items: The elements to be trained on, scored, and/or
ranked (e.g. videos, rows, search entities)
Labels: For supervised learning, this will be the label
(target) for each item
Ranking problems
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
| | |-- features: struct (nullable = false)
| | | |-- feature1: double (nullable = false)
| | | |-- feature2: double (nullable = false)
| | | |-- feature3: double (nullable = false)
DeLorean Data Format a.k.a DMC-12
The nested data structure avoids an expensive
shuffle when ranking
The features are derived from Netflix data or the
output of other trained models
The features are persisted in HIVE using Parquet
Ensemble methods are used to build rankers
Transformer
https://en.wikipedia.org/wiki/Spark_(Transformers)
Transformer takes an input DataFrame and “lazily”
returns an output DataFrame
Item Transformer
● Extends Spark ML’s Transformer
● Accepts DMC-12 DataFrame with contextual
information
● Transforms DataFrame at the item level
Why DataFrame?
Catalyst Optimizations
Up-front Schema Verification
We found a 4x speedup during feature generation
by migrating from RDD-based implementation to
DataFrame implementation
Negative Generator
Facts
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
Facts with synthetic negatives
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
Creating negatives from what
member plays for supervised learning
DeLorean Feature Generator
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
| | |-- features: struct (nullable = false)
| | | |-- feature1: double (nullable = false)
| | | |-- feature2: double (nullable = false)
| | | |-- feature3: double (nullable = false)
Creating features based on common code base
in offline and online system
http://techblog.netflix.com/2016/02/distributed-time-travel-for-feature.html
Creating the Dataset for Algorithms
Multithreading Model Training
For single machine multi-threading algorithms, we
allocate one task to one machine. Multiple tasks
are running in Spark for different parameters
Broadcast in Spark has datasize limitation, we
write data into HDFS, and stream the data into the
trainers in executors which run single-machine
multi-threading algorithms
Distributed Model Training
We use both Spark ML’s algorithms and in-house
ML implementations
We keep the interface similar for both
multi-threading and distributed algorithms, so
experimenters can try different ideas easily
Scoring and Ranking
Scorer is also a Transformer
returned from the Trainer
Multiple models can be scored at
the same time in parallel
The ranks are derived from sorted
scores
Together with labels, we can
compute metrics, NMRR, NDCG,
and Recall, etc
root
|-- profile_id: long (nullable = false)
|-- country_iso_code: string (nullable = false)
|-- items: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- show_title_id: long (nullable = false)
| | |-- label: double (nullable = false)
| | |-- weight: double (nullable = false)
| | |-- features: struct (nullable = false)
| | | |-- feature1: double (nullable = false)
| | | |-- feature2: double (nullable = false)
| | | |-- feature3: double (nullable = false)
| | |-- scores: struct (nullable = false)
| | | |-- model1: double false
| | | |-- model2: double false
Lessons Learned - Pipeline Abstraction
Pros
● Modularity + Tests
● Plug-N-Play
● Notebook Prototyping
● Customizability
● Schema Verification
● Serializability (AC)
Cons
● Dependent on Spark platform
● Not easy to bring to production
● Default Metric Evaluator doesn’t
support ranking multiple models
and type of metrics in one pass
2017 Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East talk by DB Tsai

Weitere Àhnliche Inhalte

Was ist angesagt?

Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkCloudera, Inc.
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniquesmark_landry
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Spark Summit
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalArvind Surve
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissSpark Summit
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce conceptsSubhas Kumar Ghosh
 
Advanced Functional Programming in Scala
Advanced Functional Programming in ScalaAdvanced Functional Programming in Scala
Advanced Functional Programming in ScalaPatrick Nicolas
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Spark Summit
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlibXiangrui Meng
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 

Was ist angesagt? (20)

Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and Beyond
 
Modern classification techniques
Modern classification techniquesModern classification techniques
Modern classification techniques
 
Dplyr packages
Dplyr packagesDplyr packages
Dplyr packages
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul JindalOverview of Apache SystemML by Berthold Reinwald and Nakul Jindal
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Advanced Functional Programming in Scala
Advanced Functional Programming in ScalaAdvanced Functional Programming in Scala
Advanced Functional Programming in Scala
 
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
Ernest: Efficient Performance Prediction for Advanced Analytics on Apache Spa...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Sparse Data Support in MLlib
Sparse Data Support in MLlibSparse Data Support in MLlib
Sparse Data Support in MLlib
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 

Ähnlich wie 2017 Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East talk by DB Tsai

Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedChao Chen
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to productionHerman Wu
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
 
HLC301-Simplifying Healthcare Data Management on AWS.pdf
HLC301-Simplifying Healthcare Data Management on AWS.pdfHLC301-Simplifying Healthcare Data Management on AWS.pdf
HLC301-Simplifying Healthcare Data Management on AWS.pdfAmazon Web Services
 
Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Turi, Inc.
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
Presentation on BornoNet Research Paper and Python Basics
Presentation on BornoNet Research Paper and Python BasicsPresentation on BornoNet Research Paper and Python Basics
Presentation on BornoNet Research Paper and Python BasicsShibbir Ahmed
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Srivatsan Ramanujam
 
Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Michelangelo van Dam
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with pythonKumud Arora
 
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Campmodern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet CampPuppet
 
Functional Programming With Lambdas and Streams in JDK8
 Functional Programming With Lambdas and Streams in JDK8 Functional Programming With Lambdas and Streams in JDK8
Functional Programming With Lambdas and Streams in JDK8IndicThreads
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptxVirajPathania1
 

Ähnlich wie 2017 Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East talk by DB Tsai (20)

Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
MLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reducedMLlib sparkmeetup_8_6_13_final_reduced
MLlib sparkmeetup_8_6_13_final_reduced
 
ML-Ops how to bring your data science to production
ML-Ops  how to bring your data science to productionML-Ops  how to bring your data science to production
ML-Ops how to bring your data science to production
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
HLC301-Simplifying Healthcare Data Management on AWS.pdf
HLC301-Simplifying Healthcare Data Management on AWS.pdfHLC301-Simplifying Healthcare Data Management on AWS.pdf
HLC301-Simplifying Healthcare Data Management on AWS.pdf
 
Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create Conference 2014: Rajat Arya - Deployment with GraphLab Create
Conference 2014: Rajat Arya - Deployment with GraphLab Create
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Presentation on BornoNet Research Paper and Python Basics
Presentation on BornoNet Research Paper and Python BasicsPresentation on BornoNet Research Paper and Python Basics
Presentation on BornoNet Research Paper and Python Basics
 
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
Pivotal Data Labs - Technology and Tools in our Data Scientist's Arsenal
 
Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010Advanced Php - Macq Electronique 2010
Advanced Php - Macq Electronique 2010
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Ml programming with python
Ml programming with pythonMl programming with python
Ml programming with python
 
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Campmodern module development - Ken Barber 2012 Edinburgh Puppet Camp
modern module development - Ken Barber 2012 Edinburgh Puppet Camp
 
Functional Programming With Lambdas and Streams in JDK8
 Functional Programming With Lambdas and Streams in JDK8 Functional Programming With Lambdas and Streams in JDK8
Functional Programming With Lambdas and Streams in JDK8
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 

KĂŒrzlich hochgeladen

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentationvaddepallysandeep122
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROmotivationalword821
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalLionel Briand
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfYashikaSharma391629
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesƁukasz Chruƛciel
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 

KĂŒrzlich hochgeladen (20)

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
How To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTROHow To Manage Restaurant Staff -BTRESTRO
How To Manage Restaurant Staff -BTRESTRO
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Precise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive GoalPrecise and Complete Requirements? An Elusive Goal
Precise and Complete Requirements? An Elusive Goal
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 

2017 Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East talk by DB Tsai

  • 1. Netflix’s Recommendation ML Pipeline using Apache Spark DB Tsai Spark Summit East - Feb 8, 2017
  • 2. At Netflix, we use ML everywhere
  • 4. Over 80% of what members watch comes from our recommendations Recommendations are driven by Machine Learning Algorithms
  • 6. #NetflixEverywhere ● 93+ Million Members ● 190+ Countries ● 125+ Million streaming hours / day ● 1000 hours of Original content in 2017 ● ⅓ of US internet traffic during evenings
  • 8. Try an idea offline using historical data to see if they would have made better recommendations If it does, deploy a live A/B test to see if it performs well in Production Data Driven
  • 9. Running an Experiment Design Experiment Collect Label Dataset DeLorean: Offline Feature Generation Distributed Model Training Parallel training of individual models using different executors Compute Validation Metrics Model Testing Choose best model Design a New Experiment to Test Out Different Ideas Good Metrics Offline Experiment Online A/B Test Online AB Testing Bad Metrics Selected Contexts
  • 10. We use a standardized data format across multiple ranking pipelines This standardized data format is used by common tooling, libraries, and algorithms
  • 11. Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.) Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities) Labels: For supervised learning, this will be the label (target) for each item Ranking problems
  • 12. root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) | | |-- features: struct (nullable = false) | | | |-- feature1: double (nullable = false) | | | |-- feature2: double (nullable = false) | | | |-- feature3: double (nullable = false) DeLorean Data Format a.k.a DMC-12
  • 13. The nested data structure avoids an expensive shuffle when ranking The features are derived from Netflix data or the output of other trained models The features are persisted in HIVE using Parquet Ensemble methods are used to build rankers
  • 15. Transformer takes an input DataFrame and “lazily” returns an output DataFrame Item Transformer ● Extends Spark ML’s Transformer ● Accepts DMC-12 DataFrame with contextual information ● Transforms DataFrame at the item level
  • 16. Why DataFrame? Catalyst Optimizations Up-front Schema Verification We found a 4x speedup during feature generation by migrating from RDD-based implementation to DataFrame implementation
  • 17. Negative Generator Facts root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) Facts with synthetic negatives root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) Creating negatives from what member plays for supervised learning
  • 18. DeLorean Feature Generator root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) | | |-- features: struct (nullable = false) | | | |-- feature1: double (nullable = false) | | | |-- feature2: double (nullable = false) | | | |-- feature3: double (nullable = false) Creating features based on common code base in offline and online system http://techblog.netflix.com/2016/02/distributed-time-travel-for-feature.html
  • 19. Creating the Dataset for Algorithms
  • 20. Multithreading Model Training For single machine multi-threading algorithms, we allocate one task to one machine. Multiple tasks are running in Spark for different parameters Broadcast in Spark has datasize limitation, we write data into HDFS, and stream the data into the trainers in executors which run single-machine multi-threading algorithms
  • 21. Distributed Model Training We use both Spark ML’s algorithms and in-house ML implementations We keep the interface similar for both multi-threading and distributed algorithms, so experimenters can try different ideas easily
  • 22. Scoring and Ranking Scorer is also a Transformer returned from the Trainer Multiple models can be scored at the same time in parallel The ranks are derived from sorted scores Together with labels, we can compute metrics, NMRR, NDCG, and Recall, etc root |-- profile_id: long (nullable = false) |-- country_iso_code: string (nullable = false) |-- items: array (nullable = true) | |-- element: struct (containsNull = false) | | |-- show_title_id: long (nullable = false) | | |-- label: double (nullable = false) | | |-- weight: double (nullable = false) | | |-- features: struct (nullable = false) | | | |-- feature1: double (nullable = false) | | | |-- feature2: double (nullable = false) | | | |-- feature3: double (nullable = false) | | |-- scores: struct (nullable = false) | | | |-- model1: double false | | | |-- model2: double false
  • 23. Lessons Learned - Pipeline Abstraction Pros ● Modularity + Tests ● Plug-N-Play ● Notebook Prototyping ● Customizability ● Schema Verification ● Serializability (AC) Cons ● Dependent on Spark platform ● Not easy to bring to production ● Default Metric Evaluator doesn’t support ranking multiple models and type of metrics in one pass