Large scale item recommendations with Apache Giraph
Maja Kabiljo & Aleksandar Ilic, Facebook
01 Motivation and challenges
02 Apache Giraph
03 Distributing Collaborative Filtering
04 Social signals and Applications
05 Conclusion
Collaborative Filtering
Predict a user's interests based on those of many other users
[Figure: example interests: Disney, Roller coasters, Disneyland, Six Flags]
Matrix factorization
[Figure: sparse user-item rating matrix; rows are users U1 to U5, columns are items I1 to I5, known ratings range from 1 to 5, unknown entries are marked ?]
Basic form
Objective function
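In the standard matrix-factorization formulation (assumed here; the symbols p_u, q_i, lambda and kappa are illustrative and the talk's exact notation may differ), each user u and each item i gets an f-dimensional latent vector, a rating is predicted as their dot product, and training minimizes the regularized squared error over the set kappa of known ratings:

```latex
\hat{r}_{ui} = p_u^{\top} q_i,
\qquad
\min_{P,\,Q} \sum_{(u,i)\in\kappa}
  \left[ \left(r_{ui} - p_u^{\top} q_i\right)^2
  + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]
```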
Two iterative approaches:
•Stochastic Gradient Descent
•Alternating Least Squares
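A minimal single-machine sketch of the SGD variant, assuming the usual regularized squared-error objective. The function names, the learning rate `lr`, the regularization weight `reg` and the initialization range are illustrative choices, not values from the talk.

```python
import math
import random


def sgd_matrix_factorization(ratings, num_users, num_items, f=8,
                             lr=0.02, reg=0.05, epochs=200, seed=0):
    """Factorize a sparse list of (user, item, rating) triples with plain SGD.

    Returns (P, Q): one f-dimensional vector per user and per item.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(num_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(f)] for _ in range(num_items)]
    data = list(ratings)
    for _ in range(epochs):
        rng.shuffle(data)  # visit known ratings in random order
        for u, i, r in data:
            err = r - sum(a * b for a, b in zip(P[u], Q[i]))
            for k in range(f):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the squared error plus L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the predictions on the given ratings."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return math.sqrt(se / len(ratings))
```

ALS would replace the inner per-rating step with a closed-form least-squares solve for all user vectors while the item vectors are held fixed, and then the reverse.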
Challenges
Scale
•100s of billions of (user, item) pairs
•Over a billion users
•Tens of millions of items
Performance
•Train models and iterate quickly
•Use more features
Training / testing metrics example
[Figure: RMSE (0 to 0.8) against iterations (0 to 44); curves: Train f=8, Test f=8, Train f=128, Test f=128]
Apache Giraph
What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
[Figure: example graph with numeric vertex and edge values (10, 5, 1, 3); labels: Neural networks, Logistic regression, Boosted decision]
Runs on top of Hadoop
Map-only jobs
Keeps data in memory
Mappers communicate through the network
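Giraph follows the vertex-centric, bulk synchronous model: in each superstep a vertex reads the messages sent to it in the previous superstep, updates its value, and sends messages along its edges. The toy below imitates that loop in a single Python process by propagating the maximum vertex value. It is a sketch of the programming model only, not Giraph's Java API, and `run_supersteps` is a made-up name.

```python
def run_supersteps(values, edges, max_supersteps=30):
    """Propagate the maximum vertex value through a graph, superstep by superstep.

    values: dict vertex_id -> numeric value
    edges:  dict vertex_id -> list of neighbour ids (out-edges)
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)
    for _ in range(max_supersteps):
        outbox = {v: [] for v in values}
        any_active = False
        for v in values:
            # A vertex only acts when a received message improves its value.
            if inbox[v] and max(inbox[v]) > values[v]:
                values[v] = max(inbox[v])
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
                any_active = True
        if not any_active:
            break  # no vertex changed, so the computation halts
        inbox = outbox  # messages become visible in the next superstep
    return values
```

On a connected graph every vertex ends up holding the global maximum, and the loop stops as soon as a superstep changes nothing.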
Giraph workflow
[Figure: workflow across Worker 1, Worker 2 and Worker 3]
Distributing Collaborative Filtering
Common approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD updates differ from those in the sequential solution
[Figure: item vertices I1 to I4 distributed across Worker 1, Worker 2 and Worker 3]
Our solution - rotational approach
Users are vertices, items are worker data
[Figure: item set 1, item set 2 and item set 3 rotating among Worker 1, Worker 2 and Worker 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
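One way to read the rotational approach: users stay on their worker, items are split into as many item sets as there are workers, and each item set visits every worker once per iteration. The single-process sketch below simulates one such iteration with SGD updates. The modulo partitioning, the function name and the hyperparameters are assumptions made for illustration.

```python
from collections import defaultdict


def rotational_epoch(ratings, P, Q, num_workers, lr=0.02, reg=0.05):
    """Simulate one rotational pass over (user, item, rating) triples.

    Assumed partitioning: user u lives on worker u % num_workers and
    item i belongs to item set i % num_workers.
    Updates P and Q in place and returns the number of ratings processed.
    """
    # Group ratings by (worker that owns the user, item set of the item).
    buckets = defaultdict(list)
    for u, i, r in ratings:
        buckets[(u % num_workers, i % num_workers)].append((u, i, r))

    processed = 0
    for step in range(num_workers):
        # In a given step each worker holds a different item set, so the
        # workers touch disjoint users and disjoint items.
        for worker in range(num_workers):
            item_set = (worker + step) % num_workers
            for u, i, r in buckets[(worker, item_set)]:
                err = r - sum(a * b for a, b in zip(P[u], Q[i]))
                for k in range(len(P[u])):
                    pu, qi = P[u][k], Q[i][k]
                    P[u][k] += lr * (err * qi - reg * pu)
                    Q[i][k] += lr * (err * pu - reg * qi)
                processed += 1
        # In a real cluster each worker would now pass its item set on.
    return processed
```

After `num_workers` steps every known rating has been used exactly once, and only item vectors, never per-rating messages, would have crossed the network.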
Performance
Comparison with Spark MLlib
Spark MLlib ALS CF
•On scaled copies of Amazon reviews dataset
We can handle over 100 billion ratings
[Figure: CPU minutes (0 to 600) against millions of examples (0 to 1200); series: Common approach (in Spark), Rotational (in Giraph)]
Hybrid - common + rotational
Choose how to update item based on its degree
Network traffic per item:
•Common: #features * itemDegree
•Rotational SGD: #features * #workers
•Rotational ALS: #features * #features * #workers
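The three per-item traffic formulas can be compared directly. In the sketch below the formulas come from the slide, while the decision rule (pick whichever update sends less data) and the function names are assumptions; the talk only says the choice depends on the item's degree.

```python
def traffic_per_item(item_degree, num_features, num_workers):
    """Network traffic per item, per iteration, for each update strategy."""
    return {
        "common": num_features * item_degree,
        "rotational_sgd": num_features * num_workers,
        "rotational_als": num_features * num_features * num_workers,
    }


def choose_update(item_degree, num_features, num_workers, algorithm="sgd"):
    """Return 'rotational' or 'common', whichever sends less data (assumed rule)."""
    costs = traffic_per_item(item_degree, num_features, num_workers)
    rotational = costs["rotational_" + algorithm]
    return "rotational" if rotational < costs["common"] else "common"
```

Under this rule, SGD switches to the rotational update once an item's degree exceeds the number of workers, and ALS once it exceeds #features * #workers.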
Extensions
Rotating items
Slower connections can be a bottleneck
Solution: in every step send items between all workers
(#workers - 1) item sets on each worker
Decomposing complete graph into edge-disjoint Hamilton cycles
Construction using Latin squares
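A simple special case shows the idea. When the number of workers n is prime, the cyclic Latin square with entries (w + k) mod n yields n - 1 directed Hamilton cycles that together use every directed worker-to-worker link exactly once. This cyclic construction and the function names are assumptions for illustration; the slides do not spell out the construction, which presumably also covers non-prime worker counts.

```python
def cyclic_routes(num_workers):
    """Return routes where routes[k - 1][w] is the worker that receives,
    from worker w, the item set travelling along cycle k.

    Sketch only: requires a prime number of workers.
    """
    n = num_workers
    if n < 2 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
        raise ValueError("this sketch only handles a prime number of workers")
    # Row k of the cyclic Latin square: worker w forwards to (w + k) mod n.
    return [[(w + k) % n for w in range(n)] for k in range(1, n)]


def is_hamilton_cycle(route):
    """Check that following the route from worker 0 visits every worker once."""
    seen, w = set(), 0
    for _ in range(len(route)):
        seen.add(w)
        w = route[w]
    return w == 0 and len(seen) == len(route)
```

With these routes each worker holds n - 1 item sets, one per cycle, and in every step sends one of them to each of the other workers, so no single link carries all the traffic.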
Social signals
Incorporate social network information - social regularization
A user's latent features should be similar to those of their friends
Easy to add in Giraph model
Additional complexity: #friendships * #features
Solves cold start problem
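One common form of social regularization adds a penalty (beta / 2) * ||p_u - p_v||^2 for every friendship {u, v}, which pulls friends' latent vectors together. The sketch below applies one gradient step on that penalty alone. The exact penalty used in the talk is not given, so this form, the weight `beta`, the step size `lr` and the function name are assumptions.

```python
def social_regularization_step(P, friends, beta=0.1, lr=0.05):
    """One gradient step on (beta / 2) * sum over friendships ||p_u - p_v||^2.

    P:       list of user latent vectors
    friends: dict user -> list of friend ids (symmetric adjacency)
    Returns new user vectors; P itself is left unchanged.
    """
    new_P = [list(p) for p in P]
    for u, friend_ids in friends.items():
        for k in range(len(P[u])):
            # d/dp_u of the penalty: beta * sum over friends of (p_u - p_v).
            grad = beta * sum(P[u][k] - P[v][k] for v in friend_ids)
            new_P[u][k] = P[u][k] - lr * grad
    return new_P
```

Each friendship is visited twice and costs one pass over the feature vector, which matches the #friendships * #features overhead. A user with no ratings still receives this gradient, so their vector drifts toward their friends' vectors.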
Additional features
Tracking RMSE, average rank and AUC
Combining SGD & ALS
Different objective functions
•Implicit feedback
•Degree based regularization
Incremental training
Fast top K recommendations
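Once user and item vectors are trained, a top-K recommendation is the K items with the largest dot product against the user's vector. The brute-force sketch below scores every item and keeps the best K with a heap. The talk does not say how its fast top-K is implemented, so this is a baseline under that assumption, and `top_k_items` is a made-up name.

```python
import heapq


def top_k_items(user_vec, item_vecs, k, exclude=()):
    """Return the ids of the k items with the highest predicted score.

    user_vec:  latent vector of one user
    item_vecs: list of item latent vectors, indexed by item id
    exclude:   item ids to skip, for example items the user already rated
    """
    excluded = set(exclude)
    scored = (
        (sum(a * b for a, b in zip(user_vec, q)), item_id)
        for item_id, q in enumerate(item_vecs)
        if item_id not in excluded
    )
    # nlargest keeps only k candidates in memory while scanning all items.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]
```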
Item similarities
Calculate item similarities based on:
•Common users
•Global item properties
Adjustable formulas for easy experimentation
[Figure: items I1 and I2 linked through shared users u; their similarity is the unknown]
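As an example of a similarity based on common users, the sketch below uses the Jaccard index of the two items' user sets and takes the formula as a parameter, in the spirit of the adjustable formulas mentioned on the slide. Jaccard and the function names are illustrative assumptions; the talk does not name its formulas.

```python
def jaccard_similarity(users_a, users_b):
    """Shared users divided by all users who interacted with either item."""
    a, b = set(users_a), set(users_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def item_similarities(item_users, similarity=jaccard_similarity):
    """Compute a similarity for every unordered pair of items.

    item_users: dict item id -> collection of user ids
    similarity: any function of two user collections, swappable for experiments
    """
    items = sorted(item_users)
    return {
        (x, y): similarity(item_users[x], item_users[y])
        for idx, x in enumerate(items)
        for y in items[idx + 1:]
    }
```

Scoring all pairs is quadratic in the number of items, so at the scale described in the talk one would only score pairs that actually share users.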
Sample datasets
Users   Items   Ratings   Hive CPU hours   Giraph CPU hours
150M    15M     4B        10               3
1.3B    35M     15B       227              16
2.4B    8M      220B      963              87
Applications
Use user and item embeddings in ranking models
Get user-to-item scores in real time
Direct user recommendations
Context based recommendations
Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (100s of billions of ratings)
Utilizing social signals and item similarities
Many use cases at Facebook
Thank you!
tinyurl.com/fb-mf-cf
tinyurl.com/giraph-vldb-2015
Questions?

Maja Kabiljo, Software Engineer, Facebook Inc. at MLconf ATL - 9/18/15
