Large-scale item recommendations with Apache Giraph – This is joint work with Aleksandar Ilic, Facebook Inc.: Recommendation systems try to make personalized item recommendations to users based on available historical information. One well-known recommendation technique is Collaborative Filtering, which is often solved with matrix factorization of a sparse user-item matrix of known ratings. In this talk, we will describe our scalable implementation of SGD and ALS methods for Collaborative Filtering on top of Apache Giraph (an iterative graph processing system built for high scalability on big data).
In order to scale our implementation to over a billion users and tens of millions of items, we developed novel methods for distributing the problem and added several extensions to the Giraph framework. Experiments show that our implementation is up to 10x faster than some of the leading open source implementations in this domain (e.g. Spark MLlib) on the Amazon benchmark data while maintaining the same output quality.
We will describe several additional techniques for handling Facebook’s data (e.g. implicit and skewed item data, different offline metrics) that are required in page and group recommendations. To complete our comprehensive approach for computing recommendations at Facebook, we also implemented an efficient method for finding top-k recommendations per user and item-based recommendations with pairwise item similarities that is easily extendable with different formulas.
7. Challenges
Scale
•100s of billions of (user, item) pairs
•Over a billion users
•Tens of millions of items
Performance
•Train models and iterate quickly
•Use more features
10. What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
11. What is Apache Giraph?
Runs on top of Hadoop
Map-only jobs
Keeps data in memory
Mappers communicate through the network
14. Common approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD updates differ from those in the sequential solution
[Figure: Worker 1, Worker 2, Worker 3 with items I1, I2, I3, I4]
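To make the "data sent per iteration" bullet concrete, here is a rough back-of-the-envelope sketch. The 100 billion ratings figure comes from the scale mentioned in this talk; the 100 features and 4-byte values are assumptions chosen only for illustration, and `common_traffic_bytes` is a hypothetical helper name.

```python
def common_traffic_bytes(num_known_ratings, num_features, bytes_per_value=4):
    """Bytes sent per iteration when one feature vector crosses every rating edge.

    Implements #knownRatings * #features from the slide; bytes_per_value
    (4, i.e. a 32-bit float) is an assumption for illustration.
    """
    return num_known_ratings * num_features * bytes_per_value


# 100 billion ratings (scale from the talk), 100 features (assumed).
traffic = common_traffic_bytes(100_000_000_000, 100)
print(f"{traffic / 1e12:.0f} TB per iteration")
```

Under these assumptions, every iteration would move tens of terabytes over the network, which is why the traffic bullet leads the list of problems.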
15. Our solution - rotational approach
[Figure: Worker 1, Worker 2, Worker 3 with item sets 1, 2, 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
Users are vertices, items are worker data
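The rotational idea can be sketched as a single-process simulation. This is only an illustration under stated assumptions, not the Giraph implementation: users and items are assigned to workers and item sets by id modulo `num_workers`, the workers are simulated one after another rather than in parallel, and all function names and hyperparameters are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgd_step(p, q, rating, lr, reg):
    """In-place SGD update of one user vector p and one item vector q."""
    err = rating - dot(p, q)
    for f in range(len(p)):
        pf, qf = p[f], q[f]
        p[f] += lr * (err * qf - reg * pf)
        q[f] += lr * (err * pf - reg * qf)


def rotational_epoch(ratings, user_vecs, item_vecs, num_workers, lr=0.02, reg=0.01):
    """One epoch: every item set visits every worker exactly once.

    Users stay on worker (user_id % num_workers); item set s holds items with
    (item_id % num_workers) == s. Both assignments are assumptions.
    Returns the number of ratings processed.
    """
    buckets = {}
    for u, i, r in ratings:
        buckets.setdefault((u % num_workers, i % num_workers), []).append((u, i, r))
    processed = 0
    for step in range(num_workers):
        # Within a step, workers touch disjoint users and disjoint item sets,
        # so in a real deployment this inner loop could run in parallel.
        for worker in range(num_workers):
            item_set = (worker + step) % num_workers
            for u, i, r in buckets.get((worker, item_set), []):
                sgd_step(user_vecs[u], item_vecs[i], r, lr, reg)
                processed += 1
    return processed


def rmse(ratings, user_vecs, item_vecs):
    se = sum((r - dot(user_vecs[u], item_vecs[i])) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


random.seed(0)
num_users, num_items, k = 12, 9, 3
ratings = [(u, i, float((u + i) % 5 + 1))
           for u in range(num_users) for i in range(num_items)
           if (u * 7 + i * 3) % 4 != 0]
user_vecs = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_users)]
item_vecs = [[random.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_items)]

before = rmse(ratings, user_vecs, item_vecs)
for _ in range(100):
    processed = rotational_epoch(ratings, user_vecs, item_vecs, num_workers=3)
after = rmse(ratings, user_vecs, item_vecs)
print(f"RMSE before: {before:.3f}, after: {after:.3f}")
```

Because each item vector lives on exactly one worker at a time, every update sees the latest item values, which is closer to the sequential SGD solution than averaging conflicting updates sent along edges.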
17. Comparison with Spark MLlib
Spark MLlib ALS CF
•On scaled copies of the Amazon reviews dataset
We can handle over 100 billion ratings
[Figure: CPU minutes (0 to 600) vs. millions of examples (0 to 1200); series: Common approach (in Spark), Rotational (in Giraph)]
18. Hybrid - common + rotational
Choose how to update each item based on its degree
Network traffic per item:
•Common: #features * itemDegree
•Rotational SGD: #features * #workers
•Rotational ALS: #features * #features * #workers
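The three traffic formulas above can be turned into a simple per-item decision. The slide only says the choice depends on item degree; picking whichever approach sends less data is an assumed rule, and `traffic_per_item` and `choose_update` are hypothetical names.

```python
def traffic_per_item(item_degree, num_features, num_workers, method="sgd"):
    """Network traffic per item (in feature values) for both approaches."""
    common = num_features * item_degree
    if method == "sgd":
        rotational = num_features * num_workers
    elif method == "als":
        rotational = num_features * num_features * num_workers
    else:
        raise ValueError("method must be 'sgd' or 'als'")
    return {"common": common, "rotational": rotational}


def choose_update(item_degree, num_features, num_workers, method="sgd"):
    """Pick the cheaper approach for this item (assumed rule; ties go to common)."""
    costs = traffic_per_item(item_degree, num_features, num_workers, method)
    return min(costs, key=costs.get)


print(choose_update(10, 100, 200))         # low-degree item
print(choose_update(1_000_000, 100, 200))  # high-degree item
```

With SGD the break-even point under this rule is itemDegree = #workers; with ALS it is itemDegree = #features * #workers, so fewer items qualify for rotation.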
20. Rotating items
Slower connections can be a bottleneck
Solution: in every step, send items between all workers
21. Rotating items
(#workers - 1) item sets on each worker
Decomposing the complete graph into edge-disjoint Hamilton cycles
Construction using Latin squares
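One simple way to get such a decomposition, shown here only as a sketch, works when the number of workers is prime: item set k on worker w is always sent to worker (w + k) mod #workers. The talk does not spell out its construction, so the prime restriction and the cyclic rule are assumptions; the table L[w][k] = (w + k) mod #workers is a cyclic Latin square, which is one possible reading of the "Latin squares" bullet.

```python
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def rotation_cycles(num_workers):
    """Return {k: cycle}: the order in which item set k visits the workers.

    Assumes a prime number of workers so that every offset k generates a
    cycle through all workers (a directed Hamilton cycle).
    """
    if not is_prime(num_workers):
        raise ValueError("this sketch assumes a prime number of workers")
    return {k: [(j * k) % num_workers for j in range(num_workers)]
            for k in range(1, num_workers)}


def arcs(cycle):
    """Directed worker-to-worker links used by one cycle."""
    n = len(cycle)
    return [(cycle[j], cycle[(j + 1) % n]) for j in range(n)]


cycles = rotation_cycles(5)
used = [arc for cycle in cycles.values() for arc in arcs(cycle)]
all_links = {(a, b) for a in range(5) for b in range(5) if a != b}
print(len(used), len(set(used)), set(used) == all_links)
```

In this sketch each of the (#workers - 1) cycles uses a different set of links, so every step spreads traffic over all worker-to-worker connections instead of only the links of a single ring.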
22. Social signals
Incorporate social network information - social regularization
A user’s latent features should be similar to those of their friends
23. Social signals
Easy to add in Giraph model
Additional complexity: #friendships * #features
Solves the cold-start problem
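A minimal sketch of the social term, assuming the common form of social regularization that adds (beta / 2) * sum over friends of ||p_u - p_friend||^2 to the objective. The exact objective used in the talk is not given, so this form, `beta`, and the function names are assumptions.

```python
def social_gradient(user_vec, friend_vecs, beta):
    """Gradient of (beta / 2) * sum_f ||p_u - p_f||^2 with respect to p_u.

    One pass over the user's friends, #features work per friend, which is
    where the #friendships * #features cost comes from.
    """
    grad = [0.0] * len(user_vec)
    for friend in friend_vecs:
        for f, (pu, pf) in enumerate(zip(user_vec, friend)):
            grad[f] += beta * (pu - pf)
    return grad


def social_step(user_vec, friend_vecs, beta, lr):
    """Gradient step on the social term only; returns a new vector."""
    grad = social_gradient(user_vec, friend_vecs, beta)
    return [pu - lr * g for pu, g in zip(user_vec, grad)]


# A user with no ratings still gets pulled toward their friends' vectors.
new_vec = social_step([1.0, 0.0], [[0.0, 0.0], [0.0, 2.0]], beta=0.5, lr=0.1)
print(new_vec)
```

In a vertex-centric model this term only needs friends to exchange their feature vectors as messages, which fits the "easy to add in Giraph model" point above.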
24. Additional features
Tracking RMSE, average rank, and AUC
Combining SGD & ALS
Different objective functions
•Implicit feedback
•Degree-based regularization
Incremental training
Fast top-K recommendations
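For reference, top-K retrieval from learned embeddings can be sketched as a brute-force baseline: score every item with a dot product and keep the K best. This is not the efficient method the talk refers to, only the straightforward version it would improve on; `top_k_items` and the toy vectors are hypothetical.

```python
import heapq


def top_k_items(user_vec, item_vecs, k, exclude=()):
    """Brute-force top-K: dot-product score for every item, keep the K highest.

    item_vecs maps item id -> feature vector; exclude holds already-rated items.
    """
    scored = ((sum(p * q for p, q in zip(user_vec, vec)), item_id)
              for item_id, vec in item_vecs.items()
              if item_id not in exclude)
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


items = {"a": [0.9, 0.0], "b": [0.1, 5.0], "c": [0.5, 0.0], "d": [0.7, 1.0]}
print(top_k_items([1.0, 0.0], items, k=2))
print(top_k_items([1.0, 0.0], items, k=2, exclude={"a"}))
```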
25. Calculate item similarities based on:
•Common users
•Global item properties
Adjustable formulas for easy experimentation
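A small sketch of what "adjustable formulas" could look like: the similarity is a pluggable function of the number of common users and a global item property. Using item degree (number of users per item) as that global property, and Jaccard and cosine as the example formulas, are assumptions for illustration.

```python
import math


def jaccard(common, deg_a, deg_b):
    union = deg_a + deg_b - common
    return common / union if union else 0.0


def cosine(common, deg_a, deg_b):
    denom = math.sqrt(deg_a * deg_b)
    return common / denom if denom else 0.0


def item_similarity(users_a, users_b, formula=jaccard):
    """Similarity of two items from their user sets, using a swappable formula."""
    common = len(users_a & users_b)
    return formula(common, len(users_a), len(users_b))


i1_users = {1, 2, 3, 4}
i2_users = {3, 4, 5, 6}
print(item_similarity(i1_users, i2_users))
print(item_similarity(i1_users, i2_users, formula=cosine))
```

Trying a new formula then only means passing a different function, without touching the code that counts common users.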
[Figure: Item similarities. Items I1 and I2 connected through common users (u); their similarity is marked "?"]
Sample datasets
                   Dataset 1   Dataset 2   Dataset 3
Users              150M        1.3B        2.4B
Items              15M         35M         8M
Ratings            4B          15B         220B
Hive CPU hours     10          227         963
Giraph CPU hours   3           16          87
26. Applications
Use user and item embeddings in ranking models
Get user-to-item scores in real time
Direct user recommendations
Context-based recommendations
28. Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (100s of billions of ratings)
Utilizing social signals and item similarities
Many use cases at Facebook