Recommendation systems are a fairly old topic, going back to the 1990s.
A meetup on them sounds like it would be boring... if we only talked about the standard user-item matrix collaborative filtering on big data systems.
Thankfully, for this meetup, we'll share how to adopt some more recent techniques to recommend products, including social media graphs (and random walks), sequences (and NLP), and PyTorch. The sharing will cover everything from data acquisition and preparation, to the implementation of multiple techniques, to a comparison of results. Some familiarity with Python and PyTorch would be useful; minimal math required.
2. About me
§ Lead Data Scientist @ health-tech startup
- Early detection of preventable diseases
- Healthcare resource allocation
§ Previously: VP, Data Science @ Lazada
- E-commerce ML systems
- Facilitated integration with Alibaba
§ More at https://eugeneyan.com
5. Topics*
§ Data Acquisition, Preparation, Split, etc.
§ Conventional Baseline
§ Applying Graph and NLP approaches
* Implementation and results discussed throughout
10. Parsing JSON
§ Requires parsing JSON into tabular form
§ Fairly large, with the largest file having 142.8 million rows and 20 GB on disk
§ Cannot be fully loaded into RAM on a regular laptop (16 GB RAM)
11. def parse_json_to_csv(read_path: str, write_path: str) -> None:
    # Assumes: `import csv`; `parse` yields one dict per JSON record; `logger` is configured elsewhere
    with open(write_path, 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        for i, d in enumerate(parse(read_path)):
            if i == 0:
                csv_writer.writerow(d.keys())  # Write the header once
            # Lowercase string values for consistency
            csv_writer.writerow([v.lower() if isinstance(v, str) else v for v in d.values()])
            if (i + 1) % 10000 == 0:
                logger.info('Rows processed: {:,}'.format(i + 1))
    logger.info('Csv saved to {}'.format(write_path))
19. Splitting the data
§ Random split: 2/3 train, 1/3 validation
§ Easy, right?
§ Not so fast! Our dataset only has positive product-pairs—how do we validate?
22. Creating negative samples
§ Direct approach: Random sampling
- To create 1 million negative product-pairs, call random 2 million times—very slow!
§ Hack: Add products to an array, shuffle, slice to sample; re-shuffle when exhausted—fast! (see the sketch below)
products
----------
B001T9NUFS
0000031895
B007ZN5Y56
0000031909
B00CYBULSO
B004FOEEHC
Successive slices of the shuffled array yield negative product-pairs 1, 2, 3, ...
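A minimal sketch of the shuffle-and-slice hack, under stated assumptions: `products` is the list of product IDs above, and pairing adjacent array positions is an illustrative choice, not necessarily the exact implementation.

import random

def negative_pairs(products, n_pairs):
    """Yield negative product-pairs by slicing a shuffled array of product IDs."""
    pool = list(products)
    random.shuffle(pool)
    idx = 0
    for _ in range(n_pairs):
        if idx + 2 > len(pool):  # Pool exhausted: re-shuffle and start slicing from the top
            random.shuffle(pool)
            idx = 0
        yield pool[idx], pool[idx + 1]  # Each slice of two adjacent products = one negative pair
        idx += 2

For example, `neg = list(negative_pairs(product_ids, 1_000_000))` draws a million negative pairs with far fewer random calls than sampling each product individually.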
26. Batch MF
§ Common approach 1: Load matrix in memory; apply a Python package (e.g., scipy.svd, surprise, etc.); see the sketch below
§ Common approach 2: Run on a cluster with SparkML Alternating Least Squares
§ Very resource intensive!
- Is there a smarter way, given the sparse data?
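For reference, a minimal sketch of "common approach 1" using scipy's truncated SVD on a sparse matrix; the variable names (`rows`, `cols`, `vals`) and k=128 latent factors are assumptions for illustration.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Build a sparse interaction matrix from integer-indexed product-pairs and weights
mat = csr_matrix((vals, (rows, cols)), dtype=np.float64)

# Truncated SVD keeps only k latent dimensions
u, s, vt = svds(mat, k=128)
product_factors = u * s  # One dense factor vector per row-product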
28. Iterative MF
§ Only load (or read from disk) product-pairs, instead of the entire matrix that contains zeros
§ Matrix factorization by iterating through each product-pair
29. Iterative MF (numeric labels)
# embedding (nn.Embedding), optimizer, torch, and F (torch.nn.functional) are set up elsewhere
for (product1, product2), label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair score (interaction term: elementwise product, then sum)
    prediction = torch.sum(product1_emb * product2_emb, dim=1)
    # Minimize mean squared error against the numeric label
    loss = F.mse_loss(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
33. Iterative MF (binary labels)
for (product1, product2), label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair probability (interaction term, sum, then sigmoid)
    prediction = torch.sigmoid(torch.sum(product1_emb * product2_emb, dim=1))
    # Minimize binary cross-entropy against the binary label
    loss = F.binary_cross_entropy(prediction, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
34. Regularize!
for (product1, product2), label in train_set:
    # Get embedding for each product
    product1_emb = embedding(product1)
    product2_emb = embedding(product2)
    # Predict product-pair probability (interaction term, sum, then sigmoid)
    prediction = torch.sigmoid(torch.sum(product1_emb * product2_emb, dim=1))
    # L2 penalty on embedding weights (`lambda` is a Python keyword, so use reg_lambda)
    l2_reg = reg_lambda * torch.sum(embedding.weight ** 2)
    # Minimize loss (binary cross-entropy + L2 regularization)
    loss = F.binary_cross_entropy(prediction, label) + l2_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
38. Results (MF)
Binary labels: AUC-ROC = 0.8083; Time for 5 epochs = 45 min
Continuous labels: AUC-ROC = 0.9225; Time for 5 epochs = 45 min
Figure 3a and 3b. Precision recall curves for Matrix Factorization
40. Learning curve (MF)
Figure 4. AUC-ROC across epochs for matrix factorization; each time the learning rate is reset, the model seems to "forget", causing AUC-ROC to revert to ~0.5. Also, a single epoch seems sufficient.
44. Results (MF-bias)
Binary labels: AUC-ROC = 0.7951; Time for 5 epochs = 45 min
Continuous labels: AUC-ROC = 0.8319; Time for 5 epochs = 45 min
Figure 5a and 5b. Precision recall curves for Matrix Factorization with bias
46. Off the Beaten Path
Natural language processing (“NLP”) and Graphs in RecSys
47. Word2Vec
§ In 2013, Tomas Mikolov et al. published two seminal papers on Word2Vec ("w2v")
§ Demonstrated w2v could learn semantic and syntactic word vector representations
§ TL;DR: Converts words into numbers (arrays)
48. DeepWalk
§ Unsupervised learning of representations of
nodes (i.e., vertices) in a social network
§ Generate sequences from random walks
on (social) graph
§ Learn vector representations of nodes
(e.g., profiles, content)
50. How do NLP and Graphs matter?
§ Create graph from product-pairs + weights
§ Generate sequences from graph (via random walk)
§ Learn product embeddings (via word2vec)
§ Recommend based on embedding similarity (e.g., cosine similarity, dot product); a sketch of this pipeline follows below
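A minimal sketch of the last two steps of this pipeline, assuming `walks` holds the random-walk sequences of product IDs; the gensim hyperparameters shown are illustrative defaults, not the exact settings used.

from gensim.models import Word2Vec

# Treat each random walk as a "sentence" of product IDs and learn skip-gram embeddings
model = Word2Vec(sentences=walks, vector_size=128, window=5, sg=1,
                 negative=5, min_count=1, workers=4, epochs=5)

# Recommend by embedding similarity (gensim uses cosine similarity here)
similar_products = model.wv.most_similar('B001T9NUFS', topn=10)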
55. Creating a product graph
§ We have product-pairs and weights
- These are our graph edges
§ Create a weighted graph with networkx (see the sketch below)
- Each graph edge is given a numerical weight, instead of all edges having the same weight
product1 | product2 | weight
--------------------------------
B001T9NUFS | B003AVEU6G | 0.5
0000031895 | B002R0FA24 | 0.5
B007ZN5Y56 | B005C4Y4F6 | 0.5
0000031909 | B00538F5OK | 1.0
B00CYBULSO | B00B608000 | 1.1
B004FOEEHC | B00D9C32NI | 1.2
Table 2. Product-pairs and weights
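A minimal sketch of building the weighted graph with networkx; `pair_df`, a pandas DataFrame shaped like Table 2, is a hypothetical name used for illustration.

import networkx as nx

# Each row of pair_df (product1, product2, weight) becomes one weighted edge
graph = nx.Graph()
graph.add_weighted_edges_from(
    pair_df[['product1', 'product2', 'weight']].itertuples(index=False, name=None)
)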
57. Random Walks
§ Direct approach: Traverse the networkx graph
- For 10 sequences of length 10 from a starting node, need to traverse 100 times
- 2 mil nodes for the books graph = 200 mil queries
- Very slow and memory intensive
§ Hack: Work directly on transition probabilities (see the sketch below)
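A minimal sketch of walking via transition probabilities instead of traversing the graph; the dense matrix `trans_prob` (row i = probabilities of stepping from node i to every other node) is a simplifying assumption, since the actual implementation would likely use a sparse representation.

import numpy as np

def random_walks(trans_prob, walk_len=10, walks_per_node=10):
    """Generate walks of node indices; each row of trans_prob must sum to 1."""
    n_nodes = trans_prob.shape[0]
    walks = []
    for start in range(n_nodes):
        for _ in range(walks_per_node):
            walk = [start]
            for _ in range(walk_len - 1):
                # Sample the next node directly from the current node's transition probabilities
                walk.append(np.random.choice(n_nodes, p=trans_prob[walk[-1]]))
            walks.append(walk)
    return walks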
64. Node2Vec
§ Seemed to work out of the box
- Just need to provide edges
- Uses networkx and gensim under the hood
§ But very memory intensive and slow
- Could not run to completion even with 64 GB RAM
https://github.com/aditya-grover/node2vec
68. Results (gensim w2v)
All products: AUC-ROC = 0.9082; Time for 5 epochs = 2.58 min
Seen products only: AUC-ROC = 0.9735; Time for 5 epochs = 2.58 min
Figure 6a and 6b. Precision recall curves for gensim.word2vec
71. Data Loader
§ Input sequences instead of product-pairs
§ Implements two features from w2v papers
- Subsampling of frequent words
- Negative sampling
72. Data Loader (subsampling)
§ Drop out words of higher frequency (see the sketch below)
- Frequency of 0.0026 = 0.0 dropout
- Frequency of 0.00746 = 0.5 dropout
- Frequency of 1.0 = 0.977 dropout
§ Accelerated learning and improved vectors of rare words
Dropout Prob(word) = 1 − (√(Freq(word) / 0.001) + 1) × (0.001 / Freq(word))
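A minimal sketch of the subsampling rule above; `freq` is the product's share of total occurrences, and the 0.001 threshold follows the formula on this slide.

import random

def keep_token(freq, t=0.001):
    """Keep or drop one occurrence of a token; higher-frequency tokens are dropped more often."""
    p_drop = 1 - ((freq / t) ** 0.5 + 1) * (t / freq)
    return random.random() > p_drop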
75. Data Loader (negative sampling)
§ Original skip-gram ends with SoftMax
- If vocab = 10k words and embedding dim = 128, that's 1.28 million weights to update—expensive!
- In RecSys, the "vocab" is in the millions
§ Negative sampling (see the sketch below)
- Only modify weights of the negative pair samples
- If 6 pairs (1 pos, 5 neg) and 1 mil products, only update 0.0006% of weights—very efficient!
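A minimal sketch of drawing negatives from the unigram distribution raised to the 3/4 power, as in the w2v papers; `counts` (a numpy array of occurrences per product index) is a hypothetical name, and the talk's data loader may sample differently.

import numpy as np

# Negative-sampling distribution: frequency ** 0.75, normalized
neg_probs = counts ** 0.75
neg_probs = neg_probs / neg_probs.sum()

def sample_negatives(n_neg=5):
    """Draw n_neg negative product indices for one positive (center, context) pair."""
    return np.random.choice(len(neg_probs), size=n_neg, p=neg_probs)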
82. Results (w2v)
All products: AUC-ROC = 0.9554; Time for 5 epochs = 23.63 min
Seen products only: AUC-ROC = 0.9855; Time for 5 epochs = 23.63 min
Figure 7a and 7b. Precision recall curves for PyTorch Word2Vec
84. Overall results so far
§ Improvement on gensim.word2vec and the Alibaba paper

Implementation | All products | Seen products only
---------------|--------------|--------------------
PyTorch MF | 0.7951 | -
Gensim w2v | 0.9082 | 0.9735
PyTorch w2v | 0.9554 | 0.9855
Alibaba paper* | 0.9327 | -

* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 4. AUC-ROC across various implementations
89. Extending w2v
§ For each product, we have information like category, brand, price group, etc.
- Why not add this when learning embeddings?
§ The Alibaba paper reported an AUC-ROC improvement from 0.9327 to 0.9575
Sequence: B001T9NUFS -> B003AVEU6G -> B007ZN5Y56 ... -> B007ZN5Y56
Category: Television | Sound bar | Lamp | Standing Fan
Brand: Sony | Sony | Phillips | Dyson
Price group: 500 – 600 | 200 – 300 | 50 – 75 | 300 – 400
90. Weighting side info
§ Two versions were implemented (see the sketch below)
§ 1: Equal-weighted average of embeddings
§ 2: Learn a weight for each embedding and apply a weighted average
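A minimal sketch of the two variants, assuming a product's own embedding and its side-info embeddings are stacked as `embs` with shape (num_fields, emb_dim); the softmax-normalized learnable weights are an illustrative choice, not the exact implementation.

import torch
import torch.nn.functional as F

# Version 1: equal-weighted average of product + side-info embeddings
combined_v1 = embs.mean(dim=0)

# Version 2: learn one weight per embedding field, then take a weighted average
field_weights = torch.nn.Parameter(torch.ones(embs.shape[0]))  # One learnable weight per field
weights = F.softmax(field_weights, dim=0)                      # Normalize weights to sum to 1
combined_v2 = (weights.unsqueeze(1) * embs).sum(dim=0)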
92. Why doesn’t it work?!
§ Perhaps due to sparsity of metadata
- Of 418,749 electronics, metadata available for 162,023 (39%); of these, brand was 51% empty
§ But I assumed the weights of the (useless) embeddings would be learnt—¯\_(ツ)_/¯
§ An example of more data ≠ better
95. Mixing it up to pull it apart
§ Why does w2v perform so much better?
§ For the fun of it, let's use the MF-bias model with sequence data (as used in w2v)
96. Results & learning curve
All products: AUC-ROC = 0.9320; Time for 5 epochs = 70.39 min
Figure 10a and 10b. Precision recall curve and learning curve for PyTorch MF-bias with sequences
98. Embed
everything
§ Building user embeddings in the same vector
space as products (Airbnb)
- Train user embeddings based on interactions with
products (e.g., click, ignore, purchase)
§ Embed all discrete features and just learn
similarities (Facebook)
§ Graph Neural Networks for embeddings;
node neighbors as representation (Uber Eats)
100. Overall results (electronics)

Implementation | All products | Seen products only | Runtime (min)
---------------|--------------|--------------------|---------------
PyTorch MF | 0.7951 | - | 45
Gensim w2v | 0.9082 | 0.9735 | 2.58
PyTorch w2v | 0.9554 | 0.9855 | 23.63
PyTorch w2v with side info | NA | NA | NA
PyTorch MF with sequences | 0.9320 | - | 70.39
Alibaba paper* | 0.9327 | - | -

* Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba (https://arxiv.org/abs/1803.02349)
Table 5. AUC-ROC across various implementations (electronics)
101. Overall results (books)

Implementation | All products | Seen products only | Runtime (min)
---------------|--------------|--------------------|---------------
PyTorch MF | 0.4996 | - | 1353.12
Gensim w2v | 0.9701 | 0.9892 | 16.24
PyTorch w2v | 0.9775 | - | 122.66
PyTorch w2v with side info | NA | NA | NA
PyTorch MF with sequences | 0.7196 | - | 1393.08

Table 6. AUC-ROC across various implementations (books)
102. § Don’t just look at numeric metrics—plot some curves!
- Especially if you need some arbitrary threshold (i.e., classification)
§ Matrix Factorization is an okay-ish baseline
§ Word2vec is a great baseline
§ Training on sequences is epic
107. References
McAuley, J., Targett, C., Shi, Q., & Van Den Hengel, A. (2015, August). Image-based
recommendations on styles and substitutes. In Proceedings of the 38th International ACM
SIGIR Conference on Research and Development in Information Retrieval (pp. 43-52). ACM.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781.
Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social
representations. In Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 701-710). ACM.
Grover, A., & Leskovec, J. (2016, August). node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery
and data mining (pp. 855-864). ACM.
108. References
Wang, J., Huang, P., Zhao, H., Zhang, Z., Zhao, B., & Lee, D. L. (2018, July). Billion-scale
commodity embedding for e-commerce recommendation in Alibaba. In Proceedings of the
24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp.
839-848). ACM.
Grbovic, M., & Cheng, H. (2018, July). Real-time personalization using embeddings for search
ranking at Airbnb. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (pp. 311-320). ACM.
Wu, L. Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., & Weston, J. (2018, April). StarSpace:
Embed all the things! In Thirty-Second AAAI Conference on Artificial Intelligence.
Food Discovery with Uber Eats: Using Graph Learning to Power Recommendations,
https://eng.uber.com/uber-eats-graph-learning/, retrieved 10 Jan 2020