In recent years, the field of Machine Learning and the tooling that supports its developer community have improved enormously. Even so, implementing a recommender system remains hard.
That is why at Crossing Minds we decided to create a series of four meetups on how to implement a recommender system end to end:
Part 1 – The Right Dataset
Part 2 – Model Training
Part 3 – Model Evaluation
Part 4 – Real-Time Deployment
This first meetup is about building the right dataset and doing all the preprocessing needed to train different models. We will cover explicit vs. implicit feedback, dataset analysis, likes/dislikes vs. ratings, user and item features, normalization, and similarities.
1. Recommender Systems from A to Z – The Right Dataset
2. Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
4. 1. Having the Right Data
Explicit vs Implicit
Likes/Dislikes vs Ratings
2. Rating Dataset Analysis
Density
Connectivity
3. Item Features and User Features
Unsupervised Learning
Supervised Learning
4. Data Preprocessing
Unsupervised Dimensionality Reduction
Supervised Dimensionality Reduction
6. Having the Right Data – Explicit & Implicit Feedback
Explicit Feedback
Implicit Feedback
7. Having the Right Data – Explicit & Implicit Feedback
Explicit Feedback
● Expresses the preferences themselves
● Clean data (aligned with your goal)
● Costly to collect
Implicit Feedback
● Offers a level of confidence on user preferences
● Very easy to collect a lot of it
● Dangerous to interpret
10. Having the Right Data – Implicit vs Like/Dislike vs Ratings
Explicit Feedback
● Classification (e.g. like/dislike/skip)
● Regression (e.g. star ratings) => best data to compute absolute predictions of taste
● Ranking (e.g. pairwise comparisons) => best data to compute top-k recommendations
Implicit Feedback
● With implicit negative feedback (e.g. watch-time or play-time of media, like/skip actions)
● Without implicit negative feedback (e.g. likes only, search history, purchase history)
=> test evaluation requires bias correction, and model selection is very hard
Take-Home
The data you have affects how you train your models!
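Since implicit signals only offer a level of confidence, a common trick (popularized by implicit-feedback matrix factorization) is to split each signal into a binary preference and a confidence weight. A minimal numpy sketch, with hypothetical watch-time data and an illustrative `alpha`:

```python
import numpy as np

# Hypothetical implicit-feedback log: fraction of each video watched.
# Rows = users, columns = items; 0 means "never played" (no explicit negative).
watch_fraction = np.array([
    [0.9, 0.0, 0.1],
    [0.0, 0.5, 0.8],
])

# Binarize into preferences p, and keep the raw signal as a confidence
# weight c = 1 + alpha * r, so "watched 90%" counts far more during
# training than "watched 10%".
alpha = 40.0
preference = (watch_fraction > 0).astype(float)
confidence = 1.0 + alpha * watch_fraction

print(preference)
print(confidence)
```

The choice of `alpha` is a tuning knob, not a rule; the point is that the raw signal becomes a weight, never a label.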
13. Context
“Context is any information that can be used to characterize the situation of an entity” – Anind K. Dey, 2001
Representative Context
Fully observable and static
Interactive Context
Non-fully observable and dynamic
15. Context – Model
Rating Dataset
Instead of tuples (user, item, rating), we consider (user, item, context, rating).
Model
For similarity-based models (user-user or item-item), we need to modify how we compute the similarity to take context into account.
For matrix-factorization models (user-item), we need to add a dimension and use tensor factorization instead, which is much more challenging.
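One lightweight way to make an existing similarity-based model context-aware, short of full tensor factorization, is "item splitting": treat each (item, context) pair as a distinct pseudo-item and reuse the usual user–item machinery unchanged. A sketch with hypothetical tuples:

```python
import numpy as np

# Hypothetical 4-tuples: (user, item, context, rating).
ratings = [
    ("u1", "i1", "weekday", 4.0),
    ("u1", "i1", "weekend", 2.0),
    ("u2", "i1", "weekday", 5.0),
    ("u2", "i2", "weekday", 3.0),
]

# Item splitting: each (item, context) pair becomes its own column,
# so a plain user x pseudo-item matrix absorbs the context dimension.
pseudo_items = sorted({(i, c) for _, i, c, _ in ratings})
users = sorted({u for u, _, _, _ in ratings})
R = np.zeros((len(users), len(pseudo_items)))
for u, i, c, r in ratings:
    R[users.index(u), pseudo_items.index((i, c))] = r

print(pseudo_items)
print(R)
```

The trade-off: the pseudo-item matrix gets sparser as contexts multiply, which is exactly the density problem discussed next.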
19. Rating Dataset Analysis – Density & Connectivity
General Principle in Collaborative Filtering
The ability to learn anything about a user or an item is driven by its degree in the graph.
The ability to recommend an item to a user is driven by how connected they are in the graph.
Density and Sparsity
Density of a graph with users, items and ratings = |ratings| / (|users| × |items|) (typically in [0.001–0.01])
Connectivity
No information is learnt from a user or an item with degree one.
Example: if one user has 100 ratings on items that each have only one rating, we can remove all these items, the user, and its 100 ratings from the dataset.
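The density formula and the degree-one pruning rule can be checked on a toy rating graph (one pruning pass shown; in practice you repeat until degrees stabilize, since removing items changes user degrees):

```python
import numpy as np

# Toy rating graph: rows = users, columns = items, non-zero = a rating exists.
R = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
])

n_users, n_items = R.shape
n_ratings = np.count_nonzero(R)
density = n_ratings / (n_users * n_items)   # |ratings| / (|users| * |items|)
print(f"density = {density:.3f}")

# Degree-1 nodes carry no collaborative signal: an item rated once, or a
# user with a single rating, can be pruned without losing learnable structure.
item_degree = np.count_nonzero(R, axis=0)
user_degree = np.count_nonzero(R, axis=1)
R_pruned = R[np.ix_(user_degree > 1, item_degree > 1)]
print(R_pruned.shape)
```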
31. Items & Users Features
1. Quantitative Features
2. Knowledge Graph
3. Deep Content Extraction
32. Items & Users Features – Quantitative Features
Discrete
● number of episodes in a TV show
● number of purchases made by a user
Continuous
● price of an item
● age of a user
● movie budget
● release date
33. Items & Users Features – Quantitative Features
For similarity-based models (user-user or item-item):
concatenate the rating matrices and the features, and use the same similarity metric (e.g. dot product)
C = Cost, Y = Year, D = Duration
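A minimal sketch of this concatenation, with made-up ratings and C/Y/D features (standardized first so the feature block does not dominate the dot product):

```python
import numpy as np

# Item-item similarity when items have both ratings and quantitative features.
# Rows = items. Ratings block: one column per user; feature block: cost,
# year, duration.
ratings = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 3.0],
])
features = np.array([
    [20.0, 1999.0, 120.0],
    [25.0, 2001.0, 110.0],
])
# Standardize each feature column to zero mean / unit variance.
features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-9)

X = np.hstack([ratings, features])   # concatenate the two blocks
sim = X @ X.T                        # same similarity metric (dot product)
print(sim)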
34. Items & Users Features – Quantitative Features
For embedding-based models (user-item):
compute the embeddings on the rating matrices only, then concatenate the embeddings with the features
C = Cost, Y = Year, D = Duration
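For the embedding-based route, a sketch using a truncated SVD as the embedding step (a stand-in for whatever factorization you actually train), with a single hypothetical standardized feature:

```python
import numpy as np

# Factor the rating matrix first, then append quantitative features
# to the learned item embeddings.
R = np.array([               # item x user rating matrix (toy)
    [5.0, 0.0, 3.0],
    [4.0, 1.0, 3.0],
    [0.0, 2.0, 1.0],
])
features = np.array([[1.2], [0.8], [2.0]])   # e.g. standardized cost

# Item embeddings from a rank-2 truncated SVD of R.
d = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
item_emb = U[:, :d] * s[:d]                  # (n_items, d)

# Final item representation = [embedding | features].
item_repr = np.hstack([item_emb, features])
print(item_repr.shape)  # (3, 3)
```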
35. Items & Users Features – Knowledge Graph
1. Quantitative Features
2. Knowledge Graph
3. Deep Content Extraction
36. Items & Users Features – Knowledge Graph
One-to-many (Categorical)
● type of item
● author of a book
● gender of user
Many-to-many (Ontological)
● tags/labels/genres of an item
● all actors of a movie
● selected preferences of user
37. Items & Users Features – Knowledge Graph
For similarity-based models (item-item, user-user):
concatenate the rating matrices and the knowledge graph seen as a sparse matrix, and use the same similarity metric (e.g. dot product)
D = Drama, A = Action, R = Romance
38. Items & Users Features – Knowledge Graph
For embedding-based models (user-item):
we first need to convert the graph-based item features into dense vectors (dimensionality reduction), and then concatenate these vectors to the embeddings
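A sketch of that conversion, using a truncated SVD as the dimension-reduction step (the role TruncatedSVD plays in scikit-learn); the genre columns are illustrative:

```python
import numpy as np

# Graph-based item features (genres) as a one-hot matrix:
# columns = [Drama, Action, Romance]. In practice this would be sparse.
G = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
], dtype=float)

# Reduce the one-hot columns to dense d-dimensional vectors that can
# be concatenated to the item embeddings.
d = 2
U, s, Vt = np.linalg.svd(G, full_matrices=False)
dense = U[:, :d] * s[:d]   # (n_items, d) dense item-feature vectors
print(dense.shape)  # (4, 2)
```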
39. Items & Users Features – Deep Content Extraction
1. Quantitative Features
2. Knowledge Graph
3. Deep Content Extraction
40. Items & Users Features – Deep Content Extraction
An item is more than its available metadata.
Encode information from:
● Images (CNN)
● Text (NLP)
● Audio (LSTM)
[Figure: example movie synopses with genres predicted from the text, e.g. a documentary about the co-production of a children’s television program in Bangladesh, Kosovo, and South Africa → Documentary, History; the synopsis of young Babar, King of the Elephants → Comedy, Adventure, Family, Animation]
41. Items & Users Features – Deep Content Extraction – Images
Pre-trained Convolutional Neural Networks are widely available:
● ResNet50
● VGG16
● AlexNet
42. Items & Users Features – Deep Content Extraction – Text
Pre-trained NLP models are widely available:
● Word2vec, GloVe, FastText
● SkipThought
● Universal Sentence Encoder
● ELMo
Note: complex pre-trained models like Bi-LSTMs do not work well across domains
44. Data Preprocessing
Goal
Given a (sparse) matrix of item features I (n-items, n-entities), find the best matrix W (n-entities, d) so that IW is a dense matrix (n-items, d) that can be concatenated to the item embeddings.
Unsupervised vs Supervised
We say “supervised dimensionality reduction” when we use the ratings.
Supervised works better if the items with ratings are aligned with the items with features.
Unsupervised works better if you have many more items with features than items with ratings.
46. Data Preprocessing – PCA
PCA (Principal Component Analysis) is a well-known technique for feature extraction.
PCA projects the data into a new feature space with fewer dimensions than the original one, while retaining the most relevant information.
Feature space of dimension 3 → feature space of dimension 2
● PCA reduces the dimension of the input data by keeping the dimensions with the highest variance.
● PCA can also be applied to sparse data.
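A from-scratch PCA sketch on synthetic 3-D data whose third axis carries almost no variance, projecting down to 2 dimensions:

```python
import numpy as np

# Minimal PCA: project 3-D points onto their top-2 principal components
# (the directions of highest variance).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] *= 0.01                      # third axis is nearly constant

Xc = X - X.mean(axis=0)              # PCA requires centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                   # feature space of dimension 2

# The retained variance should be almost all of the total.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(f"variance retained: {explained:.4f}")
```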
47. Data Preprocessing – Unsupervised Random Projection
Random Projection (RP) is another technique for dimensionality reduction.
We multiply I by a random matrix T, and verify that the distance between any two points is preserved after the transformation, within a certain error.
Advantages
● RP is computationally more efficient than PCA
● It is useful in very high-dimensional scenarios
Disadvantage
● PCA is the optimal linear projection from a space of dimension d to a space of dimension d’ (d >= d’)
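The distance-preservation claim (the Johnson–Lindenstrauss property) is easy to check empirically; a sketch with a Gaussian random matrix and made-up sizes:

```python
import numpy as np

# Unsupervised random projection: multiply the data by a random matrix T
# and check that pairwise distances survive within a small error.
rng = np.random.default_rng(42)
n, d, d_new = 50, 2000, 1000
I = rng.normal(size=(n, d))

# Gaussian random matrix, scaled so expected norms are preserved.
T = rng.normal(size=(d, d_new)) / np.sqrt(d_new)
P = I @ T                            # projected points

# Compare one pairwise distance before and after the projection.
before = np.linalg.norm(I[0] - I[1])
after = np.linalg.norm(P[0] - P[1])
print(f"relative error: {abs(after - before) / before:.3f}")
```

Note that `T` is generated once with no training at all, which is why RP is so much cheaper than PCA at high dimension.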
48. Data Preprocessing – Unsupervised Deep Learning
Graph Embedding Algorithms
● Node2vec
● DeepWalk
● LINE
Not often used, so there are no robust tools; they all live on GitHub in Python/C++.
Theoretical Remarks
They actually converge to a matrix factorization of a Laplacian-like normalization of the graph, but may be more flexible and memory-friendly.
50. Data Preprocessing – Supervised Linear Dimensionality Reduction
Given R (n-users, n-items) sparse and I (n-items, n-entities) sparse, find the best matrix W (n-entities, d) to learn R with a linear model:
✓ Works for both dense and sparse features
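The slide's linear model did not survive extraction, so the following is only one plausible instantiation, not necessarily the talk's exact formulation: least-squares fit a linear map from item features to ratings, then truncate it by SVD to get a rank-d W. All sizes are made up.

```python
import numpy as np

# Supervised linear dimensionality reduction, sketched: find W
# (n_entities, d) such that the dense features I @ W are predictive of R.
rng = np.random.default_rng(1)
n_users, n_items, n_entities, d = 30, 20, 50, 4
I = rng.random((n_items, n_entities))        # item features (dense toy stand-in)
R = rng.random((n_users, n_items))           # rating matrix

# Step 1: fit the full linear map R.T ~ I @ M by least squares.
M, *_ = np.linalg.lstsq(I, R.T, rcond=None)  # (n_entities, n_users)
# Step 2: keep only the top-d directions of that map.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :d] * s[:d]                         # (n_entities, d)
print((I @ W).shape)  # (20, 4)
```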
51. Data Preprocessing – Deep Learning Dimensionality Reduction
Directly add the Knowledge Graph as part of the training data (not pre-processing anymore)
Learn embeddings for user, item, user-entities, item-entities together
54. The Right Dataset – Summary
Data > Pre-processing > Model
● The rating graph needs to be as dense and connected as possible
● Explicit feedback is better than implicit feedback, if you can get it
● The type of the ratings (binary vs continuous) affects how you train your models
● Having negative feedback is important
● Context adds information
● User features and item features add information, but require heavy pre-processing