The third meetup covers evaluating models for our recommender system. We will review the strategies we have to check whether a model is underfitting or overfitting. Then we will present and analyze the loss functions typically used to train recommendation models, comparing regression, classification, and ranking-based losses and when each one is convenient. Finally, we will cover the metrics typically used to evaluate the performance of recommendation systems and how to test that the models are giving good results in production.
7. Previous Meetup Recap: Recommendation Engine Models
|                  | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning  |
| Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets |
| Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear     |
| Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++             |
| Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit    |
| Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many           |
| Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries |
9. Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
● R: sparse matrix of ratings with m users and n items
● U: dense matrix of user embeddings
● I: dense matrix of item embeddings
10. Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
● R: sparse matrix of ratings with m users and n items (row u holds the ratings of user #u, for items #1 to #n)
● U: dense matrix of user embeddings (row u is the embedding of user #u)
● I: dense matrix of item embeddings (row i is the embedding of item #i)
11. Optimization Problem – Matrix Factorization Example
Optimization problem (definitions)
● R: sparse matrix of ratings with m users and n items — the AVAILABLE DATASET
● U: dense matrix of user embeddings — unknown (“?”)
● I: dense matrix of item embeddings — unknown (“?”)
12. Optimization Problem – Matrix Factorization Example
Our goal is to find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal:

min over U, I of Σ over observed (u, i) of (R[u, i] − U[u] · I[i])², with predictions R̂ = U · Iᵀ (an approximation of R)
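The objective above can be sketched in a few lines of numpy. This is a minimal illustration, not the full training procedure: the ratings matrix, sizes, and the `mf_loss` helper are all hypothetical, with 0 standing in for missing entries.

```python
import numpy as np

# Hypothetical tiny example: R is a sparse ratings matrix (0 = missing),
# U and I are dense user/item embedding matrices.
rng = np.random.default_rng(0)
m, n, k = 4, 5, 2          # m users, n items, k embedding dimensions
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 0],
              [0, 1, 5, 4, 0]], dtype=float)
U = rng.normal(size=(m, k))
I = rng.normal(size=(n, k))

def mf_loss(R, U, I):
    """Sum of squared errors between observed ratings and U @ I.T."""
    mask = R != 0                      # only observed entries count
    R_hat = U @ I.T                    # predicted ratings
    return np.sum((R[mask] - R_hat[mask]) ** 2)

print(mf_loss(R, U, I))
```

Minimizing this quantity over U and I is exactly the optimization problem the next slides set up.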
(1) What type of data do we have?
- Exact rating vs like/dislike vs ranking predictions
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
14. Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
Business decisions
Technical decisions
15. Ask the Right Questions
(1) What type of data do we have?
(2) What properties are we looking for in our outputs?
(3) How are we going to solve the problem?
(4) Which hyper-parameters of my model are the best?
(5) Which model is the best?
EVALUATION FUNCTIONS
LOSS FUNCTIONS
RANDOM SEARCH, GP
COMPARE METRICS
ML FOR RECOMMENDATION
Business decisions
Technical decisions
16. Objectives Types (from data point of view)
Classification
● click/no-click
● like/dislike/missing
● estimated probability of like (e.g. watch time)
Regression
● absolute rating (e.g. from 1/5 to 5/5)
● number of interactions
Ranking
● estimated order of preference (e.g. watch time)
● pairwise comparisons
Unsupervised
● clustering of items
● clustering of users
17. Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter?
Sensitivity vs Specificity
Is a false positive worse than a false negative?
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like?
18. Choosing the Right Objective (from business point of view)
Absolute Predictions vs Relative Predictions
Does only the order of the predictions matter? → RANKING LOSS FUNCTION
Sensitivity vs Specificity
Is a false positive worse than a false negative? → CLASSIFICATION LOSS FUNCTION
Skewness
Is misclassifying an all-star favorite worse than misclassifying a casual like? → A LOSS FUNCTION THAT PENALIZES ERRORS ON ALL-STAR RATINGS MORE
21. Cross Validation – In Recommendation Engines
Split such that every user is present in both train and validation
Stronger: split so that each user’s ratings are divided 80/20 between train and validation
[Figure: dataset split per user into train and validation]
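The per-user 80/20 split above can be sketched in numpy. The interaction log here is synthetic and the variable names are illustrative; the point is that the split loops over users, not over rows globally.

```python
import numpy as np

# Hypothetical interaction log: parallel arrays of (user, item, rating).
rng = np.random.default_rng(42)
users = np.repeat(np.arange(10), 20)            # 10 users, 20 ratings each
items = rng.integers(0, 50, size=users.size)
ratings = rng.integers(1, 6, size=users.size)

train_idx, valid_idx = [], []
for u in np.unique(users):
    idx = np.where(users == u)[0]
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))                   # 80/20 split per user
    train_idx.extend(idx[:cut])
    valid_idx.extend(idx[cut:])

# Every user ends up in both train and validation.
print(set(users[train_idx]) == set(users[valid_idx]) == set(users))  # True
```

A naive global shuffle would occasionally leave a user entirely out of one side, which makes that user's validation error meaningless.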
23. Underfitting and Overfitting
Underfit: the model fails to learn the relations in the data
Good fit: the model captures the relations and generalizes to new samples
Overfit: the model fails to generalize to new samples
[Figure: three models of increasing complexity (left to right), each evaluated on new samples]
26. Underfitting and Overfitting
[Figure: loss function or metric vs epoch]
Mini-Batch Gradient Descent
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
● plot error vs epoch
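The loop above can be made runnable for the matrix-factorization squared-error objective. This is a toy sketch: the ratings matrix is synthetic and fully observed, and the learning rate, batch size, and epoch count are illustrative.

```python
import numpy as np

# Runnable sketch of mini-batch gradient descent for matrix factorization.
rng = np.random.default_rng(0)
m, n, k = 20, 30, 4
R = rng.normal(size=(m, k)) @ rng.normal(size=(n, k)).T   # toy dense ratings
pairs = np.array([(u, i) for u in range(m) for i in range(n)])

U = 0.1 * rng.normal(size=(m, k))
I = 0.1 * rng.normal(size=(n, k))
lr, batch_size, n_epochs = 0.02, 64, 30
history = []
for epoch in range(n_epochs):
    rng.shuffle(pairs)                                 # shuffle the batches
    for start in range(0, len(pairs), batch_size):
        u_idx, i_idx = pairs[start:start + batch_size].T
        pred = np.sum(U[u_idx] * I[i_idx], axis=1)     # predictions for the batch
        err = pred - R[u_idx, i_idx]                   # error for the batch
        np.add.at(U, u_idx, -lr * err[:, None] * I[i_idx])  # gradient update
        np.add.at(I, i_idx, -lr * err[:, None] * U[u_idx])
    history.append(np.mean((U @ I.T - R) ** 2))        # error vs epoch

print(history[0], history[-1])
```

Plotting `history` against the epoch number gives exactly the learning curve used to diagnose underfitting vs overfitting.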
27. Underfitting and Overfitting
A very simple way of checking underfitting
Plot the model predictions (predicted Y) against the ground truth Y. If the model predicts (almost) always the same value, it is underfitting.
[Figure: predicted Y vs ground truth Y — a flat band of predictions signals underfitting]
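The same check can be automated by comparing the spread of the predictions to the spread of the targets. The helper name and the 10% threshold are hypothetical; tune the ratio to your data.

```python
import numpy as np

def underfit_warning(y_true, y_pred, ratio=0.1):
    """Flag models whose predictions vary much less than the ground truth."""
    return np.std(y_pred) < ratio * np.std(y_true)

y = np.array([1., 2., 5., 4., 3., 5., 1.])
constant_model = np.full_like(y, y.mean())      # always predicts the mean
good_model = y + 0.1                            # tracks the target closely

print(underfit_warning(y, constant_model))  # True: predictions are constant
print(underfit_warning(y, good_model))      # False
```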
29. What do we want to evaluate?
Classification
● True Positive Rate (TPR)
● True Negative Rate (TNR)
● Precision
● F-measure
Regression
● Mean Square Error (MSE)
Ranking
● Recall@K
● Precision@K
● CG, DCG, nDCG
Ranking/Classification metrics
● AUC
Some common evaluation functions
30. Regression
Mean Square Error (MSE)
● Easy to compute
● Linear gradient
● Can also be used as loss function
Mean Absolute Error (MAE)
● Easy to compute
● Easy to interpret
● Discontinuous gradient (not differentiable at zero)
● Harder to use directly as a loss function
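Both regression metrics are one-liners in numpy; the example data below is made up.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Square Error: average of squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute differences."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([4.0, 3.0, 5.0, 1.0])
y_pred = np.array([3.5, 3.0, 4.0, 2.0])
print(mse(y_true, y_pred))   # 0.5625
print(mae(y_true, y_pred))   # 0.625
```

MAE stays in the units of the rating scale, which is why it is the easier of the two to interpret.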
31. Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to train your dragon
A = Annabelle
Model 1 Model 2
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
User’s likes
User’s dislikes
Model recommendations
TS1
TS2
TS3
KP1
KP2
TS4
KP3
A1
A2
32. Classification – Precision vs Recall
TS = Toy Story
KP = Kung Fu Panda
TD = How to Train Your Dragon
A = Annabelle
User’s likes: TS1, TS2, TS3, TS4, KP1, KP2, KP3
User’s dislikes: A1, A2
Model 1 (recommends only liked movies): Recall = 5/7, Precision = 5/5 = 1
Model 2 (recommends all nine movies): Recall = 7/7 = 1, Precision = 7/9
33. Classification 1/2
True Positive Rate (a.k.a TPR, Recall, Sensitivity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TPR <=1 (higher is better)
True Negative Rate (a.k.a TNR, Selectivity, Specificity)
● Easy to understand
● Useful for likes/dislikes datasets
● Measure of global bias of a model
● 0 <= TNR <=1 (higher is better)
34. Classification 2/2
Precision
● Easy to understand
● Useful for likes/dislikes datasets
● Measures the quality of the recommendations
● 0 <= Precision <=1 (higher is better)
F-measure
● Balance precision and recall
● Not good for recommendation, because
doesn’t take into account True Negatives
● 0 <= F-measure <= 1 (higher is better)
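The four classification metrics above all come from the confusion-matrix counts. A minimal sketch (the label encoding 1 = like, 0 = dislike and the sample arrays are assumptions):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """y_true / y_pred are binary arrays (1 = like, 0 = dislike)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tpr = tp / (tp + fn)                   # recall / sensitivity
    tnr = tn / (tn + fp)                   # specificity
    precision = tp / (tp + fp)
    f1 = 2 * precision * tpr / (precision + tpr)
    return tpr, tnr, precision, f1

y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))
```

Note that the F-measure uses only TP, FP, and FN, which is exactly why it ignores True Negatives.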
35. Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the number of positive items of each user
● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie | True rating | Prediction
Toy Story 1 | 1.0 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Total Positive = ?
Recall@K = ?
36. Ranking 1/3
Recall@K
● Count the positive items among the top K items predicted for each user
● Divide that number by the number of positive items of each user
● A perfect score is 1 if the user has K or fewer positive items and they all appear in the predicted top K
● Independent of the exact values of the predictions, only their relative rank matters
Movie | True rating | Prediction
Toy Story 1 | 1.0 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Total Positive = 4
Recall@K = 2 / 4
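The worked example above can be reproduced in numpy. The positivity rule (true rating > 0) is an assumption of this sketch.

```python
import numpy as np

def recall_at_k(true_ratings, predictions, k):
    """Positive = true rating > 0. Only the rank of the predictions matters."""
    top_k = np.argsort(predictions)[::-1][:k]        # indices of top-K predictions
    positives = true_ratings > 0
    return np.sum(positives[top_k]) / np.sum(positives)

# The slide's example: TS1, TS2, KP1, KP2, A1.
true_ratings = np.array([1.0, 0.9, 0.7, 0.6, -0.2])
predictions  = np.array([0.9, 0.7, 0.1, -0.1, 0.4])
print(recall_at_k(true_ratings, predictions, k=3))   # 0.5 (= 2/4)
```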
38. Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score is 1 if the user has K or more positive items and the top K only contains positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie | True rating | Prediction
Toy Story 1 | 1.0 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = ?
TOP K Positive = ?
Precision@K = ?
39. Ranking 2/3
Precision@K
● Count the positive items among the top K items predicted for each user
● Divide that number by K for each user
● A perfect score is 1 if the user has K or more positive items and the top K only contains positives
● Independent of the exact values of the predictions, only their relative rank matters
Movie | True rating | Prediction
Toy Story 1 | 1.0 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = {TS1, TS2, A1}
TOP K Positive = {TS1, TS2} = 2
Precision@K = 2 / 3
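Precision@K differs from Recall@K only in the denominator. A sketch on the same example (positivity rule again assumed to be true rating > 0):

```python
import numpy as np

def precision_at_k(true_ratings, predictions, k):
    """Positive = true rating > 0. Only the rank of the predictions matters."""
    top_k = np.argsort(predictions)[::-1][:k]
    return np.sum(true_ratings[top_k] > 0) / k       # divide by K, not by #positives

true_ratings = np.array([1.0, 0.9, 0.7, 0.6, -0.2])
predictions  = np.array([0.9, 0.7, 0.1, -0.1, 0.4])
print(precision_at_k(true_ratings, predictions, k=3))  # 0.666... (= 2/3)
```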
41. Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie | True rating | Prediction
Toy Story 1 | 1 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = ?
CG = ?
DCG = ?
42. Ranking 3/3
CG, DCG, and nDCG
● CG: Sum the true ratings of the Top K items predicted for each user
● DCG: Weight by position in Top K; nDCG: Normalize in [0, 1]
● A perfect nDCG is 1 if the ranking of the prediction is the same as the ranking of the true ratings
● The bigger the score the better
Movie | True rating | Prediction
Toy Story 1 | 1 | 0.9
Toy Story 2 | 0.9 | 0.7
Kung Fu Panda 1 | 0.7 | 0.1
Kung Fu Panda 2 | 0.6 | -0.1
Annabelle 1 | -0.2 | 0.4
K = 3
TOP K = {TS1, TS2, A1}
CG = 1.0 + 0.9 - 0.2 = 1.7
DCG = 1/1 + 0.9/2 - 0.2/3 ≈ 1.38
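The CG and DCG numbers above can be verified in numpy. Note the slide discounts by 1/position; the more common log2-based discount is an alternative, and this sketch follows the slide's convention.

```python
import numpy as np

def cg_dcg(true_ratings, predictions, k):
    """CG sums the true ratings of the top-K predictions;
    DCG discounts them by 1/position (the slide's convention)."""
    top_k = np.argsort(predictions)[::-1][:k]
    gains = true_ratings[top_k]
    cg = np.sum(gains)
    dcg = np.sum(gains / np.arange(1, k + 1))
    return cg, dcg

true_ratings = np.array([1.0, 0.9, 0.7, 0.6, -0.2])
predictions  = np.array([0.9, 0.7, 0.1, -0.1, 0.4])
cg, dcg = cg_dcg(true_ratings, predictions, k=3)
print(cg)    # 1.7   (= 1.0 + 0.9 - 0.2)
print(dcg)   # ≈ 1.383 (= 1/1 + 0.9/2 - 0.2/3)
```

nDCG is then DCG divided by the DCG of the ideal ordering (sorting by the true ratings themselves), which maps it into [0, 1].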
44. Hybrid Ranking/Classification
AUC
● Vary positive prediction threshold (not just 0)
● Compute TPR and FPR for all possible positive thresholds
● Build Receiver Operating Characteristic (ROC) curve
● Integrate Area Under the ROC Curve (AUC)
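Instead of literally sweeping thresholds and integrating the ROC curve, the same area can be computed from its probabilistic interpretation: the probability that a random positive is scored above a random negative. A small sketch (toy labels and scores):

```python
import numpy as np

def auc(y_true, scores):
    """AUC = P(score of a random positive > score of a random negative),
    which equals the area under the ROC curve."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([1, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.3, 0.5, 0.2])
print(auc(y_true, scores))  # 0.833...: 5 of 6 positive/negative pairs ranked correctly
```

Like the ranking metrics, AUC depends only on the ordering of the scores, which is what makes it a hybrid ranking/classification metric.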
46. Loss Functions vs Evaluation Functions
Evaluation Metrics
● Expensive to evaluate
● Often not smooth
● Often not even differentiable
Loss Functions
● Smooth approximations of your evaluation metric
● Well suited for SGD
47. Loss Functions: How are we going to solve the problem?
Classification loss
● Logistic
● Cross Entropy
● Kullback-Leibler Divergence
Regression loss
● Mean Square Error (MSE)
Ranking loss
● WARP
● BPR
Some common loss functions
48. Optimization Problems – Basic Formulation with RMSE
Goal: find U and I such that the difference between each datapoint in R and the product of the corresponding user and item embeddings is minimal:
min over U, I of Σ over observed (u, i) of (R[u, i] − U[u] · I[i])², with predictions R̂ = U · Iᵀ (an approximation of R)
49. Optimization Problems – General Formulation
Goal: find U and I such that the loss function J is minimized:
min over U, I of J(R, R̂), with R̂ = U · Iᵀ (an approximation of R)
53. Loss Functions – Regression
Mean Square Error
● Typically used loss function for regression. It’s a smooth function and easy to understand.
Regularized Mean Square Error
● Mean square error plus regularization to avoid overfitting.
54. Loss Functions – Classification
Logistic
● Typically used loss function for classification. Smooth gradient around zero and steep for large errors.
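A numerically stable sketch of the logistic loss on raw scores (the helper name and the toy labels/logits are assumptions):

```python
import numpy as np

def logistic_loss(y_true, logits):
    """y_true in {0, 1}; logits are raw model scores.
    Stable form: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.mean(np.maximum(logits, 0) - y_true * logits
                   + np.log1p(np.exp(-np.abs(logits))))

# Near zero for confident correct predictions, steep for confident wrong ones.
print(logistic_loss(np.array([1, 0]), np.array([5.0, -5.0])))   # small
print(logistic_loss(np.array([1, 0]), np.array([-5.0, 5.0])))   # large
```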
58. Practical Recommendations
(1) Always compute baseline metrics
(2) Always analyze underfitting vs overfitting
(3) Always do hyperparameter optimization
(4) Always compute multiple metrics for your models
(5) Always analyze the clustering properties of the items/users
(6) Always ask feedback from end users
59. Practical Recommendations
(1) Always compute baseline metrics → COMPARING WITH GLOBAL MODELS IS EASY
(2) Always analyze underfitting vs overfitting → IF OVERFITTING, USE REGULARIZATION
(3) Always do hyperparameter optimization → GRID SEARCH OR GAUSSIAN PROCESS
(4) Always compute multiple metrics for your models → TPR, TNR, PRECISION, ETC.
(5) Always analyze the clustering properties of the items/users → ITEM/ITEM SIMILARITIES
(6) Always ask for feedback from end users → EVERYTHING IS ABOUT USER TASTE
60. (1) Always compute baseline metrics
|                  | Global Avg | User Avg | Item-Item      | Linear      | Linear + Reg    | Matrix Fact    | Deep Learning  |
| Domains          | Baseline   | Baseline | users >> items | Known “I”   | Known “I”       | Unknown “I”    | Extra datasets |
| Model Complexity | Trivial    | Trivial  | Simple         | Linear      | Linear          | Linear         | Non-linear     |
| Time Complexity  | +          | +        | +++            | ++++        | ++++            | ++++           | ++             |
| Overfit/Underfit | Underfit   | Underfit | May Underfit   | May Overfit | May Perform Bad | May Overfit    | Can Overfit    |
| Hyper-Params     | 0          | 0        | 0              | 1           | 2               | 2–3            | many           |
| Implementation   | Numpy      | Numpy    | Numpy          | Numpy       | Numpy           | LightFM, Spark | NNet libraries |
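The two trivial baselines from the first columns are a few lines of numpy each. The ratings matrix below is made up; 0 marks a missing rating.

```python
import numpy as np

# Global-average and per-user-average baseline predictors.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]], dtype=float)     # 0 = missing rating
mask = R != 0

global_avg = R[mask].mean()                              # one number for everything
user_avg = np.array([row[row != 0].mean() for row in R]) # one number per user

# Baseline predictions for every (user, item) pair:
pred_global = np.full_like(R, global_avg)
pred_user = np.repeat(user_avg[:, None], R.shape[1], axis=1)
print(global_avg, user_avg)
```

Any more complex model should beat these baselines on the evaluation metrics; if it doesn't, something is wrong.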
61. (2) Always analyze underfitting vs overfitting
Model-based
● Dropout
● Bagging
Loss-based regularization
● ℓ1 norm: the best convex approximation of the sparsity-inducing ℓ0 norm
● ℓ2 norm: very smooth, easy to optimize
Data Augmentation
● Negative Sampling
62. (3) Always do hyperparameter optimization
Grid Search
Brute force over all the combinations of the parameters
Exponential cost: for 20 parameters, to get only 10 evaluations each, you need 10^20 complete runs
Random Search
Uniformly sample combinations of the parameters
Very easy to implement, very useful in practice
Gaussian Process Optimization
Meta-learns the validation error as a function of the hyper-parameters
Balances the exploration/exploitation tradeoff
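Random search is a few lines. The `validation_error` function below is a made-up stand-in for a full train-and-validate run, and the log-uniform sampling ranges are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_error(lr, reg):
    """Stand-in for a real train + validate run; this surface is made up."""
    return (np.log10(lr) + 2) ** 2 + (np.log10(reg) + 3) ** 2

best_err, best_params = np.inf, None
for _ in range(50):                        # 50 uniformly sampled combinations
    lr = 10 ** rng.uniform(-5, 0)          # log-uniform ranges are usual for rates
    reg = 10 ** rng.uniform(-6, 0)
    err = validation_error(lr, reg)
    if err < best_err:
        best_err, best_params = err, {"lr": lr, "reg": reg}

print(best_err, best_params)
```

The same 50 evaluations spent on a grid would probe only a handful of distinct values per parameter, which is why random search tends to win in practice.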
63. (3) Always do hyperparameter optimization
[Figure: hyper-parameter search results for a metric to minimize and a metric to maximize]
65. (5) Always analyze the clustering properties of the items/users
Item embeddings
● In general, we combine item embeddings with: FEATURES | IMAGE EMBS | NLP EMBS
● After obtaining the embeddings, we always compute Top-K similarities for well-known items
● We use the item embeddings to create clusters and analyze how good they are
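Top-K similarity lookup over item embeddings can be sketched with cosine similarity (the embedding matrix below is random and the helper name is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
item_emb = rng.normal(size=(100, 16))            # hypothetical item embeddings

def top_k_similar(item_emb, item_id, k=5):
    """Cosine similarity of one item's embedding against all others."""
    normed = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    sims = normed @ normed[item_id]
    order = np.argsort(sims)[::-1]
    return order[order != item_id][:k]           # drop the query item itself

print(top_k_similar(item_emb, item_id=0))
```

Sanity-checking these neighbor lists for well-known items (does Toy Story land near Kung Fu Panda or near Annabelle?) is a quick qualitative test of the embeddings.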
66. (6) Always ask for end-user feedback
RECOMMENDATION IS ALL ABOUT USER TASTE
ASK USERS FOR FEEDBACK!
71. Negative Sampling
Problem
● Unary feedback: the best model will always predict “1” for every user and item.
● In general:
○ your model is used in real life to predict (user, item) pairs outside the sparse dataset.
○ you can’t train on the full (#users × #items) dense matrix.
Negative Sampling Solution
● turns unary feedback into binary (e.g. click/missing) and binary feedback into ternary (e.g. like/dislike/missing)
● the sampling strategy matters a lot (i.e. how to split train and valid)
● the number of negative samples matters a lot
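A uniform negative-sampling sketch: for each observed click, draw unobserved (user, item) pairs as negatives. The click log is synthetic, the helper name is hypothetical, and a 1:1 ratio is just one common choice.

```python
import numpy as np

rng = np.random.default_rng(7)

# Unary feedback: a set of observed (user, item) clicks.
n_users, n_items = 50, 200
clicks = {(u, int(i)) for u in range(n_users)
          for i in rng.choice(n_items, size=5, replace=False)}

def sample_negatives(clicks, n_users, n_items, ratio=1):
    """Draw `ratio` unobserved (user, item) pairs per positive as negatives."""
    negatives = set()
    while len(negatives) < ratio * len(clicks):
        pair = (int(rng.integers(n_users)), int(rng.integers(n_items)))
        if pair not in clicks:             # never sample an observed positive
            negatives.add(pair)
    return negatives

negs = sample_negatives(clicks, n_users, n_items)
print(len(clicks), len(negs))              # same number of positives and negatives
```

Uniform sampling is the simplest strategy; popularity-weighted sampling is a common alternative, and as the slide notes, both the strategy and the ratio strongly affect the trained model.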
74. Underfitting and Overfitting – Take Home
(1) For cross-validation, split the data so that (almost) all users appear in both training and validation
(2) Use negative sampling to avoid overfitting in your models
(3) Always use learning curves to get more insights about underfitting vs overfitting
(4) Compute mean and variance of your predictions to get insights about underfitting vs overfitting
75. Loss Functions – Classification
● Equivalent to cross-entropy between the truth and the predicted probability (for a two-class model)
● Equivalent to Kullback-Leibler divergence between the truth and the predicted probability
● Often used for deep-learning based recommendation engines
● Smooth gradient around zero and steep for large errors
Logistic