This second meetup will be about training different models for our recommender system. We will review the simple models we can build as a baseline. After that, we will present the recommender system as an optimization problem and discuss different training losses. We will mention linear models and matrix factorization techniques. We will end the presentation with a simple introduction to non-linear models and deep learning.
2. Recommender Systems from A to Z
Part 1: The Right Dataset
Part 2: Model Training
Part 3: Model Evaluation
Part 4: Real-Time Deployment
4. 1. Introduction
Optimization problem, linear regression and Stochastic Gradient Descent (SGD)
2. Baseline models
Global average, user average and item-item models
3. Basic linear models
Least Squares (LS)
Regularized Least Squares (RLS)
4. Matrix factorization
Matrix Factorization: analytical and numerical solutions
5. Non-linear models
Basic and complex Deep Learning models
7. Model training – Introduction
Explicit vs Implicit feedback
Explicit feedback
(users’ ratings)
Implicit feedback
(users’ clicks)
● Explicit feedback: example domains are movies, TV shows and music; data types are like/dislike and star ratings; the data is clean, costly and easy to interpret.
● Implicit feedback: example domains are marketplaces and businesses; data types are clicks, play-time and purchases; the data is dirty, cheap and difficult to interpret.
9. Model training – Introduction
Recommendation engine types
[Diagram] Recommendation engine
● Content-based
● Collaborative-filtering
○ Memory-based: Item-Item, User-User
○ Model-based: User-Item
● Hybrid engine
● Content-based: use for item cold start; solution strategies: Least Squares, Deep Learning
● Item-Item: use when n_users >> n_items; solution strategy: affinity matrix
● User-User: use when n_users << n_items; solution strategies: KNN, affinity matrix
● User-Item: use for better performance; solution strategies: Matrix Factorization, Deep Learning
14. Model training – Introduction - Optimization
Optimization problem (definitions)
● R: sparse matrix of ratings, with m users and n items (the available dataset; each row holds the ratings of one user)
● U: dense matrix of user embeddings (unknown; one embedding per user)
● I: dense matrix of item embeddings (unknown; one embedding per item)
15. Model training – Introduction - Optimization
Optimization problem (basic formulation with RMSE)
Our goal is to find U and I such that the difference between each observed rating in R and the product of the corresponding user and item embeddings is minimal:
\min_{U, I} \sqrt{ \frac{1}{N} \sum_{(u,i) \in \text{observed}} (r_{ui} - U_u \cdot I_i)^2 }
The product U \cdot I gives the predicted ratings matrix R̂ (an approximation of R).
18. Model training – Introduction - Optimization
Optimization problem (more complex formulation)
Content-based: \min_U \| R - U \cdot I \|_F^2 , where I (the item features) is the available data
Content-based with Regularization: \min_U \| R - U \cdot I \|_F^2 + \lambda \| U \|_F^2 , where the regularization term helps avoid overfitting
Take home
● In content-based models we already know I (the item features)
● We can find a linear solution to this problem using Least Squares
21. Model training – Introduction - Optimization
Optimization problem (more complex formulation)
Collaborative-filtering: \min_{U, I} \sum_{(u,i) \in \text{observed}} (r_{ui} - U_u \cdot I_i)^2 , where the observed ratings in R are the only available data
Collaborative-filtering with Regularization: add \lambda (\| U \|_F^2 + \| I \|_F^2) to the objective to avoid overfitting
Take home
● In collaborative-filtering we want to find both U and I (the user and item embeddings)
● We can find a linear solution to this problem using Matrix Factorization and SGD
23. Model training – Introduction - Optimization
How to analytically solve an optimization problem?
Let's start with the simplest optimization problem: linear regression without regularization.
With X \in \mathbb{R}^{m \times n}, y \in \mathbb{R}^m and m > n, we want to find W such that: \min_W \| X W - y \|^2
Add a column of ones to X to support the bias term w_0; the targets in y (and the predictions X W) are scalar numbers.
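A minimal numpy sketch of this analytical (normal equations) solution on random data; the sizes and variable names are illustrative assumptions:

import numpy as np

# Toy data: m samples, n features (illustrative sizes, m > n)
m, n = 200, 5
X = np.random.rand(m, n)
X = np.hstack([np.ones((m, 1)), X])   # add a column of ones to support w0
y = np.random.rand(m)

# Normal equations: (X.T @ X) W = X.T @ y  (never invert explicitly)
W = np.linalg.solve(X.T @ X, X.T @ y)
predictions = X @ W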
25. Model training – Introduction - Optimization
How to numerically solve an optimization problem?
Gradient descent: start with random values for W and move in the opposite direction of the gradient of the cost J(W). Stochastic gradient descent computes that gradient by taking just one sample at a time.
[Figure: the cost J(w) decreasing as w moves along the negative gradient]
26. Model training – Introduction - Optimization
Gradient Descent algorithm
for epoch in n_epochs:
● compute the predictions for all the samples
● compute the error between truth and predictions
● compute the gradient using all the samples
● update the parameters of the model

Stochastic Gradient Descent algorithm
for epoch in n_epochs:
● shuffle the samples
● for sample in n_samples:
○ compute the predictions for the sample
○ compute the error between truth and predictions
○ compute the gradient using the sample
○ update the parameters of the model

Mini-Batch Gradient Descent algorithm
for epoch in n_epochs:
● shuffle the batches
● for batch in n_batches:
○ compute the predictions for the batch
○ compute the error for the batch
○ compute the gradient for the batch
○ update the parameters of the model
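A minimal numpy sketch of mini-batch gradient descent on the toy linear regression above; the learning rate, batch size and epoch count are illustrative assumptions:

import numpy as np

m, n = 200, 5
X = np.hstack([np.ones((m, 1)), np.random.rand(m, n)])
y = np.random.rand(m)

W = np.zeros(n + 1)
lr, batch_size, n_epochs = 0.1, 32, 50
for epoch in range(n_epochs):
    idx = np.random.permutation(m)              # shuffle the samples
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        preds = X[batch] @ W                    # predictions for the batch
        error = preds - y[batch]                # error for the batch
        grad = X[batch].T @ error / batch.size  # gradient for the batch
        W -= lr * grad                          # update the parameters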
27. Model training – Introduction - Optimization
Gradient Descent comparison
● Gradient: GD uses all the samples; SGD uses one sample; Mini-Batch GD uses one batch
● Speed: GD very fast (vectorized); SGD slow (computes sample by sample); Mini-Batch GD fast (vectorized)
● Memory: GD O(dataset); SGD O(1); Mini-Batch GD O(batch)
● Convergence: GD needs more epochs; SGD needs fewer epochs; Mini-Batch GD is a middle point between GD and SGD
● Gradient stability: GD makes smooth parameter updates; SGD makes noisy updates; Mini-Batch GD is a middle point between GD and SGD
29. Model training – Introduction - Optimization
A Problem with Implicit Feedback
With datasets that contain only unary positive feedback (e.g. click history), every observed interaction is positive, so there are no negative examples to learn from.
Negative Sampling
Common fix: add random (user, item) pairs with r = 0, drawn from a uniform distribution over the dataset's users and items.
30. Model training – Introduction - Optimization
Negative Sampling
Common fix: add random (user, item) pairs with rating = 0
● Expresses items that are "unknown" to the users
● Acts as a regularizer
● Also works for explicit feedback
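A minimal numpy sketch of uniform negative sampling; the 1:1 ratio of negatives to positives and the toy sizes are illustrative assumptions:

import numpy as np

n_users, n_items = 1000, 500
# Observed unary positive feedback (e.g. clicks), as (user, item) pairs
pos_users = np.random.randint(0, n_users, size=2000)
pos_items = np.random.randint(0, n_items, size=2000)

# Draw random (user, item) pairs from a uniform distribution and label them r = 0
neg_users = np.random.randint(0, n_users, size=pos_users.size)
neg_items = np.random.randint(0, n_items, size=pos_items.size)

users = np.concatenate([pos_users, neg_users])
items = np.concatenate([pos_items, neg_items])
ratings = np.concatenate([np.ones(pos_users.size), np.zeros(neg_users.size)])
# Note: a few sampled negatives may collide with true positives; this is usually ignored.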
32. Model Training – Baseline models
Introduction
● Before starting to train models, always compute a baseline
● Baselines are very useful to debug more complex models
● As a general rule:
○ Very basic models can't capture all the details of the training data and tend to underfit
○ Very complex models capture every detail of the training data and tend to overfit
● Note: throughout this presentation we use RMSE to compare model performance
33. Model Training – Baseline models
Global Average
Global average of the training ratings = 3.64
Prediction: every missing rating is predicted as the global average, 3.64
RMSE = sqrt( ((2 - 3.64)^2 + (1 - 3.64)^2 + …) / n ) = sqrt(4.13)
34. Model Training – Baseline models
Global average - Numpy code
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings matrix: 6 users x 6 items, stored as a sparse matrix
rows = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5])
cols = np.array([0, 1, 5, 3, 5, 0, 1, 2, 4, 0, 3, 5, 0, 2, 1, 3, 4])
data = np.array([2, 5, 4, 1, 5, 2, 4, 5, 4, 4, 5, 1, 5, 2, 1, 4, 2])
ratings = csr_matrix((data, (rows, cols)), shape=(6, 6))

# 80/20 train/validation split over the observed ratings
idx = np.random.permutation(data.size)
idx_train = idx[:int(idx.size * 0.8)]
idx_valid = idx[int(idx.size * 0.8):]

# Baseline: predict the global average of the training ratings
global_avg = data[idx_train].mean()
rmse = np.sqrt(((data[idx_valid] - global_avg) ** 2).mean())
35. Model Training – Baseline models
User average
Per-user averages on the training ratings: u1 = 4.50, u2 = 5.00, u3 = 3.67, u4 = 2.50, u5 = 5.00, u6 = 2.50
Prediction: every missing rating of user u is predicted as that user's average
RMSE = sqrt( ((2 - 4.5)^2 + (1 - 5.0)^2 + …) / n ) = sqrt(6.15)
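A minimal sketch of this baseline, reusing the toy variables (rows, data, ratings, idx_train, idx_valid) from the global-average snippet above:

user_sum = np.zeros(ratings.shape[0])
user_cnt = np.zeros(ratings.shape[0])
np.add.at(user_sum, rows[idx_train], data[idx_train])   # sum of each user's training ratings
np.add.at(user_cnt, rows[idx_train], 1)                 # count of each user's training ratings
# Fall back to the global average for users with no training ratings
user_avg = np.where(user_cnt > 0, user_sum / np.maximum(user_cnt, 1), data[idx_train].mean())
pred = user_avg[rows[idx_valid]]
rmse = np.sqrt(((data[idx_valid] - pred) ** 2).mean())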
40. Model Training – Basic linear models
Content Based - Standard Least Squares model
● Goal: very basic linear model
● Data: the matrix of item features I (may be sparse)
● Pre-processing: use PCA to reduce the dimension of I
● Solve: \min_U \| R - U \cdot I \|_F^2
● Solution is Least Squares: U^T = (I \cdot I^T)^{-1} \cdot I \cdot R^T
Never compute the inverse!
(1) Use numpy: numpy.linalg.solve(I @ I.T, I @ R.T)
(2) Use a Cholesky decomposition: (I · I^T) is a positive definite matrix!
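A minimal sketch of both options on random data; the shapes, names and toy sizes are illustrative assumptions:

import numpy as np
from scipy.linalg import cho_factor, cho_solve

m, n, k = 1000, 500, 20            # users, items, item features (toy sizes)
R = np.random.rand(m, n)           # ratings matrix (dense here for simplicity)
I = np.random.rand(k, n)           # item features, k x n

# Option 1: solve the normal equations directly
Ut = np.linalg.solve(I @ I.T, I @ R.T)     # k x m, so U = Ut.T

# Option 2: Cholesky factorization of the positive definite matrix I @ I.T
c, low = cho_factor(I @ I.T)
Ut_chol = cho_solve((c, low), I @ R.T)

U = Ut.T                                   # m x k user embeddings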
41. Model Training – Basic linear models
Content Based - Regularized Least Squares model
● Goal: avoid overfitting
● Method: Tikhonov Regularization (a.k.a. Ridge Regression)
● Solve: \min_U \| R - U \cdot I \|_F^2 + \lambda \| U \|_F^2
● Solution is Regularized Least Squares: U^T = (I \cdot I^T + \lambda \mathbb{1})^{-1} \cdot I \cdot R^T
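The same sketch with Tikhonov regularization; lam is an illustrative hyper-parameter and the variables are those of the previous snippet:

lam = 0.1
# Regularized normal equations: (I I^T + lam * Id) U^T = I R^T
Ut_reg = np.linalg.solve(I @ I.T + lam * np.eye(k), I @ R.T)
U_reg = Ut_reg.T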
43. Model Training – Matrix Factorization
Matrix Factorization
● If we don't know I, finding a linear solution to our problem requires Matrix Factorization techniques.
● Now we want to solve the following optimization problem: \min_{U, I} \sum_{(u,i) \in \text{observed}} (r_{ui} - U_u \cdot I_i)^2
SOLUTIONS
● Analytical: SVD
● Numerical: ALS, SGD
44. Model Training – Matrix Factorization
Matrix Factorization - Graphical interpretation
[Figure: R (m × n) is approximated by the product of U (m × k) and I (k × n), i.e. R̂ = U · I]
47. Model Training – Matrix Factorization
Analytical solution - Singular Value Decomposition (SVD)
● Optimal solution (it yields the best rank-k approximation of R)
● Closed Form, readily available in scikit-learn
● O(n^3) algorithm, does not scale
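A minimal numpy sketch of the rank-k approximation via SVD on a dense toy matrix (sizes are illustrative assumptions; scikit-learn's TruncatedSVD provides the same factorization and also accepts sparse matrices):

import numpy as np

m, n, k = 100, 80, 10
R = np.random.rand(m, n)
Um, s, Vt = np.linalg.svd(R, full_matrices=False)
U_emb = Um[:, :k] * s[:k]    # user embeddings, m x k
I_emb = Vt[:k, :]            # item embeddings, k x n
R_hat = U_emb @ I_emb        # best rank-k approximation of R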
48. Model Training – Matrix Factorization
Numerical solution - Alternating Least Squares (ALS)
Initialize: U and I with random values
Iterate: fix I and solve for U with least squares, then fix U and solve for I with least squares
● Solving least squares is easy
● Scales to big datasets
● Distributed implementations are available (e.g. on Spark)
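A minimal ALS sketch on a dense toy matrix; it assumes every rating is observed (real implementations only use the observed entries), and lam and k are illustrative hyper-parameters:

import numpy as np

m, n, k, lam = 100, 80, 10, 0.1
R = np.random.rand(m, n)
U = np.random.rand(m, k)
I = np.random.rand(k, n)
for _ in range(20):
    # Fix I, solve the regularized least squares for U
    U = np.linalg.solve(I @ I.T + lam * np.eye(k), I @ R.T).T
    # Fix U, solve the regularized least squares for I
    I = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R)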
49. Model Training – Matrix Factorization
Numerical solution - Stochastic Gradient Descent (SGD)
We are using SGD, so we process one observed rating r_{ui} at a time:
● error: e_{ui} = r_{ui} - U_u \cdot I_i
● update: U_u \leftarrow U_u + \eta (e_{ui} I_i - \lambda U_u)
● update: I_i \leftarrow I_i + \eta (e_{ui} U_u - \lambda I_i)
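A minimal SGD matrix-factorization sketch over (user, item, rating) triplets; the learning rate, regularization, embedding size and toy data are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 10
lr, lam, n_epochs = 0.01, 0.1, 20
# Toy observed ratings as (user, item, rating) triplets
users = rng.integers(0, m, size=1000)
items = rng.integers(0, n, size=1000)
ratings = rng.integers(1, 6, size=1000).astype(float)

U = 0.1 * rng.standard_normal((m, k))
I = 0.1 * rng.standard_normal((n, k))
for _ in range(n_epochs):
    for u, i, r in zip(users, items, ratings):
        err = r - U[u] @ I[i]                  # prediction error for one sample
        u_old = U[u].copy()
        U[u] += lr * (err * I[i] - lam * U[u])
        I[i] += lr * (err * u_old - lam * I[i])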
52. Model Training – Non-linear models
Simple Deep Learning model for collaborative filtering
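The original slide shows the architecture as a diagram; below is a minimal PyTorch sketch of one common form of such a model (a dot product of learned user and item embeddings plus biases). The class name, layer sizes and toy inputs are illustrative assumptions, not the slide's exact architecture:

import torch
import torch.nn as nn

class SimpleCFModel(nn.Module):
    """User and item embeddings combined with a dot product plus biases."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.user_bias = nn.Embedding(n_users, 1)
        self.item_bias = nn.Embedding(n_items, 1)

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)
        i = self.item_emb(item_ids)
        dot = (u * i).sum(dim=1, keepdim=True)
        return (dot + self.user_bias(user_ids) + self.item_bias(item_ids)).squeeze(1)

model = SimpleCFModel(n_users=1000, n_items=500)
scores = model(torch.tensor([0, 1]), torch.tensor([10, 20]))   # predicted scores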
55. Model Training – Complex Deep Learning problem
More complex Deep Learning model for collaborative filtering
56. Model Training – Complex Deep Learning problem
Training with Deep Learning
● Use Deep Learning Framework (e.g. PyTorch, TensorFlow)
● ...or at least an automatic differentiation library (e.g. Theano, Chainer)
● Acceleration Heuristics (e.g. AdaGrad, Nesterov, RMSProp, Adam, NAdam)
● DropOut / BatchNorm
● Watch out for Sparse Momentum Updates! Most Deep Learning frameworks don't support them
● Hyper-parameter Optimization and Architecture Search (e.g. Gaussian Processes)
58. Model Training – Conclusions
Conclusions
● Global Avg: baseline; trivial model; time complexity +; underfits; 0 hyper-params; implement with Numpy
● User Avg: baseline; trivial model; time complexity +; underfits; 0 hyper-params; implement with Numpy
● Item-Item: when users >> items; simple model; time complexity +++; may underfit; 0 hyper-params; implement with Numpy
● Linear: when "I" is known; linear model; time complexity ++++; may overfit; 1 hyper-param; implement with Numpy
● Linear + Reg: when "I" is known; linear model; time complexity ++++; may perform badly; 2 hyper-params; implement with Numpy
● Matrix Fact: when "I" is unknown; linear model; time complexity ++++; may overfit; 2 to 3 hyper-params; implement with LightFM or Spark
● Deep Learning: when extra datasets are available; non-linear model; time complexity ++; can overfit; many hyper-params; implement with neural-network libraries
59. Model Training – Conclusions
Take home
● Always start with the simplest, stupidest models
● Spend time on simple interpretable models to debug your codebase and clean your data
● Gradually increase the complexity of your models
● Add more regularization as soon as a complex model performs worse than a simpler model