1. Cao et al. ICML 2010
Presented by Danushka Bollegala.
2. Predict links (relations) between entities
Recommend items for users (MovieLens, Amazon)
Recommend users for users (social recommendation)
Similarity search (suggest similar web pages)
Query suggestion (suggest related queries by other users)
Collective Link Prediction (CLP)
Perform multiple prediction tasks for the same set of users
simultaneously
▪ Predict/recommend multiple item types (books and movies)
Pros
Prediction tasks might not be independent; one can
benefit from another (books vs. movies vs. food)
Less affected by data sparseness (cold start problem)
3. [Diagram: related work leading to this paper]
Transfer Learning + Collective Link Prediction (this paper)
Gaussian Process for Regression (GPR) (PRML Sec. 6.4)
Link prediction = matrix factorization
Probabilistic Principal Component Analysis (PPCA) (Tipping & Bishop, 1999; PRML Chapter 12)
Probabilistic non-linear matrix factorization (Lawrence & Urtasun, ICML 2009)
Task similarity matrix, T
4. Link matrix X (x_{i,j} is the rating given by user i to item j)
x_{i,j} is modeled by f(u_i, v_j, ε)
f: link function
ui: latent representation of a user i
vj: latent representation of an item j
ε: noise term
Generalized matrix approximation
Assumption: ε is Gaussian noise N(0, σ²I)
Use Y = f⁻¹(X)
Then, Y follows a multivariate Gaussian distribution.
6. We can view a function as an infinite dimensional
vector
f(x): (f(x_1), f(x_2), ...)ᵀ
Each point in the domain is mapped by f to a dimension in
the vector
In machine learning we must find functions (e.g. linear
predictors) that map input values to their
corresponding output values
We must also avoid over-fitting
This can be visualized as sampling from a distribution
over functions with certain properties
Preference bias (cf. restriction bias)
7. Linear regression model
We get different output functions y for
different weight vectors w.
Let us impose a Gaussian prior over w
Training dataset: {(x_1, y_1), ..., (x_N, y_N)}
Targets: y = (y_1, ..., y_N)ᵀ
Design matrix
8. When we impose a Gaussian prior over the
weight vector, then the target y is also
Gaussian.
K: Kernel matrix (Gram matrix)
k: kernel function
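The formulas on slides 7 and 8 appear to have been lost in extraction. Assuming the slides follow PRML Sec. 6.4.1, which the talk cites, the derivation would read roughly as below. The feature map φ and the prior precision α are PRML's symbols, not ones preserved in this transcript.

```latex
\begin{align}
y(\mathbf{x}) &= \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}), &
p(\mathbf{w}) &= \mathcal{N}(\mathbf{w}\mid\mathbf{0},\alpha^{-1}\mathbf{I}) \\
\mathbf{y} &= \boldsymbol{\Phi}\mathbf{w}, &
\Phi_{nk} &= \phi_k(\mathbf{x}_n) \\
\mathbb{E}[\mathbf{y}] &= \mathbf{0}, &
\operatorname{cov}[\mathbf{y}] &= \alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathsf T} = \mathbf{K} \\
K_{nm} &= k(\mathbf{x}_n,\mathbf{x}_m)
        = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_n)^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_m)
\end{align}
```

Because y is a linear function of the Gaussian vector w, it is itself Gaussian, and its covariance depends on the inputs only through the kernel function k.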
9. A Gaussian process is defined as a probability
distribution over functions y(x) such that the set
of values y(x) evaluated at an arbitrary set of
points x_1, ..., x_N jointly have a Gaussian
distribution.
p(y(x_1), ..., y(x_N)) is Gaussian.
Often the mean is set to zero
Non-informative prior
Then the kernel function fully defines the GP.
Gaussian kernel:
Exponential Kernel:
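The kernel formulas themselves are missing from this slide. A minimal sketch, assuming the standard forms from PRML Sec. 6.4 (Gaussian: exp(-||x - x'||² / 2σ²); exponential: exp(-θ|x - x'|)) and one-dimensional inputs, shows how a zero-mean GP prior is fully determined by the kernel. The function names, the parameters `sigma`, `theta` and `jitter`, and the input grid are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.3):
    """Assumed form: k(x, x') = exp(-(x - x')^2 / (2 sigma^2)) for 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def exponential_kernel(a, b, theta=1.0):
    """Assumed form: k(x, x') = exp(-theta |x - x'|) for 1-D inputs."""
    return np.exp(-theta * np.abs(a[:, None] - b[None, :]))

def sample_gp_prior(kernel, x, n_samples=3, jitter=1e-6, seed=0):
    """Draw functions from a zero-mean GP prior evaluated at the points x."""
    rng = np.random.default_rng(seed)
    # A small diagonal jitter keeps the kernel matrix numerically positive definite.
    K = kernel(x, x) + jitter * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), K, size=n_samples)

x = np.linspace(-1.0, 1.0, 50)
smooth = sample_gp_prior(gaussian_kernel, x)     # each row is one sampled function
rough = sample_gp_prior(exponential_kernel, x)   # each row is one sampled function
```

Plotting the rows of `smooth` and `rough` against `x` would show that the choice of kernel controls how smooth the sampled functions are.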
11. PMF can be seen as a Gaussian Process with latent variables
(GP-LVM) [Lawrence & Urtasun, ICML 2009]
Generalized matrix approximation model
Y = f⁻¹(X) follows a multivariate Gaussian distribution
A Gaussian prior is set on U
Probabilistic PCA model by
Tipping & Bishop (1999)
Non-linear version
Mapping
back to X
12.
13. GP model for each task
A single model for all tasks
14. Known as the Kronecker product of two
matrices (e.g., numpy.kron(a, b))
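As a small illustration of the Kronecker product, the sketch below assumes the single joint model combines a task similarity matrix T with a per-task kernel matrix K as T ⊗ K. This is a common multi-task GP construction; the exact formula on the slide is not preserved here, and the matrix values are made up for illustration.

```python
import numpy as np

# Hypothetical similarity between 2 tasks (e.g. books and movies).
T = np.array([[1.0, 0.8],
              [0.8, 1.0]])

# Hypothetical kernel matrix over 3 users.
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

# Kronecker product: a 2x2 grid of 3x3 blocks, where block (s, t) is T[s, t] * K.
joint = np.kron(T, K)

within_task = joint[:3, :3]    # covariance inside task 0, equal to K
across_tasks = joint[:3, 3:]   # covariance between tasks 0 and 1, equal to 0.8 * K
```

The off-diagonal blocks are what allow observations in one task to inform predictions in another; with T equal to the identity the tasks decouple into independent models.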
15. Each task might have a different rating
distribution.
c, α, b are parameters that must be estimated
from the data.
We can relax the constraint α > 0 if we have
no prior knowledge about whether the rating
distribution is negatively skewed.
16. Similar to GPR prediction
Predicting y = g(x)
Predicting x
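For reference, standard GPR prediction (PRML Sec. 6.4.2) can be sketched as follows. This is not the paper's predictive equation, which is missing from the slide; it only shows the generic form the slide says the method resembles. The Gaussian kernel, the `noise_var` and `length` parameters, and the toy data are assumptions, and the final step of mapping a predicted y back to a rating x through the link function is omitted because the form of g is not given here.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Gaussian kernel on 1-D inputs; note k(x, x) = 1."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * length ** 2))

def gpr_predict(x_train, y_train, x_test, noise_var=1e-2, length=1.0):
    """Predictive mean and variance of a zero-mean GP with Gaussian noise."""
    # C = K + noise_var * I is the covariance of the noisy training targets.
    C = rbf(x_train, x_train, length) + noise_var * np.eye(len(x_train))
    k_star = rbf(x_train, x_test, length)        # shape (N_train, N_test)
    mean = k_star.T @ np.linalg.solve(C, y_train)
    v = np.linalg.solve(C, k_star)
    # Prior variance k(x*, x*) = 1, plus noise, minus what the data explains.
    var = 1.0 + noise_var - np.sum(k_star * v, axis=0)
    return mean, var

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
mean, var = gpr_predict(x_train, y_train, np.array([0.0, 0.5]))
```

The two linear solves against C are the kernel-matrix inversion step, whose cost grows cubically with the number of training points.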
17. Compute the likelihood of the dataset
Use Stochastic Gradient Descent for
optimization
Non-convex optimization
Sensitive to initial conditions
18. Setting
Use each dataset and predict multiple items
Datasets
MovieLens
▪ 100,000 ratings, 1-5 scale, 943 users, 1,682 movies, 5
popular genres
Book-Crossing
▪ 56148 ratings, 1-10 scale, 28503 users, 9909 books, 4 most
general Amazon book categories
Douban
▪ A social network-based recommendation service
▪ 10,000 users, 200,000 items
▪ Movies, books, music
19. Evaluation measure
Mean Absolute Error (MAE)
Baselines
I-GP: Independent Link Prediction using GP
CMF: Collective matrix factorization
▪ non-GP, classical NMF
M-GP: Joint link prediction using multi-relational GP
▪ Does not consider the similarity between tasks
Proposed method = CLP-GP
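The evaluation measure above, Mean Absolute Error, is the average absolute difference between true and predicted ratings. A minimal sketch follows; the function name and the example ratings are made up for illustration.

```python
import numpy as np

def mean_absolute_error(true_ratings, predicted_ratings):
    """MAE = mean of |true - predicted| over all held-out ratings."""
    t = np.asarray(true_ratings, dtype=float)
    p = np.asarray(predicted_ratings, dtype=float)
    return float(np.mean(np.abs(t - p)))

# Hypothetical held-out ratings and model predictions.
mae = mean_absolute_error([5, 3, 4, 1], [4.5, 3.0, 2.0, 2.0])
```

Lower MAE means predictions are closer to the true ratings on average.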
20. Note: (1) Smaller values are better.
(2) (+)/(-) denote with/without the link function.
23. Romance and Drama are very similar
Action and Comedy are very dissimilar
24. Pros:
Elegant model and well-written paper
Few parameters (latent space dimension k)
need to be specified
All other parameters can be learnt
Applicable to a wide range of tasks
Cons:
Computational complexity
▪ Predictions require kernel matrix inversion
▪ SGD updates might not converge
▪ The problem is non-convex...