In this lecture, I will first cover recent advances in neural recommender systems, such as autoencoder-based and MLP-based models. Then, I will introduce recent achievements in automatic playlist continuation for music recommendation.
4. Search vs. Recommendation
How can we help users get access to relevant data?
Pull mode (search engines)
Users take initiative.
Ad-hoc information need
Push mode (recommender systems)
Systems take initiative.
Stable information need, or the system already knows the user's information need.
9. What is Collaborative Filtering?
Given a target user, Alice, find a set of users whose preference
patterns are similar to that of the target user.
Predict a list of items that Alice is likely to prefer.
[Figure: Top-N recommendation for the target user, Alice]
① Infer Alice's preferences.
② Find a set of users whose preferences are similar to Alice's.
③ Recommend a list of items that this user group prefers.
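As an illustration of these three steps, here is a minimal user-based CF sketch in Python; the toy matrix, function names, and weighting scheme are my own, not from the lecture:

```python
import numpy as np

# Toy user-item rating matrix; 0 means "not rated". Row 0 is Alice.
R = np.array([[3, 3, 0, 2],
              [0, 0, 4, 1],
              [5, 4, 0, 0],
              [3, 0, 0, 3]], dtype=float)

def cosine(u, v):
    """Cosine similarity computed over co-rated items only."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    return u[mask] @ v[mask] / (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask]))

alice, others = R[0], R[1:]
sims = np.array([cosine(alice, row) for row in others])            # steps 1-2
pred = sims @ np.where(others > 0, others, 0) / (sims @ (others > 0) + 1e-9)
unseen = np.flatnonzero(alice == 0)
print("Top-N recommendation:", unseen[np.argsort(-pred[unseen])])  # step 3
```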
10. User-Item Rating Matrix
A user-item rating matrix R of the target user, Alice, and
other users is given:
R: a user-item rating matrix (𝒎 × 𝒏 matrix)
Determine whether she would like or dislike the movies she has not yet seen.
[Figure: m × n user-item rating matrix R, with one row highlighted for Alice; "?" marks unobserved ratings]
3  3  ?  2
?  ?  4  1
5  4  ?  ?
3  ?  ?  3
11. Latent Factor Models
How to model user-item interactions?
U: latent user matrix (𝒎 × 𝒌 matrix)
Each user is represented by a latent vector (1 × 𝑘 vector).
V: latent item matrix (𝒏 × 𝒌 matrix)
Each item is represented by a latent vector (1 × 𝑘 vector).
User-item interaction: f(U, V) = ?
12. Factorizing Two Latent Matrices
The user-item rating matrix R can be approximated by the product of two latent matrices U and V, i.e., R ≈ UVᵀ.
R: user-item rating matrix (𝑚 × 𝑛 matrix)
U: latent user matrix (𝑚 × 𝑘 matrix)
V: latent item matrix (𝑛 × 𝑘 matrix)
𝑘: # of latent features
[Figure: R ≈ U Vᵀ; the m × n rating matrix (with "?" for unobserved entries) is factorized into an m × k latent user matrix U and a k × n matrix Vᵀ]
Yehuda Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” KDD 2008
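To make the factorization concrete, here is a minimal SGD sketch over the observed entries; the hyperparameters and toy matrix are arbitrary choices, not from the slides:

```python
import numpy as np

def factorize(R, k=8, lr=0.01, reg=0.02, epochs=200):
    """Approximate R (0 = unobserved) as U @ V.T using SGD on observed entries."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, k))
    V = rng.normal(scale=0.1, size=(n, k))
    rows, cols = np.nonzero(R)
    for _ in range(epochs):
        for u, i in zip(rows, cols):
            err = R[u, i] - U[u] @ V[i]                 # prediction error
            U[u] += lr * (err * V[i] - reg * U[u])      # regularized gradient steps
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

R = np.array([[3, 3, 0, 2], [0, 0, 4, 1], [5, 4, 0, 0], [3, 0, 0, 3]], float)
U, V = factorize(R)
print(np.round(U @ V.T, 1))   # predicted ratings, including the "?" cells
```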
13. Limitation of Existing Models
Existing models are mainly based on a linear user-item interaction.
However, the true user-item interaction may be non-linear and non-trivial.
[Figure: the same R ≈ U Vᵀ factorization as above; the inner product can capture only linear interactions]
15. Statistics of RecSys Models using DNNs
The number of such models has grown exponentially over the last five years.
SIGIR, WWW, RecSys, KDD, AAAI, WSDM, NIPS, …
Shuai Zhang et al., “Deep Learning based Recommender System: A Survey and New Perspectives,” 2017
16. Categories of RecSys Models using DNNs
Deep learning based recommender systems:
Neural network models (recommendation relies solely on DL)
  Models using a single DL technique: MLP, AE, CNN, RNN, DSSM, RBM, NADE, GAN
  Deep composite models (combining multiple DL techniques)
Integration models (integrating DL with a traditional RS)
  Loosely coupled models
  Tightly coupled models
18. AutoRec: Autoencoder-based Model
For each item, reconstruct its rating vector.
Only the observed ratings are used to update the model.
[Figure: item-based AutoRec: an observed rating vector r from the user-item matrix R is encoded into h(r) by (W, b) and reconstructed as r̂ by (W′, b′)]
Suvash Sedhain et al., “AutoRec: Autoencoders Meet Collaborative Filtering,” WWW 2015
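A minimal PyTorch-style sketch of the idea, assuming a reconstruction loss masked to observed entries; the layer sizes and names are illustrative, not from the paper:

```python
import torch
import torch.nn as nn

class AutoRec(nn.Module):
    """Item-based AutoRec: reconstruct each item's rating vector over all users."""
    def __init__(self, num_users, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(num_users, hidden), nn.Sigmoid())
        self.decoder = nn.Linear(hidden, num_users)

    def forward(self, r):
        return self.decoder(self.encoder(r))

def masked_mse(r_hat, r):
    # Only observed ratings (non-zero entries) contribute to the loss.
    mask = (r > 0).float()
    return ((r_hat - r) ** 2 * mask).sum() / mask.sum().clamp(min=1)

model = AutoRec(num_users=4)
r = torch.tensor([[3., 0., 5., 3.]])      # one item's ratings, 0 = unobserved
loss = masked_mse(model(r), r)
loss.backward()
```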
19. Denoising Autoencoder (DAE)
Learn to reconstruct a user's set r of preferred items from randomly corrupted subsets, i.e., a denoising autoencoder.
[Figure: the denoised (partially dropped) input is encoded into h(r) by (W, b), and the full vector r̂ is reconstructed by (W′, b′); training data comes from the user-item matrix R]
Yao Wu et al., “Collaborative Denoising Auto-Encoders for Top-N Recommender Systems,” WSDM 2016
20. Collaborative Denoising Autoencoder
Learn to reconstruct a user's set r of preferred items from randomly corrupted subsets, as in the denoising autoencoder.
Train over all users, with item variables shared across users plus one user-specific variable (a k × 1 vector) per user.
[Figure: CDAE: the corrupted input, together with a one-hot user vector mapped through V to a k × 1 user embedding, is encoded by (W, b) and decoded by (W′, b′) to reconstruct r̂]
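A sketch of CDAE under the assumptions above, using dropout as a stand-in for the corruption step and a sigmoid hidden layer; the sizes are arbitrary:

```python
import torch
import torch.nn as nn

class CDAE(nn.Module):
    """Denoising AE with an additive per-user embedding (the k x 1 user variable)."""
    def __init__(self, num_users, num_items, hidden=64, drop=0.5):
        super().__init__()
        self.corrupt = nn.Dropout(p=drop)                 # zeros (and rescales) inputs in training
        self.user_emb = nn.Embedding(num_users, hidden)   # one-hot user vector times V
        self.enc = nn.Linear(num_items, hidden)
        self.dec = nn.Linear(hidden, num_items)

    def forward(self, r, user_id):
        h = torch.sigmoid(self.enc(self.corrupt(r)) + self.user_emb(user_id))
        return self.dec(h)

model = CDAE(num_users=100, num_items=500)
r = torch.zeros(1, 500); r[0, [3, 42, 99]] = 1.0          # user's preferred items
scores = model(r, torch.tensor([7]))                       # rank items by score
```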
21. Generalized Matrix Factorization (GMF)
Embeddings are used to learn latent user and item features.
Input: one-hot feature vectors for user u and item i
Output: predicted score ŷ_ui
The element-wise product, followed by a bias-free fully connected layer, subsumes the existing MF model.
[Figure: GMF: one-hot user and item inputs feed user and item embeddings; their element-wise product passes through a fully connected layer without bias to produce ŷ_ui; training ratings come from the matrix R]
Xiangnan He et al., “Neural Collaborative Filtering,” WWW 2017
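A minimal sketch of GMF in PyTorch; the layer names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Generalized Matrix Factorization: element-wise product of embeddings,
    followed by a bias-free linear layer producing the score y_ui."""
    def __init__(self, num_users, num_items, k=16):
        super().__init__()
        self.user = nn.Embedding(num_users, k)
        self.item = nn.Embedding(num_items, k)
        self.out = nn.Linear(k, 1, bias=False)  # with uniform weights, this reduces to plain MF

    def forward(self, u, i):
        return torch.sigmoid(self.out(self.user(u) * self.item(i))).squeeze(-1)

model = GMF(num_users=1000, num_items=2000)
y_hat = model(torch.tensor([3]), torch.tensor([42]))   # predicted preference score
```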
22. Step 1: Embedding Users and Items
User embedding
Represent a latent feature for each user.
Item embedding
Similarly, it represents a latent feature for each item.
Multiplying a one-hot vector by the embedding matrix selects one row as the latent vector:
[0 1 0 … 0] (1 × m vector) × E (m × k matrix) = [0.6 0.4 … 1.2] (1 × k vector)
[Figure: one-hot user and item inputs feeding the user and item embedding layers to produce the latent user and item vectors]
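The same lookup in code: multiplying by a one-hot vector is just row selection (toy sizes):

```python
import numpy as np

m, k = 5, 3
E = np.random.default_rng(0).random((m, k))   # m x k embedding matrix
one_hot = np.eye(m)[1]                        # one-hot vector for user 2
assert np.allclose(one_hot @ E, E[1])         # (1 x m) @ (m x k) selects row 2
```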
30. Distributional Hypothesis
“You shall know a word by the company it
keeps.” by J.R. Firth (1957)
Words with similar contexts share similar meanings.
One of the most successful ideas of modern NLP
What is Tejuino?
A cup of Tejuino is on the table.
A woman likes Tejuino.
Tejuino makes you drunk.
I usually drink Tejuino every morning.
31. Prod2Vec using Word Embeddings
How do we embed products?
Adopt the skip-gram model for products.
Input: the i-th product purchased by the user
Output: the other products purchased by the user
[Figure: the analogy between a set of words in a sentence and the set of products purchased by a user]
Mihajlo Grbovic et al., “E-commerce in Your Inbox: Product Recommendations at Scale,” KDD 2015
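A sketch of the idea with gensim's Word2Vec, treating each user's purchase history as a "sentence"; the product IDs and parameters are made up for illustration:

```python
from gensim.models import Word2Vec

# Each "sentence" is one user's purchase history (product IDs as tokens).
purchases = [
    ["p101", "p205", "p317", "p205"],
    ["p317", "p101", "p412"],
    ["p205", "p412", "p101"],
]

model = Word2Vec(purchases, vector_size=32, window=5, min_count=1, sg=1)  # sg=1: skip-gram
print(model.wv.most_similar("p101", topn=3))   # products co-purchased with p101
```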
32. Prod2Vec using Word Embeddings
Map this onto the existing user-item matrix: a word corresponds to a movie, and a set of words corresponds to a user.
The window size is ignored.
[Figure: from the rating matrix, the i-th movie watched by the user passes through projection, transform, and softmax layers to predict the other movies watched by the user]
33. Possible Models of Prod2Vec
Prod2Vec (skip-gram model): the i-th product purchased by the user → projection → transform → softmax → the other products purchased by the user
User embedding + Prod2Vec: the user's purchased products except the i-th one → projection + averaging → transform → softmax → the i-th product purchased by the user
34. About Time (2013)
Drama/Fantasy/Coming-of-age
If you could relive a moment of your life, could you achieve perfect love? On the day Tim, who has never had a girlfriend, comes of age, he learns a startling family secret from his father: he has the ability to travel back in time! Tim moves to London to pursue his dream of finding a girlfriend, and falls for Mary at first sight...
Similar movies retrieved by the GloVe and skip-gram embeddings:
Silver Linings Playbook (2012): Drama/Comedy
Secret Life of Walter Mitty, The (2013): Fantasy/Drama
Perks of Being a Wallflower, The (2012): Drama/Coming-of-age
The Theory of Everything (2014): Drama/Romance
What If (2013): Romance/Comedy
Man Up (2015): Drama/Romance
Love, Rosie (2014): Romance/Comedy
Two Night Stand (2014): Romance/Comedy
35. Mulan (1998)
Animation
Disney's masterpiece, rendered with an East Asian atmosphere!
Mulan, the only daughter of the Fa family, is rejected at every matchmaking because of her tomboyish personality. When the Huns invade and a conscription order is issued, she disguises herself as a man to serve in her aged father's place...
Similar movies retrieved by the GloVe and skip-gram embeddings (all Animation):
Jungle Book, The (1967)
Antz (1998)
Lady and the Tramp (1955)
Peter Pan (1953)
Thumbelina (1994)
A Dinosaur's Story (1993)
Quest for Camelot (1998)
Return to Never Land (2002)
36. Machine, The (2013)
Fantasy/SF (robot)
The boundary between human and robot disappears!
In a new Cold War era, Ava, a killer robot created from human brain data, gradually comes to feel human emotions; with her at their head, the machines declare a final war on humankind...
Similar movies retrieved by the GloVe and skip-gram embeddings:
Signal, The (2014): Thriller/SF (computer)
Zero Theorem, The (2013): Fantasy/Drama (computer)
Autómata (2014): Thriller/Action (robot)
Cargo (2009): Thriller/Mystery (space)
the east (2013): Thriller/Action (spy)
39. NARM: Attention-based Model
Recommend the next item in a given session.
Combine global and local information, each encoded by an RNN.
Jing Li et al., “Neural Attentive Session-based Recommendation,” CIKM 2017
41. Step 2: Local Encoder in NARM
Capture the user's main purpose.
α_{tj} = q(h_t, h_j), where h_t is the latent vector of the last item
c_t^l = Σ_{j=1}^{t} α_{tj} h_j
q is an attention scoring function for h_t and h_j:
q(h_t, h_j) = vᵀ σ(A_1 h_t + A_2 h_j)
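A sketch of this attention computation in PyTorch, treating σ as the sigmoid and omitting batching; the names are illustrative:

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    """Sketch of NARM's local encoder: weighted sum of RNN hidden states,
    scored against the last hidden state h_t."""
    def __init__(self, hidden):
        super().__init__()
        self.A1 = nn.Linear(hidden, hidden, bias=False)
        self.A2 = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, H):                        # H: (t, hidden) hidden states h_1..h_t
        h_t = H[-1]                              # latent vector of the last item
        alpha = self.v(torch.sigmoid(self.A1(h_t) + self.A2(H))).squeeze(-1)  # alpha_tj
        return (alpha.unsqueeze(-1) * H).sum(0)  # c_t^l = sum_j alpha_tj * h_j

c_local = LocalEncoder(hidden=32)(torch.randn(5, 32))
```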
42. Step 3: Decoder in NARM
Concatenated vector: c_t = [c_t^g ; c_t^l] = [h_t^g ; Σ_{j=1}^{t} α_{tj} h_j^l]
Use an alternative bi-linear similarity function:
S_i = emb_iᵀ B c_t, where B is a |D| × |H| matrix and |D| is the dimension of each item embedding.
43. Combining Attention and Memory
m_s represents the average vector of the session's items:
m_s = (1/t) Σ_{i=1}^{t} x_i
m_t is the vector of the last item.
Qiao Liu et al., “Short-Term Attention/Memory Priority Model for Session-based Recommendation,” KDD 2018
44. STAMP: Attention/Memory Priority
m_a is the attention-weighted sum of the item embedding vectors:
m_a = Σ_{i=1}^{t} α_i x_i
Attention coefficient: α_i = W_0 σ(W_1 x_i + W_2 m_s + W_3 x_t + b)
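A corresponding sketch of the STAMP attention, unbatched and with the bias b folded into W_3; names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class StampAttention(nn.Module):
    """Sketch of STAMP's attention: weights depend on each item x_i,
    the session average m_s, and the last item x_t."""
    def __init__(self, d):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.W3 = nn.Linear(d, d, bias=True)   # bias term plays the role of b
        self.W0 = nn.Linear(d, 1, bias=False)

    def forward(self, X):                       # X: (t, d) item embeddings x_1..x_t
        m_s = X.mean(dim=0)                     # session memory: average of items
        x_t = X[-1]                             # last item
        alpha = self.W0(torch.sigmoid(self.W1(X) + self.W2(m_s) + self.W3(x_t)))
        return (alpha * X).sum(dim=0)           # m_a = sum_i alpha_i x_i

m_a = StampAttention(d=32)(torch.randn(6, 32))
```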
45. MMCF: Multimodal Collaborative Filtering
for Automatic Playlist Continuation
RecSys Challenge 2018
Team ‘hello world!’ (2nd place), main track
Hojin Yang, Yoonki Jeong, Minjin Choi, and Jongwuk Lee
Sungkyunkwan University, Republic of Korea
46. Automatic Playlist Continuation
Million Playlist Dataset (MPD)
[Figure: an example playlist from the MPD, showing the playlist title, the tracks in the playlist, and each track's metadata (artist, album)]
47. Challenge Set
Category:         1      2      3      4      5      6      7      8      9         10
# of tracks:      0      1      5      10     5      10     25     100    25        100
Title available:  Yes    Yes    Yes    Yes    No     No     Yes    Yes    Yes       Yes
Track order:      Seq    Seq    Seq    Seq    Seq    Seq    Seq    Seq    Shuffled  Shuffled
# of playlists:   1,000 in each category
Some playlists are seeded with only a few tracks from the first part, some with many tracks from the first part, and some with many tracks at random positions.
48. Challenge Set
Category 1 playlists contain no tracks at all.
How do we handle this edge case?
49. Challenge Set
Several categories provide only scarce information (a title and at most a few tracks).
How do we treat playlists with such scarce information?
50. Challenge Set
The input comes in various forms: title only, few or many tracks, sequential or shuffled order.
How do we deal with these various types of input?
51. Overview of the Proposed Model
An ensemble method with two components:
An autoencoder for the tracks and their metadata
A CharCNN for the playlist titles
55. Denoising Autoencoder
Training with denoising: some positive input values are corrupted (set to zero).
How can we utilize metadata such as artists and albums?
[Figure: a denoising autoencoder over tracks (Hey Jude, Rehab, Yesterday, Dancing Queen, Mamma Mia, Viva la Vida): the binary input vector is corrupted (some 1s set to 0), encoded, and decoded into reconstruction scores for every track]
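A sketch of the corruption step; the keep probability is an arbitrary choice for illustration:

```python
import numpy as np

def corrupt(x, keep_prob=0.5, seed=0):
    """Randomly zero out some positive entries of a binary playlist vector."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) < keep_prob
    return np.where(x > 0, x * mask, x)

playlist = np.array([1, 0, 1, 1, 0, 1])   # 1 = track is in the playlist
print(corrupt(playlist))                   # some positives dropped, zeros unchanged
```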
60. CharCNN for Playlist Titles
Recall the ensemble's two components:
An autoencoder for the tracks and their metadata
A CharCNN for the playlist titles
61. Word-level CNN for NLP
Effective for capturing the spatial locality of a text sequence.
[Figure: the sentence "I like this song very much" is embedded as a 6 × k matrix; a 3 × k convolution filter slides over it to produce a feature map (e.g., 2.2, 2.3, -1.3, 0.9), and max pooling keeps the largest value (2.3)]
62. Word-level CNN for NLP
Effective for capturing the spatial locality of a text sequence.
[Figure: with multiple convolution filters, each filter produces its own feature map; max pooling over each map keeps one value (e.g., 2.3, 1.2, 2.4), and the pooled values are concatenated into the final feature vector]
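A sketch of this word-level CNN in PyTorch; the vocabulary size, filter widths, and filter counts are arbitrary:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Parallel conv filters over word embeddings, max-pooled and concatenated
    into a fixed-size feature vector."""
    def __init__(self, vocab=10000, k=64, widths=(3, 4, 5), filters=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, k)
        self.convs = nn.ModuleList(nn.Conv1d(k, filters, w) for w in widths)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.emb(tokens).transpose(1, 2)    # (batch, k, seq_len)
        pooled = [c(x).relu().max(dim=2).values for c in self.convs]  # max pooling
        return torch.cat(pooled, dim=1)         # concatenated feature vector

feats = TextCNN()(torch.randint(0, 10000, (2, 6)))  # e.g., "I like this song very much"
```

A character-level CNN (next slide) has the same structure, with characters instead of words as the embedded tokens.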
63. CharCNN for Playlist Titles
Playlist titles are short texts that give an abstract description of a playlist.
Use character-level embeddings.
[Figure: a character-level CNN over the playlist title: conv layers followed by max pooling produce a feature vector]
65. Combining Two Models
The accuracy of the autoencoder depends heavily on the number of tracks in a playlist.
Dynamic weighting: set the ensemble weights according to the number of items.
[Figure: for a playlist with seed tracks and the title "Chill songs", the AE scores and the CharCNN scores are blended with weights w_item = 5 and w_title = 1]
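A sketch of one plausible weighted blend, using the figure's w_item = 5 and w_title = 1; the exact dynamic rule for choosing the weights by item count is not shown on the slide:

```python
import numpy as np

def blend(ae_scores, cnn_scores, w_item=5.0, w_title=1.0):
    """Weighted average of the two models' track scores. In the dynamic
    scheme, w_item would grow with the number of seed tracks."""
    return (w_item * ae_scores + w_title * cnn_scores) / (w_item + w_title)

ae = np.array([0.7, 0.4, 0.9, 0.1, 0.2])    # autoencoder scores
cnn = np.array([0.1, 0.2, 0.3, 0.7, 0.1])   # CharCNN scores from the title
print(blend(ae, cnn))                        # final ranking scores
```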
69. SR-GNN using Graph Neural Nets
Each session graph is processed one by one, and node vectors are obtained with a gated graph neural network.
Each session is represented as a combination of the session's global preference and its current interest, computed with an attention network.
Shu Wu et al., “Session-based Recommendation with Graph Neural Networks,” AAAI 2019