Workshop on Recommender Systems
HEC Montréal, August 20-23, 2019
Neural Learning to Rank
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD student, University College London
@UnderdogGeek
Objectives
A quick recap of neural networks
The fundamentals of learning to rank
A quick recap of deep neural networks
Learning to rank with deep neural networks
Reading material
An Introduction to
Neural Information Retrieval
Foundations and Trends® in Information Retrieval
(December 2018)
Download PDF: http://bit.ly/fntir-neural
Most information retrieval
(IR) systems present a ranked
list of retrieved artifacts
Learning to Rank (LTR)
“... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.”
- Liu [2009]

Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
A quick recap of
neural networks
Neural networks
Chains of parameterized linear transforms (e.g., multiply weight, add
bias) followed by non-linear functions (σ)
Popular choices for σ: Tanh, ReLU

Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters

[Figure: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass produces the prediction, a loss compares it with the expected output, and the backward pass propagates gradients to the parameters]
Stochastic Gradient Descent (SGD)

Task: regression
Training data: (x, y) pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: w1, b1, w2, b2

The network computes y1 = tanh(w1·x + b1) and y2 = tanh(w2·y1 + b2), and the loss is l = (y − y2)²

Goal: iteratively update the learnable parameters such that the loss l is minimized

Compute the gradient of the loss l w.r.t. each parameter (e.g., w1) using the chain rule:

∂l/∂w1 = ∂l/∂y2 × ∂y2/∂y1 × ∂y1/∂w1

Update the parameter value based on the gradient, with η as the learning rate:

w1^new = w1^old − η × ∂l/∂w1

…and repeat
The subsequent slides expand each factor of this gradient in turn, using y1 = tanh(w1·x + b1) and y2 = tanh(w2·y1 + b2):

∂l/∂w1 = ∂(y − y2)²/∂y2 × ∂y2/∂y1 × ∂y1/∂w1
       = −2 × (y − y2) × ∂y2/∂y1 × ∂y1/∂w1
       = −2 × (y − y2) × ∂tanh(w2·y1 + b2)/∂y1 × ∂y1/∂w1
       = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂y1/∂w1
       = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × ∂tanh(w1·x + b1)/∂w1
       = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x

Each update w1^new = w1^old − η × ∂l/∂w1 is followed by the next training sample, and the process repeats.
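A minimal numpy sketch of this toy setup, using the gradients derived above (the synthetic data, initialization, and learning rate are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy network from the slides: y1 = tanh(w1*x + b1), y2 = tanh(w2*y1 + b2), loss l = (y - y2)^2
rng = np.random.default_rng(0)
w1, b1, w2, b2 = rng.normal(size=4)
eta = 0.1  # learning rate

# Synthetic (x, y) training pairs (illustrative)
xs = rng.uniform(-1.0, 1.0, size=200)
ys = np.tanh(0.8 * xs + 0.1)

for epoch in range(50):
    for x, y in zip(xs, ys):
        # Forward pass
        y1 = np.tanh(w1 * x + b1)
        y2 = np.tanh(w2 * y1 + b2)
        # Backward pass: chain rule, exactly as expanded on the slides
        dl_dy2 = -2.0 * (y - y2)
        dy2_dz2 = 1.0 - np.tanh(w2 * y1 + b2) ** 2
        dy1_dz1 = 1.0 - np.tanh(w1 * x + b1) ** 2
        dl_dw2 = dl_dy2 * dy2_dz2 * y1
        dl_db2 = dl_dy2 * dy2_dz2
        dl_dw1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1 * x
        dl_db1 = dl_dy2 * dy2_dz2 * w2 * dy1_dz1
        # SGD update for every learnable parameter ...and repeat
        w1, b1 = w1 - eta * dl_dw1, b1 - eta * dl_db1
        w2, b2 = w2 - eta * dl_dw2, b2 - eta * dl_db2
```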
Neural models for
non-ranking tasks
The softmax function
In neural classification models, the softmax function is popularly used
to normalize the neural network output scores across all the classes
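The formula itself appeared as an image on the slide; the standard softmax (written here with the same smoothing parameter γ that later slides use, with γ = 1 being a common choice) is:

$$p_i = \frac{e^{\gamma \cdot s_i}}{\sum_{j} e^{\gamma \cdot s_j}}$$

where s_i is the network's output score for class i.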
Cross entropy
The cross entropy between two probability distributions p and q over a discrete set of events is given by:

CE(p, q) = − Σ_i p_i · log(q_i)

If p_correct = 1 and p_i = 0 for all other values of i, then:

CE(p, q) = − log(q_correct)
Cross entropy with
softmax loss
Cross entropy with softmax is a popular loss
function for classification
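A minimal numpy sketch of this loss for a single example (the scores and class index are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_with_softmax(scores, correct_class):
    # With a one-hot target, cross entropy reduces to -log p(correct class)
    p = softmax(scores)
    return -np.log(p[correct_class])

loss = cross_entropy_with_softmax(np.array([2.0, 0.5, -1.0]), correct_class=0)
```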
Questions?
The fundamentals of
learning to rank
Problem formulation
LTR models represent a rankable item—e.g., a document or a movie or a song—given some context—e.g., a user-issued query or the user's historical interactions with other items—as a numerical vector x ∈ ℝⁿ

The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher.
Examples of ranking metrics

Discounted Cumulative Gain (DCG):

DCG@k = Σ_{i=1..k} (2^rel_i − 1) / log2(i + 1)

Reciprocal Rank (RR):

RR@k = max_{1≤i≤k} rel_i / i
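A small Python sketch of both metrics, assuming `rels` holds the graded relevance labels of the ranked list in rank order (illustrative, not from the slides):

```python
import math

def dcg_at_k(rels, k):
    # DCG@k = sum over the top-k positions of (2^rel_i - 1) / log2(i + 1), with 1-based rank i
    return sum((2 ** rel - 1) / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    # Reciprocal rank: rel_i / i maximized over the top-k positions
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

print(dcg_at_k([3, 2, 0, 1], k=4))  # graded labels
print(rr_at_k([0, 0, 1, 0], k=4))   # binary labels -> 1/3
```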
Why is ranking challenging?
Rank-based metrics, such as DCG or MRR, are non-smooth and non-differentiable
Approaches

Liu [2009] categorizes different LTR approaches based on their training objectives:

Pointwise approach
Relevance label y_{q,d} is a number—derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}.

Pairwise approach
Pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) is used as the label. Reduces to binary classification to predict the more relevant document.

Listwise approach
Directly optimize for a rank-based metric, such as NDCG—difficult because these metrics are often not differentiable w.r.t. model parameters.

Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Features

Traditional LTR models employ hand-crafted features that encode IR insights. These features can often be categorized as:

Query-independent or static features (e.g., incoming link count and document length)

Query-dependent or dynamic features (e.g., BM25)

Query-level features (e.g., query length)
Features
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
Pointwise objectives

Regression loss
Given ⟨q, d⟩, predict the value of y_{q,d}

e.g., square loss for binary or categorical labels, where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label

[Figure: documents with labels (0, 1, 1) and the model's predicted scores]

Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
Pointwise objectives

Classification loss
Given ⟨q, d⟩, predict the class y_{q,d}

e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008], where s_{y_{q,d}} is the model's score for label y_{q,d}

[Figure: documents with graded labels and the model's predicted class probabilities]

Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
Pairwise objectives

Given ⟨q, d_i, d_j⟩, predict the more relevant document

For ⟨q, d_i⟩ and ⟨q, d_j⟩:
Feature vectors: x_i and x_j
Model scores: s_i = f(x_i) and s_j = f(x_j)

Pairwise loss minimizes the average number of inversions in ranking—i.e., d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i

Pairwise loss generally has the following form [Chen et al., 2009]: ℒ = φ(s_i − s_j), where φ can be:
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^(−z) [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005]
• Others…
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
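A sketch of the three φ choices above applied to the score difference s_i − s_j of a labelled pair (illustrative helper, not from the slides):

```python
import numpy as np

def pairwise_loss(s_i, s_j, phi="logistic"):
    # s_i: model score of the preferred item, s_j: score of the less relevant item
    z = s_i - s_j
    if phi == "hinge":
        return np.maximum(0.0, 1.0 - z)        # Herbrich et al. [2000]
    if phi == "exponential":
        return np.exp(-z)                      # Freund et al. [2003]
    if phi == "logistic":
        return np.log(1.0 + np.exp(-z))        # Burges et al. [2005], i.e. RankNet with gamma = 1
    raise ValueError(f"unknown phi: {phi}")
```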
Pairwise objectives

RankNet loss
Pairwise loss function proposed by Burges et al. [2005]—an industry favourite [Burges, 2015]

Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^(γ·s_i) / (e^(γ·s_i) + e^(γ·s_j)) = 1 / (1 + e^(−γ·(s_i − s_j)))

Desired probabilities: p̄_ij = 1 and p̄_ji = 0

Computing the cross entropy between the desired probabilities p̄ and the predicted probabilities p:

ℒ_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^(−γ·(s_i − s_j)))

[Figure: pairwise preference modeled as the probability that one document scores higher than the other]
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
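A one-line numpy sketch of the resulting loss for a pair where d_i is labelled more relevant than d_j (scores and γ are illustrative):

```python
import numpy as np

def ranknet_loss(s_i, s_j, gamma=1.0):
    # Since the desired probability of (d_i over d_j) is 1, the cross entropy
    # reduces to -log p_ij = log(1 + exp(-gamma * (s_i - s_j)))
    return np.log(1.0 + np.exp(-gamma * (s_i - s_j)))

loss = ranknet_loss(s_i=1.2, s_j=0.4)
```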
A generalized cross-entropy loss

An alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D

Predicted probability: p(d⁺|q) = e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d))

The cross-entropy loss is then given by:

ℒ_CE(q, d⁺, D) = −log p(d⁺|q) = −log ( e^(γ·s(q,d⁺)) / Σ_{d∈D} e^(γ·s(q,d)) )

Computing the softmax over the full collection is prohibitively expensive—LTR models typically consider a small number of negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
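A sketch of this loss with a handful of sampled negatives standing in for the full collection (scores are illustrative):

```python
import numpy as np

def sampled_softmax_cross_entropy(score_pos, scores_neg, gamma=1.0):
    # Softmax over the relevant document and a few sampled negatives,
    # then -log of the probability assigned to the relevant document
    scores = gamma * np.concatenate(([score_pos], scores_neg))
    scores = scores - scores.max()                    # numerical stability
    return -(scores[0] - np.log(np.exp(scores).sum()))

loss = sampled_softmax_cross_entropy(score_pos=2.1, scores_neg=np.array([0.3, -0.5, 1.0, 0.1]))
```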
Listwise objectives

[Figure from Burges, 2010: two rankings with blue = relevant and gray = non-relevant documents; NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors]

Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than those at lower ranks

But listwise metrics are non-continuous and non-differentiable

Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
Listwise objectives

Burges et al. [2006] make two observations:
1. To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. model scores)
2. It is desired that the gradient be bigger for pairs of documents whose swap produces a bigger change in NDCG

LambdaRank loss
Multiply the actual gradients with the change in NDCG obtained by swapping the rank positions of the two documents

Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
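A hedged sketch of the LambdaRank idea: keep the RankNet gradient for a pair, but scale it by |ΔNDCG|, the change in NDCG that swapping the two documents' rank positions would produce (the `delta_ndcg_ij` value is assumed to be computed elsewhere):

```python
import numpy as np

def lambda_ij(s_i, s_j, delta_ndcg_ij, gamma=1.0):
    # RankNet gradient w.r.t. s_i for a pair where d_i should rank above d_j ...
    ranknet_grad = -gamma / (1.0 + np.exp(gamma * (s_i - s_j)))
    # ... scaled by how much NDCG changes if the two documents swap positions
    return ranknet_grad * abs(delta_ndcg_ij)
```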
Listwise objectives

According to the Luce model [Luce, 1959], given four items {d1, d2, d3, d4} the probability of observing a particular rank-order, say ⟨d2, d1, d4, d3⟩, is given by:

p(π) = φ(s2) / (φ(s1) + φ(s2) + φ(s3) + φ(s4)) × φ(s1) / (φ(s1) + φ(s3) + φ(s4)) × φ(s4) / (φ(s3) + φ(s4)) × φ(s3) / φ(s3)

where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i

ListNet loss
Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive.

ListMLE loss
Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one ideal permutation is possible.

R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
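A sketch of the commonly used top-one approximation of the ListNet loss: the cross entropy between the softmax of the model scores and the softmax of the ground-truth labels (equivalent to the K-L divergence up to a constant; the scores and labels below are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def listnet_top_one_loss(scores, labels):
    # Top-one probabilities induced by model scores and by graded relevance labels
    p_scores, p_labels = softmax(scores), softmax(labels)
    # Cross entropy between the label distribution and the score distribution
    return -(p_labels * np.log(p_scores)).sum()

loss = listnet_top_one_loss(np.array([0.2, 1.5, -0.3]), np.array([0.0, 2.0, 1.0]))
```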
Listwise objectives
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of
documents as a function of their scores
This “smooth” rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
Questions?
A quick recap of
deep neural networks
Types of vector representations
Local (or one-hot) representation
Every term in vocabulary T is represented by a
binary vector of length |T|, where one position in
the vector is set to one and the rest to zero
Distributed representation
Every term in vocabulary T is represented by a
real-valued vector of length k. The vector can be
sparse or dense. The vector dimensions may be
observed (e.g., hand-crafted features) or latent
(e.g., embedding dimensions).
Different modalities of input text representation
Shift-invariant
neural operations
Detecting a pattern in one part of the input space is similar to
detecting it in another
Leverage redundancy by moving a window over the whole
input space and then aggregate
On each instance of the window a kernel—also known as a
filter or a cell—is applied
Different aggregation strategies lead to different architectures
Convolution
Move the window over the input space each time applying
the same cell over the window
A typical cell operation can be:

h = σ(W·X + b)
Full Input [words x in_channels]
Cell Input [window x in_channels]
Cell Output [1 x out_channels]
Full Output [1 + (words – window) / stride x out_channels]
Pooling

Move the window over the input space, each time applying an aggregate function over each dimension within the window:

h_j = max_{i∈win} X_{i,j}   (max-pooling)   or   h_j = avg_{i∈win} X_{i,j}   (average-pooling)

Full Input [words x channels]
Cell Input [window x channels]
Cell Output [1 x channels]
Full Output [1 + (words – window) / stride x channels]
Convolution w/
Global Pooling
Stacking a global pooling layer on top of a convolutional layer
is a common strategy for generating a fixed-length embedding for variable-length text
Full Input [words x in_channels]
Full Output [1 x out_channels]
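A numpy sketch of a 1-D convolution over a word sequence followed by global max pooling, producing a fixed-length vector regardless of text length (shapes follow the slides; the random weights and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
words, in_channels, out_channels, window = 12, 8, 16, 3

X = rng.normal(size=(words, in_channels))           # full input [words x in_channels]
W = rng.normal(size=(window * in_channels, out_channels))
b = np.zeros(out_channels)

# Convolution: apply the same cell (a linear map + ReLU here) to every window (stride 1)
conv = np.stack([
    np.maximum(0.0, X[i:i + window].reshape(-1) @ W + b)   # cell output [1 x out_channels]
    for i in range(words - window + 1)
])                                                   # [(words - window + 1) x out_channels]

# Global max pooling over positions: fixed-length embedding [1 x out_channels]
embedding = conv.max(axis=0)
```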
Recurrent neural
network
Similar to a convolutional layer, but with an additional dependency on the previous hidden state

A simple cell operation is shown below, but cells like LSTMs and GRUs are more popular in practice:

h_i = σ(W·X_i + U·h_{i−1} + b)
Full Input [words x in_channels]
Cell Input [window x in_channels] + [1 x out_channels]
Cell Output [1 x out_channels]
Full Output [1 x out_channels]
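A numpy sketch of this simple recurrent cell scanning a word sequence and keeping only the final hidden state (dimensions and weights are illustrative; in practice an LSTM or GRU cell replaces the update below):

```python
import numpy as np

rng = np.random.default_rng(0)
words, in_channels, out_channels = 12, 8, 16

X = rng.normal(size=(words, in_channels))   # full input [words x in_channels]
W = rng.normal(size=(in_channels, out_channels))
U = rng.normal(size=(out_channels, out_channels))
b = np.zeros(out_channels)

h = np.zeros(out_channels)                  # initial hidden state
for x_i in X:
    # Each step depends on the current input and the previous hidden state
    h = np.tanh(x_i @ W + h @ U + b)

# Full output: the final hidden state, [1 x out_channels]
```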
Questions?
Learning to rank with
deep neural networks
Representation
learning for IR
Many IR scenarios—e.g., web search and content-based
filtering for recommender systems—involve matching
items based on their descriptions
Deep learning models can be useful for learning good representations of items for matching—i.e., LTR with raw inputs instead of hand-engineered features
DNN for YouTube
recommendation
Input: user profile
Output: probability distribution over
items to be recommended
Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, 2016.
Siamese networks
Relevance estimated by cosine similarity
between item embeddings
Input: character trigraph counts (bag of
words assumption)
Minimizes cross-entropy loss against
randomly sampled negative documents
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Wide and deep model
Deep model for representation
learning and wide model for
memorization
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
Lexical and semantic
matching networks
Mitra et al. [2017] argue that both lexical and semantic matching are important for document ranking

The Duet model is a linear combination of two DNNs—focusing on lexical and semantic matching, respectively—jointly trained on labelled data
code: http://bit.ly/duetv2code
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
GET THE CODE
Large scale pretrained
language models
BERT and other large-scale unsupervised language models are demonstrating dramatic performance improvements on many IR tasks

Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. In arXiv, 2019.
Dealing with
multiple fields
Real world items have multiple sources of
descriptions—creates unique challenges for
representation learning models
Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018.
Juan Li, Zhicheng Dou, Yutao Zhu, Xiaochen Zuo, and Ji-Rong Wen. Deep cross-platform product matching in e-commerce. In IRJ, 2019.
Questions?
Key takeaways
• Learning to Rank is effective on a broad range of IR tasks
• Optimization of non-smooth rank-based metrics is challenging but loss functions, such
as RankNet and LambdaRank, are effective in practice
• Learning to rank models can operate over hand-engineered features or employ deep
architectures to learn useful representations from raw input
• Large-scale pretrained language models demonstrate strong performance on tasks that involve text matching
Thank you
@UnderdogGeek bmitra@microsoft.com

Weitere ähnliche Inhalte

Was ist angesagt?

Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Toru Fujino
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringAnna Fensel
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advancedHouw Liong The
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...Shuhei Yoshida
 
Getting Started with Machine Learning
Getting Started with Machine LearningGetting Started with Machine Learning
Getting Started with Machine LearningHumberto Marchezi
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means ClusteringSajib Sen
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
 
Support Vector Machines Simply
Support Vector Machines SimplySupport Vector Machines Simply
Support Vector Machines SimplyEmad Nabil
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learningmooopan
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
K-Means Clustering Simply
K-Means Clustering SimplyK-Means Clustering Simply
K-Means Clustering SimplyEmad Nabil
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2arogozhnikov
 

Was ist angesagt? (20)

Introduction to data mining and machine learning
Introduction to data mining and machine learningIntroduction to data mining and machine learning
Introduction to data mining and machine learning
 
Lect4
Lect4Lect4
Lect4
 
Cluster Analysis for Dummies
Cluster Analysis for DummiesCluster Analysis for Dummies
Cluster Analysis for Dummies
 
Machine Learning Basics
Machine Learning BasicsMachine Learning Basics
Machine Learning Basics
 
Machine learning
Machine learningMachine learning
Machine learning
 
Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)Dual Learning for Machine Translation (NIPS 2016)
Dual Learning for Machine Translation (NIPS 2016)
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advanced
 
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
InfoGAN: Interpretable Representation Learning by Information Maximizing Gen...
 
Getting Started with Machine Learning
Getting Started with Machine LearningGetting Started with Machine Learning
Getting Started with Machine Learning
 
K-means Clustering
K-means ClusteringK-means Clustering
K-means Clustering
 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
 
Kmeans
KmeansKmeans
Kmeans
 
Support Vector Machines Simply
Support Vector Machines SimplySupport Vector Machines Simply
Support Vector Machines Simply
 
Safe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement LearningSafe and Efficient Off-Policy Reinforcement Learning
Safe and Efficient Off-Policy Reinforcement Learning
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Clustering
ClusteringClustering
Clustering
 
K-Means Clustering Simply
K-Means Clustering SimplyK-Means Clustering Simply
K-Means Clustering Simply
 
Machine learning in science and industry — day 2
Machine learning in science and industry — day 2Machine learning in science and industry — day 2
Machine learning in science and industry — day 2
 

Ähnlich wie Neural Learning to Rank

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1arogozhnikov
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxRakshaAgrawal21
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCRakshaAgrawal21
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401butest
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013Sanjeev Mishra
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientFabian Pedregosa
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmIRJET Journal
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
AlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxAlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxPerumalPitchandi
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackarogozhnikov
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningKrzysztof Kowalczyk
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical EquationsIRJET Journal
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTIRJET Journal
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesArvind Rapaka
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxVenkateswaraBabuRavi
 
Introduction to Machine Learning for Java Developers
Introduction to Machine Learning for Java DevelopersIntroduction to Machine Learning for Java Developers
Introduction to Machine Learning for Java DevelopersZoran Sevarac, PhD
 

Ähnlich wie Neural Learning to Rank (20)

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
 
Machine learning in science and industry — day 1
Machine learning in science and industry — day 1Machine learning in science and industry — day 1
Machine learning in science and industry — day 1
 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptx
 
Dive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSCDive into Machine Learning Event--MUGDSC
Dive into Machine Learning Event--MUGDSC
 
Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401Machine Learning: Foundations Course Number 0368403401
Machine Learning: Foundations Course Number 0368403401
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Hyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradientHyperparameter optimization with approximate gradient
Hyperparameter optimization with approximate gradient
 
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree AlgorithmWater Quality Index Calculation of River Ganga using Decision Tree Algorithm
Water Quality Index Calculation of River Ganga using Decision Tree Algorithm
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
AlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptxAlgorithmsModelsNov13.pptx
AlgorithmsModelsNov13.pptx
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
fINAL ML PPT.pptx
fINAL ML PPT.pptxfINAL ML PPT.pptx
fINAL ML PPT.pptx
 
Camp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine LearningCamp IT: Making the World More Efficient Using AI & Machine Learning
Camp IT: Making the World More Efficient Using AI & Machine Learning
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical Equations
 
AMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLTAMAZON STOCK PRICE PREDICTION BY USING SMLT
AMAZON STOCK PRICE PREDICTION BY USING SMLT
 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial Usecases
 
Machine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptxMachine learning ppt unit one syllabuspptx
Machine learning ppt unit one syllabuspptx
 
Introduction to Machine Learning for Java Developers
Introduction to Machine Learning for Java DevelopersIntroduction to Machine Learning for Java Developers
Introduction to Machine Learning for Java Developers
 

Mehr von Bhaskar Mitra

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
 
What’s next for deep learning for Search?
What’s next for deep learning for Search?What’s next for deep learning for Search?
What’s next for deep learning for Search?Bhaskar Mitra
 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...Bhaskar Mitra
 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Bhaskar Mitra
 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcomeBhaskar Mitra
 

Mehr von Bhaskar Mitra (20)

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and Recommendation
 
What’s next for deep learning for Search?
What’s next for deep learning for Search?What’s next for deep learning for Search?
What’s next for deep learning for Search?
 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...
 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and Recommendation
 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progress
 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcome
 
The Duet model
The Duet modelThe Duet model
The Duet model
 

Kürzlich hochgeladen

Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxpriyankatabhane
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx023NiWayanAnggiSriWa
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologycaarthichand2003
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingNetHelix
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...lizamodels9
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsSérgio Sacani
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxmalonesandreagweneth
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicAditi Jain
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》rnrncn29
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...navyadasi1992
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 

Kürzlich hochgeladen (20)

Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
Davis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technologyDavis plaque method.pptx recombinant DNA technology
Davis plaque method.pptx recombinant DNA technology
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Servosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by PetrovicServosystem Theory / Cybernetic Theory by Petrovic
Servosystem Theory / Cybernetic Theory by Petrovic
 
trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 

Neural Learning to Rank

  • 1. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Neural Learning to Rank Bhaskar Mitra Principal Applied Scientist, Microsoft PhD student, University College London @UnderdogGeek Workshop on Recommender Systems HEC Montréal, August 20-23, 2019
  • 2. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Objectives A quick recap of neural networks The fundamentals of learning to rank A quick recap of deep neural networks Learning to rank with deep neural networks
  • 3. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Reading material An Introduction to Neural Information Retrieval Foundations and Trends® in Information Retrieval (December 2018) Download PDF: http://bit.ly/fntir-neural
  • 4. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Most information retrieval (IR) systems present a ranked list of retrieved artifacts
  • 5. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Learning to Rank (LTR) ”... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 6. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 A quick recap of neural networks
  • 7. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Neural networks Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (σ) Popular choices for σ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backwardpass Expected output loss Tanh ReLU
  • 8. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕𝑙 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Stochastic Gradient Descent (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 9. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = 𝜕 𝑦 − 𝑦2 2 𝜕𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Stochastic Gradient Descent (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 10. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑦2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Stochastic Gradient Descent (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 11. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Goal: iteratively update the learnable parameters such that the loss 𝑙 is minimized Compute the gradient of the loss 𝑙 w.r.t. each parameter (e.g., 𝑤1) 𝜕𝑙 𝜕𝑤1 = −2 × 𝑦 − 𝑦2 × 𝜕𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 𝜕𝑦1 × 𝜕𝑦1 𝜕𝑤1 Update the parameter value based on the gradient with 𝜂 as the learning rate 𝑤1 𝑛𝑒𝑤 = 𝑤1 𝑜𝑙𝑑 − 𝜂 × 𝜕𝑙 𝜕𝑤1 Stochastic Gradient Descent (SGD) Task: regression Training data: 𝑥, 𝑦 pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: 𝑤1, 𝑏1, 𝑤2, 𝑏2 𝑥 𝑦1 𝑦2 𝑙 𝑡𝑎𝑛ℎ 𝑤1. 𝑥 + 𝑏1 𝑦 − 𝑦2 2 𝑦 𝑡𝑎𝑛ℎ 𝑤2. 𝑦1 + 𝑏2 …and repeat
  • 12. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Stochastic Gradient Descent (SGD) Task: regression. Training data: (x, y) pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node), with y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), and loss l = (y − y2)². Learnable parameters: w1, b1, w2, b2. Goal: iteratively update the learnable parameters such that the loss l is minimized. Compute the gradient of the loss l w.r.t. each parameter (e.g., w1): ∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (∂y1/∂w1). Update the parameter value based on the gradient, with η as the learning rate: w1_new = w1_old − η × ∂l/∂w1. …and repeat
  • 13. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Stochastic Gradient Descent (SGD) Task: regression. Training data: (x, y) pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node), with y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), and loss l = (y − y2)². Learnable parameters: w1, b1, w2, b2. Goal: iteratively update the learnable parameters such that the loss l is minimized. Compute the gradient of the loss l w.r.t. each parameter (e.g., w1): ∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (∂tanh(w1·x + b1)/∂w1). Update the parameter value based on the gradient, with η as the learning rate: w1_new = w1_old − η × ∂l/∂w1. …and repeat
  • 14. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Stochastic Gradient Descent (SGD) Task: regression. Training data: (x, y) pairs. Model: NN (1 feature, 1 hidden layer, 1 hidden node), with y1 = tanh(w1·x + b1), y2 = tanh(w2·y1 + b2), and loss l = (y − y2)². Learnable parameters: w1, b1, w2, b2. Goal: iteratively update the learnable parameters such that the loss l is minimized. Compute the gradient of the loss l w.r.t. each parameter (e.g., w1): ∂l/∂w1 = −2 × (y − y2) × (1 − tanh²(w2·y1 + b2)) × w2 × (1 − tanh²(w1·x + b1)) × x. Update the parameter value based on the gradient, with η as the learning rate: w1_new = w1_old − η × ∂l/∂w1. …and repeat
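A minimal NumPy sketch of the SGD walkthrough above, training the same 1-feature, 1-hidden-node regression network; the synthetic data, learning rate, and number of epochs are illustrative assumptions rather than values from the slides:

```python
import numpy as np

# Tiny regression NN from the slides: y1 = tanh(w1*x + b1), y2 = tanh(w2*y1 + b2),
# loss l = (y - y2)^2, trained with SGD on one (x, y) pair at a time.
rng = np.random.default_rng(0)
w1, b1, w2, b2 = rng.normal(size=4) * 0.1
eta = 0.1  # learning rate

# Synthetic training data: a noiseless 1-D target this tiny net can fit (an assumption).
xs = np.linspace(-1.0, 1.0, 50)
ys = 0.5 * np.tanh(2.0 * xs)

for epoch in range(200):
    for x, y in zip(xs, ys):
        # Forward pass
        y1 = np.tanh(w1 * x + b1)
        y2 = np.tanh(w2 * y1 + b2)
        # Backward pass: chain rule, matching the expansion on the slides
        dl_dy2 = -2.0 * (y - y2)
        dy2_dz2 = 1.0 - y2 ** 2           # derivative of tanh at the output node
        dl_dw2 = dl_dy2 * dy2_dz2 * y1
        dl_db2 = dl_dy2 * dy2_dz2
        dl_dy1 = dl_dy2 * dy2_dz2 * w2
        dy1_dz1 = 1.0 - y1 ** 2           # derivative of tanh at the hidden node
        dl_dw1 = dl_dy1 * dy1_dz1 * x
        dl_db1 = dl_dy1 * dy1_dz1
        # SGD updates: w_new = w_old - eta * gradient
        w1 -= eta * dl_dw1; b1 -= eta * dl_db1
        w2 -= eta * dl_dw2; b2 -= eta * dl_db2

final_pred = np.tanh(w2 * np.tanh(w1 * xs + b1) + b2)
print("final mean squared loss:", float(np.mean((ys - final_pred) ** 2)))
```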
  • 15. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Neural models for non-ranking tasks
  • 16. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 The softmax function In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 17. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Cross entropy The cross entropy between two probability distributions p and q over a discrete set of events is given by CE(p, q) = −Σi pi × log(qi). If p_correct = 1 and pi = 0 for all other values of i, then CE(p, q) = −log(q_correct).
  • 18. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Cross entropy with softmax loss Cross entropy with softmax is a popular loss function for classification
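As a concrete illustration of the last two slides, a small NumPy sketch of the softmax function and the cross-entropy-with-softmax loss; the class scores and the one-hot target below are made up for the example:

```python
import numpy as np

def softmax(scores):
    """Normalize raw class scores into a probability distribution."""
    z = scores - np.max(scores)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy between a target distribution p and a predicted distribution q."""
    return float(-np.sum(p * np.log(q + eps)))

scores = np.array([2.0, 0.5, -1.0])      # raw network outputs for 3 classes (illustrative)
p = np.array([1.0, 0.0, 0.0])            # one-hot target: class 0 is the correct class
q = softmax(scores)

# With a one-hot target the cross entropy reduces to -log q[correct class].
print(cross_entropy(p, q), float(-np.log(q[0])))
```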
  • 19. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Questions?
  • 20. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 The fundamentals of learning to rank
  • 21. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Problem formulation LTR models represent a rankable item—e.g., a document or a movie or a song—given some context—e.g., a user-issued query or the user’s historical interactions with other items—as a numerical vector x ∈ ℝⁿ. The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher.
  • 22. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Examples of ranking metrics Discounted Cumulative Gain (DCG): DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1). Reciprocal Rank (RR): RR@k = max_{1≤i≤k} rel_i / i
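A NumPy sketch of the two metrics as defined above; the relevance labels in the example are arbitrary:

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG@k = sum_{i=1..k} (2^rel_i - 1) / log2(i + 1), with rels given in rank order."""
    rels = np.asarray(rels, dtype=float)[:k]
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum((2.0 ** rels - 1.0) / np.log2(ranks + 1)))

def rr_at_k(rels, k):
    """RR@k = max_{1<=i<=k} rel_i / i (for binary labels: 1 / rank of the first relevant item)."""
    rels = np.asarray(rels, dtype=float)[:k]
    if len(rels) == 0:
        return 0.0
    ranks = np.arange(1, len(rels) + 1)
    return float(np.max(rels / ranks))

# Graded relevance labels of a ranked list, best-ranked result first (illustrative values).
print(dcg_at_k([3, 2, 0, 1, 0], k=5))
print(rr_at_k([0, 0, 1, 0, 0], k=5))   # first relevant item at rank 3 -> 1/3
```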
  • 23. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Why is ranking challenging? Rank-based metrics, such as DCG or MRR, are non-smooth/non-differentiable
  • 24. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Approaches Liu [2009] categorizes different LTR approaches based on their training objectives: Pointwise approach Relevance label y_{q,d} is a number—derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}. Pairwise approach Pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) as label. Reduces to binary classification to predict the more relevant document. Listwise approach Directly optimize for a rank-based metric, such as NDCG—difficult because these metrics are often not differentiable w.r.t. model parameters. Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
  • 25. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Features Traditional LTR models employ hand-crafted features that encode IR insights. They can often be categorized as: Query-independent or static features, e.g., incoming link count and document length; Query-dependent or dynamic features, e.g., BM25; Query-level features, e.g., query length
  • 26. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Features Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
  • 27. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Pointwise objectives Regression loss Given ⟨q, d⟩, predict the value of y_{q,d}, e.g., with the square loss ℒ = ‖y_{q,d} − s_{q,d}‖² for binary or categorical labels, where s_{q,d} is the model’s prediction and y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
  • 28. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Pointwise objectives Classification loss Given ⟨q, d⟩, predict the class y_{q,d}, e.g., with cross entropy over a softmax across the categorical labels Y [Li et al., 2008]: ℒ = −log( e^{s_{y_{q,d}}} / Σ_{y∈Y} e^{s_y} ), where s_{y_{q,d}} is the model’s score for label y_{q,d}. Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
  • 29. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Pairwise objectives Pairwise loss minimizes the average number of inversions in ranking—i.e., d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Given ⟨q, d_i, d_j⟩, predict the more relevant document. For ⟨q, d_i⟩ and ⟨q, d_j⟩: feature vectors x_i and x_j; model scores s_i = f(x_i) and s_j = f(x_j). Pairwise loss generally has the form ℒ = 𝜙(s_i − s_j) [Chen et al., 2009], where 𝜙 can be: the hinge function 𝜙(z) = max(0, 1 − z) [Herbrich et al., 2000]; the exponential function 𝜙(z) = e^{−z} [Freund et al., 2003]; the logistic function 𝜙(z) = log(1 + e^{−z}) [Burges et al., 2005]; others. Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
  • 30. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Pairwise objectives RankNet loss Pairwise loss function proposed by Burges et al. [2005]—an industry favourite [Burges, 2015]. Predicted pairwise preference: p̂_ij = p(s_i > s_j) ≡ e^{γ·s_i} / (e^{γ·s_i} + e^{γ·s_j}) = 1 / (1 + e^{−γ·(s_i − s_j)}). Desired probabilities: p_ij = 1 and p_ji = 0. Computing the cross entropy between p and p̂: ℒ_RankNet = −p_ij·log(p̂_ij) − p_ji·log(p̂_ji) = −log(p̂_ij) = log(1 + e^{−γ·(s_i − s_j)}). Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
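A small NumPy sketch of the pairwise losses from the last two slides, with RankNet's logistic form alongside the hinge and exponential alternatives; the scores used in the example are arbitrary:

```python
import numpy as np

def ranknet_loss(s_i, s_j, gamma=1.0):
    """RankNet loss for a pair where document i is preferred over document j:
    L = log(1 + exp(-gamma * (s_i - s_j)))."""
    return float(np.log1p(np.exp(-gamma * (s_i - s_j))))

# Alternative pairwise losses phi(s_i - s_j) from the previous slide:
def hinge_loss(s_i, s_j):        # phi(z) = max(0, 1 - z)
    return max(0.0, 1.0 - (s_i - s_j))

def exponential_loss(s_i, s_j):  # phi(z) = exp(-z)
    return float(np.exp(-(s_i - s_j)))

# Model scores for a preferred document (2.0) vs. a less relevant one (1.2).
print(ranknet_loss(2.0, 1.2), hinge_loss(2.0, 1.2), exponential_loss(2.0, 1.2))
```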
  • 31. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 A generalized cross-entropy loss An alternative loss function assumes a single relevant document d⁺ and compares it against the full collection D. Predicted probability: p(d⁺|q) = e^{γ·s(q,d⁺)} / Σ_{d∈D} e^{γ·s(q,d)}. The cross-entropy loss is then given by ℒ_CE(q, d⁺, D) = −log p(d⁺|q) = −log( e^{γ·s(q,d⁺)} / Σ_{d∈D} e^{γ·s(q,d)} ). Computing the softmax over the full collection is prohibitively expensive—LTR models typically consider a few sampled negative candidates instead [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
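A NumPy sketch of this loss with a handful of assumed sampled negatives standing in for the full collection; the smoothing parameter γ and the scores are illustrative:

```python
import numpy as np

def sampled_softmax_ce(score_pos, scores_neg, gamma=10.0):
    """Cross-entropy loss for one relevant document against a few sampled negatives:
    L = -log( exp(gamma * s+) / (exp(gamma * s+) + sum_neg exp(gamma * s-)) )."""
    scores = gamma * np.concatenate(([score_pos], scores_neg))
    scores = scores - scores.max()               # numerical stability
    return float(-np.log(np.exp(scores[0]) / np.exp(scores).sum()))

# Score of the relevant document and of 4 randomly sampled non-relevant documents.
print(sampled_softmax_ce(0.9, np.array([0.2, 0.4, 0.1, 0.3])))
```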
  • 32. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Listwise objectives [Burges, 2010] Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. But listwise metrics are non-continuous and non-differentiable. [Figure: two rankings with blue = relevant and gray = non-relevant documents; NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors.] Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
  • 33. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Listwise objectives LambdaRank loss Burges et al. [2006] make two observations: 1. To train a model we don’t need the costs themselves, only the gradients (of the costs w.r.t. the model scores). 2. It is desirable that the gradient be bigger for pairs of documents that produce a bigger impact on NDCG when their positions are swapped. LambdaRank: multiply the actual gradients by the change in NDCG obtained by swapping the rank positions of the two documents. Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
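A NumPy sketch of the LambdaRank idea under the simplifying assumption that the ideal-DCG normalizer is dropped (so ΔDCG stands in for ΔNDCG); the labels, scores, and γ below are illustrative:

```python
import numpy as np

def delta_dcg(rels, i, j):
    """Absolute change in DCG (and hence in NDCG, up to the constant ideal-DCG normalizer,
    which this sketch omits) from swapping the items at 0-based rank positions i and j."""
    gain = lambda r: 2.0 ** r - 1.0
    disc = lambda pos: 1.0 / np.log2(pos + 2)   # pos is 0-based, so rank = pos + 1
    return abs((gain(rels[i]) - gain(rels[j])) * (disc(i) - disc(j)))

def lambda_ij(s_i, s_j, rels, i, j, gamma=1.0):
    """LambdaRank 'gradient' for a pair where item i is more relevant than item j:
    the RankNet gradient scaled by the |delta NDCG| of swapping the two positions."""
    ranknet_grad = -gamma / (1.0 + np.exp(gamma * (s_i - s_j)))
    return ranknet_grad * delta_dcg(rels, i, j)

# Current ranking: graded labels and model scores, in rank order (illustrative).
rels = [3, 0, 2, 1]
scores = [2.1, 1.9, 1.5, 0.7]
# A relevant item (position 2, rel 2) is ranked below a non-relevant one (position 1, rel 0).
print(lambda_ij(scores[2], scores[1], rels, i=2, j=1))
```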
  • 34. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Listwise objectives According to the Luce model [Luce, 1959], given four items {d1, d2, d3, d4}, the probability of observing a particular rank-order π is p(π) = Π_i φ(s_{π(i)}) / Σ_{j≥i} φ(s_{π(j)}), where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i. For example, p(⟨d2, d1, d4, d3⟩) = φ(s2)/(φ(s1)+φ(s2)+φ(s3)+φ(s4)) × φ(s1)/(φ(s1)+φ(s3)+φ(s4)) × φ(s4)/(φ(s3)+φ(s4)). ListNet loss Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive. ListMLE loss Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one ideal permutation may be possible. R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
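A NumPy sketch of the commonly used top-one approximation of the ListNet loss (cross entropy between the softmax over labels and the softmax over scores); this is a simplification of the full permutation-distribution loss described above, and the scores and labels are illustrative:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def listnet_top1_loss(scores, labels):
    """ListNet with the top-one approximation: cross entropy (equivalently, K-L divergence
    up to a constant) between the label distribution and the model-score distribution."""
    p_labels = softmax(labels)
    p_scores = softmax(scores)
    return float(-np.sum(p_labels * np.log(p_scores + 1e-12)))

# Model scores and graded relevance labels for one query's candidate list.
print(listnet_top1_loss(scores=[1.2, 0.3, 2.0], labels=[2, 0, 3]))
```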
  • 35. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Listwise objectives Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009. Smooth DCG Wu et al. [2009] compute a “smooth” rank of documents as a function of their scores This “smooth” rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss
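A hedged NumPy sketch of the smooth-rank idea: it uses a simple sigmoid-based soft rank to illustrate plugging a differentiable rank into DCG, not Wu et al.'s exact smoothed-hinge construction; scores, labels, and the temperature are illustrative:

```python
import numpy as np

def soft_rank(scores, temperature=1.0):
    """A differentiable surrogate for rank: for each document, softly count how many
    other documents score higher than it, then add 1."""
    s = np.asarray(scores, dtype=float)
    diffs = s[None, :] - s[:, None]                  # diffs[i, j] = s_j - s_i
    soft_greater = 1.0 / (1.0 + np.exp(-diffs / temperature))
    np.fill_diagonal(soft_greater, 0.0)
    return 1.0 + soft_greater.sum(axis=1)

def smooth_dcg(scores, rels, temperature=1.0):
    """Plug the smooth rank into the DCG formula to obtain a differentiable ranking objective."""
    r = soft_rank(scores, temperature)
    gains = 2.0 ** np.asarray(rels, dtype=float) - 1.0
    return float(np.sum(gains / np.log2(r + 1.0)))

print(smooth_dcg(scores=[2.0, 0.5, 1.0], rels=[3, 0, 1]))
```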
  • 36. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Questions?
  • 37. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 A quick recap of deep neural networks
  • 38. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Types of vector representations Local (or one-hot) representation Every term in vocabulary T is represented by a binary vector of length |T|, where one position in the vector is set to one and the rest to zero Distributed representation Every term in vocabulary T is represented by a real-valued vector of length k. The vector can be sparse or dense. The vector dimensions may be observed (e.g., hand-crafted features) or latent (e.g., embedding dimensions).
  • 39. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Different modalities of input text representation
  • 40. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Different modalities of input text representation
  • 41. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Different modalities of input text representation
  • 42. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Different modalities of input text representation
  • 43. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Shift-invariant neural operations Detecting a pattern in one part of the input space is similar to detecting it in another. Leverage redundancy by moving a window over the whole input space and then aggregating. On each instance of the window a kernel—also known as a filter or a cell—is applied. Different aggregation strategies lead to different architectures.
  • 44. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Convolution Move the window over the input space, each time applying the same cell over the window. A typical cell operation can be h = σ(W·X + b). Full Input: [words × in_channels]. Cell Input: [window × in_channels]. Cell Output: [1 × out_channels]. Full Output: [1 + (words − window)/stride × out_channels]
  • 45. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Pooling Move the window over the input space, each time applying an aggregate function over each dimension within the window: h_j = max_{i∈win} X_{i,j} (max-pooling) or h_j = avg_{i∈win} X_{i,j} (average-pooling). Full Input: [words × channels]. Cell Input: [window × channels]. Cell Output: [1 × channels]. Full Output: [1 + (words − window)/stride × channels]
  • 46. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Convolution w/ Global Pooling Stacking a global pooling layer on top of a convolutional layer is a common strategy for generating a fixed length embedding for a variable length text Full Input [words x in_channels] Full Output [1 x out_channels]
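A NumPy sketch of a 1-D convolution followed by global max pooling, matching the shapes annotated on the slides; the tanh non-linearity and the random toy input are assumptions for illustration:

```python
import numpy as np

def conv1d(X, W, b, stride=1):
    """1-D convolution over a [words x in_channels] input with a
    [window x in_channels x out_channels] kernel; each window yields one [out_channels] vector."""
    window = W.shape[0]
    steps = 1 + (X.shape[0] - window) // stride
    out = np.empty((steps, W.shape[2]))
    for t in range(steps):
        patch = X[t * stride : t * stride + window]            # [window x in_channels]
        out[t] = np.tanh(np.tensordot(patch, W, axes=([0, 1], [0, 1])) + b)
    return out                                                  # [steps x out_channels]

def global_max_pool(H):
    """Collapse the variable-length axis, producing a fixed-length embedding."""
    return H.max(axis=0)

rng = np.random.default_rng(0)
words, in_ch, out_ch, window = 12, 8, 16, 3
X = rng.normal(size=(words, in_ch))            # e.g., 12 word embeddings of size 8 (toy input)
W = rng.normal(size=(window, in_ch, out_ch)) * 0.1
b = np.zeros(out_ch)

embedding = global_max_pool(conv1d(X, W, b))    # fixed-length [out_channels] vector
print(embedding.shape)                          # (16,)
```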
  • 47. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Recurrent neural network Similar to a convolution layer but with an additional dependency on the previous hidden state. A simple cell operation is shown below, but others like LSTMs and GRUs are more popular in practice: h_i = σ(W·X_i + U·h_{i−1} + b). Full Input: [words × in_channels]. Cell Input: [window × in_channels] + [1 × out_channels]. Cell Output: [1 × out_channels]. Full Output: [1 × out_channels]
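A NumPy sketch of the simple recurrent cell above (in practice LSTM or GRU cells would replace it); the toy input and weight scales are illustrative:

```python
import numpy as np

def simple_rnn(X, W, U, b):
    """Elman-style recurrent layer: h_i = sigma(W x_i + U h_{i-1} + b); the final hidden
    state serves as a fixed-length embedding of the variable-length input."""
    h = np.zeros(U.shape[0])
    for x in X:                                  # process one word (input vector) at a time
        h = np.tanh(W @ x + U @ h + b)
    return h

rng = np.random.default_rng(0)
words, in_ch, out_ch = 12, 8, 16
X = rng.normal(size=(words, in_ch))              # toy sequence of 12 word vectors
W = rng.normal(size=(out_ch, in_ch)) * 0.1       # input-to-hidden weights
U = rng.normal(size=(out_ch, out_ch)) * 0.1      # hidden-to-hidden (recurrent) weights
b = np.zeros(out_ch)

print(simple_rnn(X, W, U, b).shape)              # (16,)
```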
  • 48. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Questions?
  • 49. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Learning to rank with deep neural networks
  • 50. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Representation learning for IR Many IR scenarios—e.g., web search and content-based filtering for recommender systems—involve matching items based on their descriptions Deep learning models can be useful for learning good representations of items for matching, i.e., LTR with raw inputs instead of hand-engineered features
  • 51. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 DNN for YouTube recommendation Input: user profile Output: probability distribution over items to be recommended Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In RecSys, 2016.
  • 52. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Siamese networks Relevance estimated by cosine similarity between item embeddings Input: character trigraph counts (bag of words assumption) Minimizes cross-entropy loss against randomly sampled negative documents Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
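A heavily simplified NumPy sketch of the siamese scoring idea: one shared encoder maps both the query and the document bag of features (e.g., character trigraph counts) to embeddings, and relevance is estimated by their cosine similarity; the encoder, vocabulary size, and toy inputs are hypothetical, and the training loop against randomly sampled negatives is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 5000, 128
W_enc = rng.normal(size=(VOCAB, DIM)) * 0.01    # shared (siamese) encoder weights, untrained here

def encode(bag_of_features):
    """Map a sparse bag-of-features vector to an embedding with the shared encoder;
    both query and document pass through the same weights."""
    return np.tanh(bag_of_features @ W_enc)

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy bag-of-trigraph count vectors; the indices are arbitrary.
query = np.zeros(VOCAB);   query[[11, 42, 97]] = 1.0
doc_pos = np.zeros(VOCAB); doc_pos[[11, 42, 310]] = 1.0
doc_neg = np.zeros(VOCAB); doc_neg[[7, 2048, 4001]] = 1.0

# Relevance score = cosine similarity between the two embeddings.
print(cosine(encode(query), encode(doc_pos)), cosine(encode(query), encode(doc_neg)))
```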
  • 53. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Wide and deep model Deep model for representation learning and wide model for memorization Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, et al. Wide & deep learning for recommender systems. In workshop on deep learning for recommender systems, 2016.
  • 54. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Lexical and semantic matching networks Mitra et al. [2017] argue that both lexical and semantic matching are important for document ranking The Duet model is a linear combination of two DNNs—focusing on lexical and semantic matching, respectively—jointly trained on labelled data code: http://bit.ly/duetv2code Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
  • 55. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Large scale pretrained language models BERT (and other large-scale unsupervised language models) are demonstrating dramatic performance improvements on many IR tasks Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019. Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. In arXiv, 2019.
  • 56. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Dealing with multiple fields Real world items have multiple sources of descriptions—creates unique challenges for representation learning models Hamed Zamani, Bhaskar Mitra, Xia Song, Nick Craswell, and Saurabh Tiwary. Neural ranking models with multiple document fields. In WSDM, 2018. Juan Li, Zhicheng Dou, Yutao Zhu, Xiaochen Zuo, and Ji-Rong Wen. Deep cross-platform product matching in e-commerce. In IRJ, 2019.
  • 57. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Questions?
  • 58. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Key takeaways • Learning to Rank is effective on a broad range of IR tasks • Optimization of non-smooth rank-based metrics is challenging, but loss functions such as RankNet and LambdaRank are effective in practice • Learning-to-rank models can operate over hand-engineered features or employ deep architectures to learn useful representations from raw input • Large-scale pretrained language models demonstrate strong performance on tasks that involve text matching
  • 59. Workshop on Recommender Systems HEC Montréal, August 20-23, 2019 Thank you @UnderdogGeek bmitra@microsoft.com