Neural Learning to Rank
Bhaskar Mitra
Principal Applied Scientist, Microsoft
PhD candidate, University College London
@UnderdogGeek
Topics
A quick recap of neural networks
The fundamentals of learning to rank
Reading material
An Introduction to
Neural Information Retrieval
Foundations and Trends® in Information Retrieval
(December 2018)
Download PDF: http://bit.ly/fntir-neural
Most information retrieval
(IR) systems present a ranked
list of retrieved artifacts
Learning to Rank (LTR)
โ€... the task to automatically construct a
ranking model using training data, such
that the model can sort new objects
according to their degrees of relevance,
preference, or importance.โ€
- Liu [2009]
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
A quick recap of
neural networks
Vectors, matrices,
and tensors
Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66
Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
matrix transpose matrix addition
dot product matrix multiplication
Supervised learning
Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
Neural networks
Chains of parameterized linear transforms (e.g., multiply weight, add
bias) followed by non-linear functions (σ)
Popular choices for σ: Tanh, ReLU
Parameters trained using backpropagation
E2E training over millions of samples in batched mode
Many choices of architecture and hyper-parameters
[Diagram: input → linear transform → non-linearity → linear transform → non-linearity → predicted output; the forward pass computes the predicted output and the loss against the expected output, and the backward pass propagates gradients back through the network]
Basic machine
learning tasks
Squared loss
The squared loss is a popular loss function for regression tasks
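The formula on the slide is an image; a standard form, consistent with the loss used in the gradient-descent example later (assuming target $y$ and model prediction $\hat{y}$):
$$\mathcal{L}_{squared} = (y - \hat{y})^2$$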
The softmax function
In neural classification models, the softmax function is popularly used
to normalize the neural network output scores across all the classes
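A standard statement of the softmax (the slide shows it as an image), assuming $z_i$ is the network's output score for class $i$:
$$p_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$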
Cross entropy
The cross entropy between two
probability distributions $p$ and $q$
over a discrete set of events is
given by,
If $p_{correct} = 1$ and $p_i = 0$ for all
other values of $i$, then,
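Standard forms of the two expressions referenced above (the slide shows them as images):
$$CE(p, q) = -\sum_{i} p_i \log(q_i)$$
$$CE(p, q) = -\log(q_{correct})$$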
Cross entropy with
softmax loss
Cross entropy with softmax is a popular loss
function for classification
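A standard form of this loss (assumed; the slide's formula is an image), with $z_i$ the network's score for class $i$:
$$\mathcal{L}_{CE} = -\log\left(\frac{e^{z_{correct}}}{\sum_{j} e^{z_j}}\right)$$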
We are given training data: $\langle x, y \rangle$ pairs, where $x$ is input and $y$ is expected output
Step 1: Define model and randomly initialize learnable model parameters
Step 2: Given $x$, compute model output
Step 3: Given model output and $y$, compute loss $l$
Step 4: Compute gradient $\frac{\partial l}{\partial w}$ of loss $l$ w.r.t. each parameter $w$
Step 5: Update parameter as $w^{new} = w^{old} - \eta \times \frac{\partial l}{\partial w}$, where $\eta$ is the learning rate
Step 6: Go back to step 2 and repeat till convergence
Gradient Descent
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = \frac{\partial l}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = \frac{\partial (y - y_2)^2}{\partial y_2} \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial y_2}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \frac{\partial \tanh(w_2 \cdot y_1 + b_2)}{\partial y_1} \times \frac{\partial y_1}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial y_1}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \frac{\partial \tanh(w_1 \cdot x + b_1)}{\partial w_1}$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Goal: iteratively update the learnable parameters such that the loss $l$ is minimized
Compute the gradient of the loss $l$ w.r.t. each parameter (e.g., $w_1$)
$$\frac{\partial l}{\partial w_1} = -2 \times (y - y_2) \times \left(1 - \tanh^2(w_2 \cdot y_1 + b_2)\right) \times w_2 \times \left(1 - \tanh^2(w_1 \cdot x + b_1)\right) \times x$$
Update the parameter value based on the gradient, with $\eta$ as the learning rate
$$w_1^{new} = w_1^{old} - \eta \times \frac{\partial l}{\partial w_1}$$
Gradient Descent
Task: regression
Training data: $(x, y)$ pairs
Model: NN (1 feature, 1 hidden layer, 1 hidden node)
Learnable parameters: $w_1, b_1, w_2, b_2$
Network: $y_1 = \tanh(w_1 \cdot x + b_1)$, $y_2 = \tanh(w_2 \cdot y_1 + b_2)$, loss $l = (y - y_2)^2$
…and repeat
Exercise
Simple Neural Network from Scratch
Implement a simple multi-layer neural network
with single input feature, single output, and
single neuron per layer using (i) PyTorch and
(ii) from scratch, and demonstrate that both
approaches produce identical outcomes.
https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
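For reference, a minimal from-scratch sketch of the network used in the gradient-descent walk-through above (1 input feature, 1 hidden node, tanh activations, squared loss). The toy data and hyper-parameters are illustrative assumptions; the linked notebook remains the canonical solution:

```python
import math
import random

# Network from the walk-through above:
#   y1 = tanh(w1*x + b1), y2 = tanh(w2*y1 + b2), loss l = (y - y2)^2
w1, b1, w2, b2 = (random.uniform(-1, 1) for _ in range(4))
eta = 0.1  # learning rate

def forward(x):
    y1 = math.tanh(w1 * x + b1)
    y2 = math.tanh(w2 * y1 + b2)
    return y1, y2

# Toy training data (illustrative): learn y = 0.5 * x on a few points
data = [(i / 10.0, 0.5 * i / 10.0) for i in range(-9, 10)]

for epoch in range(1000):
    for x, y in data:
        y1, y2 = forward(x)
        # Gradients from the chain rule derived on the preceding slides
        dl_dy2 = -2 * (y - y2)
        dy2_dy1 = (1 - y2 ** 2) * w2
        dl_dw2 = dl_dy2 * (1 - y2 ** 2) * y1
        dl_db2 = dl_dy2 * (1 - y2 ** 2)
        dl_dw1 = dl_dy2 * dy2_dy1 * (1 - y1 ** 2) * x
        dl_db1 = dl_dy2 * dy2_dy1 * (1 - y1 ** 2)
        # Gradient descent updates: w_new = w_old - eta * dl/dw
        w1 -= eta * dl_dw1
        b1 -= eta * dl_db1
        w2 -= eta * dl_dw2
        b2 -= eta * dl_db2

print("loss at x=0.4:", (0.2 - forward(0.4)[1]) ** 2)
```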
Computation
Networks
The “Lego” approach to specifying neural architectures
Library of neural layers, each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass:
a) compute gradient of layer output w.r.t. layer inputs
b) compute gradient of layer output w.r.t. layer parameters (if any)
Chain nodes to create bigger and more complex networks
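A minimal sketch of the “Lego” idea with two toy scalar layers, each exposing a forward and a backward pass; this is illustrative only, not any particular framework's API:

```python
import math

class Linear:
    """Toy scalar linear-transform node: stores its input on the forward pass."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, x):
        self.x = x
        return self.w * x + self.b
    def backward(self, grad_out):
        # Gradients of the layer output w.r.t. its parameters (stored) and its input (returned)
        self.grad_w = grad_out * self.x
        self.grad_b = grad_out
        return grad_out * self.w

class Tanh:
    """Toy non-linearity node: no parameters, so only an input gradient."""
    def forward(self, x):
        self.out = math.tanh(x)
        return self.out
    def backward(self, grad_out):
        return grad_out * (1.0 - self.out ** 2)  # d tanh(x)/dx = 1 - tanh(x)^2

# Chain nodes to build a bigger network: Linear -> Tanh -> Linear -> Tanh
layers = [Linear(0.5, 0.0), Tanh(), Linear(-0.3, 0.1), Tanh()]
out = 1.0
for layer in layers:              # forward pass
    out = layer.forward(out)
grad = 1.0
for layer in reversed(layers):    # backward pass
    grad = layer.backward(grad)
```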
Why adding depth helps
http://playground.tensorflow.org
Bias-Variance trade-off
https://medium.com/@akgone38/what-the-heck-bias-variance-tradeoff-is-fe4681c0e71b
Bias-variance trade-off in the deep
learning era
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. In PNAS, 2019.
Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019.
The lottery ticket
hypothesis
Questions?
The fundamentals of
learning to rank
Problem formulation
LTR models represent a rankable item (e.g., a document, a movie, or a
song), given some context (e.g., a user-issued query or the user's historical
interactions with other items), as a numerical vector $x \in \mathbb{R}^n$
The ranking model $f: x \rightarrow \mathbb{R}$ is trained to map the vector to a real-valued
score such that relevant items are scored higher.
Why is ranking challenging?
Examples of ranking
metrics
Discounted Cumulative Gain (DCG)
$$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$
Reciprocal Rank (RR)
$$RR@k = \max_{1 \leq i \leq k} \frac{rel_i}{i}$$
Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable
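A small sketch computing the two metrics above from a list of graded relevance labels in ranked order; the function names and toy labels are illustrative:

```python
import math

def dcg_at_k(rels, k):
    """DCG@k = sum over i of (2^rel_i - 1) / log2(i + 1), with ranks i starting at 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(rels[:k], start=1))

def rr_at_k(rels, k):
    """Reciprocal rank over the top k: max of rel_i / i (binary relevance labels)."""
    return max((rel / i for i, rel in enumerate(rels[:k], start=1)), default=0.0)

# Relevance labels of the ranked documents, in rank order
print(dcg_at_k([3, 2, 0, 1], k=4))  # ~9.32
print(rr_at_k([0, 0, 1, 0], k=4))   # 1/3
```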
Features
They can often be categorized as:
Query-independent or static features
e.g., incoming link count and document length
Query-dependent or dynamic features
e.g., BM25
Query-level features
e.g., query length
Traditional L2R models employ
hand-crafted features that
encode IR insights
Features
Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
Approaches
Pointwise approach
Relevance label $y_{q,d}$ is a number, derived from binary or graded human
judgments or implicit user feedback (e.g., CTR). Typically, a regression or
classification model is trained to predict $y_{q,d}$ given $x_{q,d}$.
Pairwise approach
Pairwise preference between documents for a query ($d_i \succ d_j$ w.r.t. $q$) as
label. Reduces to binary classification to predict the more relevant document.
Listwise approach
Directly optimize for a rank-based metric, such as NDCG; difficult because
these metrics are often not differentiable w.r.t. model parameters.
Liu [2009] categorizes
different LTR approaches
based on training objectives:
Tie-Yan Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 2009.
Pointwise objectives
Regression loss
Given $\langle q, d \rangle$, predict the value of $y_{q,d}$
e.g., square loss for binary or categorical
labels (a standard form is given after the references below),
where $y_{q,d}$ is the one-hot representation
[Fuhr, 1989] or the actual value [Cossock and
Zhang, 2006] of the label
Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989.
David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
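The square loss referenced above, in a standard form (an assumption, since the slide shows it as an image), where $s(q, d)$ is the model's output:
$$\mathcal{L}_{squared} = \left\lVert y_{q,d} - s(q, d) \right\rVert^2$$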
Pointwise objectives
Classification loss
Given $\langle q, d \rangle$, predict the class $y_{q,d}$
e.g., cross-entropy with softmax over
categorical labels $Y$ [Li et al., 2008] (a standard form is given after the reference below),
where $s_{y_{q,d}}$ is the model's score for label $y_{q,d}$
Ping Li, Qiang Wu, and Christopher J Burges. Mcrank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
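A standard form of this loss (assumed; the slide's formula is an image):
$$p(y_{q,d} \mid q, d) = \frac{e^{\gamma \cdot s_{y_{q,d}}}}{\sum_{y \in Y} e^{\gamma \cdot s_y}} \qquad \mathcal{L}_{CE} = -\log\left(p(y_{q,d} \mid q, d)\right)$$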
Pairwise objectives
Pairwise loss generally has the following form [Chen et al., 2009] (stated after the references below),
where $\phi$ can be,
• Hinge function $\phi(z) = \max(0, 1 - z)$ [Herbrich et al., 2000]
• Exponential function $\phi(z) = e^{-z}$ [Freund et al., 2003]
• Logistic function $\phi(z) = \log(1 + e^{-z})$ [Burges et al., 2005]
• Others…
Pairwise loss minimizes the average number of
inversions in ranking, i.e., cases where $d_i \succ d_j$ w.r.t. $q$ but $d_j$ is
ranked higher than $d_i$
Given $\langle q, d_i, d_j \rangle$, predict the more relevant document
For $\langle q, d_i \rangle$ and $\langle q, d_j \rangle$,
Feature vectors: $x_i$ and $x_j$
Model scores: $s_i = f(x_i)$ and $s_j = f(x_j)$
Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009.
Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000.
Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003.
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
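The general form referenced at the top of the slide, stated here as an assumption (the slide shows it as an image): for pairs with $d_i \succ d_j$ w.r.t. $q$,
$$\mathcal{L}_{pairwise} = \phi(s_i - s_j)$$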
Pairwise objectives
RankNet loss
Pairwise loss function proposed by Burges et al. [2005], an industry favourite
[Burges, 2015]
Predicted probabilities: $p_{ij} = p(s_i > s_j) \equiv \frac{e^{\gamma \cdot s_i}}{e^{\gamma \cdot s_i} + e^{\gamma \cdot s_j}} = \frac{1}{1 + e^{-\gamma (s_i - s_j)}}$
Desired probabilities: $\bar{p}_{ij} = 1$ and $\bar{p}_{ji} = 0$
Computing the cross-entropy between $\bar{p}$ and $p$,
$$\mathcal{L}_{RankNet} = -\bar{p}_{ij} \cdot \log(p_{ij}) - \bar{p}_{ji} \cdot \log(p_{ji}) = -\log(p_{ij}) = \log\left(1 + e^{-\gamma (s_i - s_j)}\right)$$
Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
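A sketch of the RankNet loss in PyTorch; the scoring model, feature dimensions, and data are illustrative assumptions, with $\gamma$ folded into the model's scores:

```python
import torch
import torch.nn as nn

# Illustrative scoring model: maps a 10-dimensional feature vector to a real-valued score
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

def ranknet_loss(x_i, x_j):
    """RankNet loss for pairs where document i is preferred over document j:
    log(1 + exp(-(s_i - s_j)))."""
    s_i, s_j = model(x_i), model(x_j)
    return torch.log1p(torch.exp(-(s_i - s_j))).mean()

# Toy batch of 8 (preferred, non-preferred) feature-vector pairs
x_i, x_j = torch.randn(8, 10), torch.randn(8, 10)
loss = ranknet_loss(x_i, x_j)
loss.backward()
```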
A generalized cross-entropy loss
An alternative loss function assumes a single relevant document $d^+$ and compares it
against the full collection $D$
Predicted probabilities: $p(d^+|q) = \frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}$
The cross-entropy loss is then given by,
$$\mathcal{L}_{CE}(q, d^+, D) = -\log\left(p(d^+|q)\right) = -\log\left(\frac{e^{\gamma \cdot s(q, d^+)}}{\sum_{d \in D} e^{\gamma \cdot s(q, d)}}\right)$$
Computing the softmax over the full collection is prohibitively expensive; LTR models
typically consider only a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]
Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014.
Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
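A sketch of this loss computed over one relevant document and a handful of sampled negatives rather than the full collection; names and tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_cross_entropy(score_pos, score_negs, gamma=1.0):
    """Cross-entropy over one relevant document and a few sampled negatives.
    score_pos: (batch,) scores s(q, d+); score_negs: (batch, n_neg) scores s(q, d-)."""
    scores = gamma * torch.cat([score_pos.unsqueeze(1), score_negs], dim=1)
    labels = torch.zeros(scores.size(0), dtype=torch.long)  # the positive is always at index 0
    return F.cross_entropy(scores, labels)

loss = sampled_softmax_cross_entropy(torch.randn(16), torch.randn(16, 4))
```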
[Figure: two example rankings; blue = relevant, gray = non-relevant]
NDCG and ERR are higher for the left ranking, but the
right ranking has fewer pairwise errors
Due to strong position-based discounting in
IR measures, errors at higher ranks are much
more problematic than at lower ranks
But listwise metrics are non-continuous and
non-differentiable
LISTWISE
OBJECTIVES
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview. Learning, 2010.
[Burges, 2010]
Listwise objectives
Burges et al. [2006] make two observations:
1. To train a model we don't need the costs
themselves, only the gradients (of the costs
w.r.t. model scores)
2. It is desired that the gradient be bigger for
pairs of documents that produce a bigger
impact on NDCG when swapping positions
Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
LambdaRank loss
Multiply actual gradients with the change in
NDCG by swapping the rank positions of the
two documents
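A commonly cited form of the resulting pairwise gradients, the RankNet gradient scaled by the NDCG change from swapping the pair (an assumption based on Burges's overview, not reproduced from the slide):
$$\lambda_{ij} = \frac{-\gamma}{1 + e^{\gamma (s_i - s_j)}} \cdot \left| \Delta NDCG_{ij} \right|$$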
Listwise objectives
According to the Luce model [Luce, 2005],
given four items $\{d_1, d_2, d_3, d_4\}$, the probability
of observing a particular rank-order, say
$\langle d_2, d_1, d_4, d_3 \rangle$, is given by the product shown after the references below,
where $\pi$ is a particular permutation and $\phi$ is a
transformation (e.g., linear, exponential, or
sigmoid) over the score $s_i$ corresponding to
item $d_i$
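The permutation probability referenced above, in the standard Plackett–Luce form (an assumption; the slide shows it as an image):
$$p(\langle d_2, d_1, d_4, d_3 \rangle) = \frac{\phi(s_2)}{\sum_{i \in \{1,2,3,4\}} \phi(s_i)} \times \frac{\phi(s_1)}{\sum_{i \in \{1,3,4\}} \phi(s_i)} \times \frac{\phi(s_4)}{\sum_{i \in \{3,4\}} \phi(s_i)} \times \frac{\phi(s_3)}{\phi(s_3)}$$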
R Duncan Luce. Individual choice behavior. 1959.
Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007.
Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
ListNet loss
Cao et al. [2007] propose to compute the
probability distribution over all possible
permutations based on model score and ground-
truth labels. The loss is then given by the K-L
divergence between these two distributions.
This is computationally very costly;
computing permutations of only the top-K items
makes it slightly less prohibitive.
ListMLE loss
Xia et al. [2008] propose to compute the
probability of the ideal permutation based on the
ground truth. However, with categorical labels
more than one permutation is possible.
Listwise objectives
Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
Smooth DCG
Wu et al. [2009] compute a “smooth” rank of
documents as a function of their scores
This “smooth” rank can be plugged into a
ranking metric, such as MRR or DCG, to
produce a smooth ranking loss
Questions?
@UnderdogGeek bmitra@microsoft.com
Weitere รคhnliche Inhalte

Was ist angesagt?

Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search resultsGanesh Venkataraman
ย 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender SystemsJustin Basilico
ย 
Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep LearningYan Xu
ย 
Mask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationMask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationDat Nguyen
ย 
Embedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking systemEmbedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking systemMarsan Ma
ย 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
ย 
Deep learning based recommender systems (lab seminar paper review)
Deep learning based recommender systems (lab seminar paper review)Deep learning based recommender systems (lab seminar paper review)
Deep learning based recommender systems (lab seminar paper review)hyunsung lee
ย 
CNN Tutorial
CNN TutorialCNN Tutorial
CNN TutorialSungjoon Choi
ย 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introductionLiang Xiang
ย 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - IntroductionJungwon Kim
ย 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryAndrii Gakhov
ย 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNNChanuk Lim
ย 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial NetworksMark Chang
ย 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Simplilearn
ย 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...Bhaskar Mitra
ย 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkKnoldus Inc.
ย 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Muhammad Haroon
ย 

Was ist angesagt? (20)

Learn to Rank search results
Learn to Rank search resultsLearn to Rank search results
Learn to Rank search results
ย 
Deep Learning for Recommender Systems
Deep Learning for Recommender SystemsDeep Learning for Recommender Systems
Deep Learning for Recommender Systems
ย 
Optimization in Deep Learning
Optimization in Deep LearningOptimization in Deep Learning
Optimization in Deep Learning
ย 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
ย 
Mask-RCNN for Instance Segmentation
Mask-RCNN for Instance SegmentationMask-RCNN for Instance Segmentation
Mask-RCNN for Instance Segmentation
ย 
Embedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking systemEmbedded based retrieval in modern search ranking system
Embedded based retrieval in modern search ranking system
ย 
Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
ย 
Deep learning based recommender systems (lab seminar paper review)
Deep learning based recommender systems (lab seminar paper review)Deep learning based recommender systems (lab seminar paper review)
Deep learning based recommender systems (lab seminar paper review)
ย 
CNN Tutorial
CNN TutorialCNN Tutorial
CNN Tutorial
ย 
Recommender system introduction
Recommender system   introductionRecommender system   introduction
Recommender system introduction
ย 
Graph Neural Network - Introduction
Graph Neural Network - IntroductionGraph Neural Network - Introduction
Graph Neural Network - Introduction
ย 
Recurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: TheoryRecurrent Neural Networks. Part 1: Theory
Recurrent Neural Networks. Part 1: Theory
ย 
Mask R-CNN
Mask R-CNNMask R-CNN
Mask R-CNN
ย 
Generative Adversarial Networks
Generative Adversarial NetworksGenerative Adversarial Networks
Generative Adversarial Networks
ย 
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
Recurrent Neural Network (RNN) | RNN LSTM Tutorial | Deep Learning Course | S...
ย 
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
So, You Want to Release a Dataset? Reflections on Benchmark Development, Comm...
ย 
Generative adversarial networks
Generative adversarial networksGenerative adversarial networks
Generative adversarial networks
ย 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
ย 
Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)Convolutional Neural Network (CNN)
Convolutional Neural Network (CNN)
ย 
Generative adversarial text to image synthesis
Generative adversarial text to image synthesisGenerative adversarial text to image synthesis
Generative adversarial text to image synthesis
ย 

ร„hnlich wie Neural Learning to Rank

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to RankBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Universitat Politรจcnica de Catalunya
ย 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
ย 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfssuser7f0b19
ย 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptxssuserf07225
ย 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptxEmanAl15
ย 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesArvind Rapaka
ย 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxPlacementsBCA
ย 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analyticsCollin Bennett
ย 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdfMariaKhan905189
ย 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with pythonSimone Piunno
ย 
ๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboostๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน AdaboostShocky1
ย 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfTigabu Yaya
ย 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnDataRobot
ย 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxRakshaAgrawal21
ย 

ร„hnlich wie Neural Learning to Rank (20)

Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
ย 
Neural Learning to Rank
Neural Learning to RankNeural Learning to Rank
Neural Learning to Rank
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
Multilayer Perceptron (DLAI D1L2 2017 UPC Deep Learning for Artificial Intell...
ย 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
ย 
Lesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdfLesson_8_DeepLearning.pdf
Lesson_8_DeepLearning.pdf
ย 
DeepLearningLecture.pptx
DeepLearningLecture.pptxDeepLearningLecture.pptx
DeepLearningLecture.pptx
ย 
19 - Neural Networks I.pptx
19 - Neural Networks I.pptx19 - Neural Networks I.pptx
19 - Neural Networks I.pptx
ย 
Big data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial UsecasesBig data 2.0, deep learning and financial Usecases
Big data 2.0, deep learning and financial Usecases
ย 
Big Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptxBig Data Analytics - Unit 3.pptx
Big Data Analytics - Unit 3.pptx
ย 
Building and deploying analytics
Building and deploying analyticsBuilding and deploying analytics
Building and deploying analytics
ย 
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018Backpropagation - Elisa Sayrol - UPC Barcelona 2018
Backpropagation - Elisa Sayrol - UPC Barcelona 2018
ย 
20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf20MEMECH Part 3- Classification.pdf
20MEMECH Part 3- Classification.pdf
ย 
Xgboost
XgboostXgboost
Xgboost
ย 
Neural networks with python
Neural networks with pythonNeural networks with python
Neural networks with python
ย 
ๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboostๆœบๅ™จๅญฆไน Adaboost
ๆœบๅ™จๅญฆไน Adaboost
ย 
ML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdfML_basics_lecture1_linear_regression.pdf
ML_basics_lecture1_linear_regression.pdf
ย 
Gradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learnGradient Boosted Regression Trees in scikit-learn
Gradient Boosted Regression Trees in scikit-learn
ย 
Dive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptxDive into Machine Learning Event MUGDSC.pptx
Dive into Machine Learning Event MUGDSC.pptx
ย 

Mehr von Bhaskar Mitra

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
ย 
Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Bhaskar Mitra
ย 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Bhaskar Mitra
ย 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationBhaskar Mitra
ย 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressBhaskar Mitra
ย 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackBhaskar Mitra
ย 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackBhaskar Mitra
ย 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBhaskar Mitra
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
ย 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Bhaskar Mitra
ย 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalBhaskar Mitra
ย 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information RetrievalBhaskar Mitra
ย 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalBhaskar Mitra
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
ย 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document RankingBhaskar Mitra
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
ย 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcomeBhaskar Mitra
ย 
The Duet model
The Duet modelThe Duet model
The Duet modelBhaskar Mitra
ย 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
ย 
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Bhaskar Mitra
ย 

Mehr von Bhaskar Mitra (20)

Joint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and RecommendationJoint Multisided Exposure Fairness for Search and Recommendation
Joint Multisided Exposure Fairness for Search and Recommendation
ย 
Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?Whatโ€™s next for deep learning for Search?
Whatโ€™s next for deep learning for Search?
ย 
Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...Efficient Machine Learning and Machine Learning for Efficiency in Information...
Efficient Machine Learning and Machine Learning for Efficiency in Information...
ย 
Multisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and RecommendationMultisided Exposure Fairness for Search and Recommendation
Multisided Exposure Fairness for Search and Recommendation
ย 
Neural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progressNeural Information Retrieval: In search of meaningful progress
Neural Information Retrieval: In search of meaningful progress
ย 
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning TrackConformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
Conformer-Kernel with Query Term Independence @ TREC 2020 Deep Learning Track
ย 
Duet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning TrackDuet @ TREC 2019 Deep Learning Track
Duet @ TREC 2019 Deep Learning Track
ย 
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and BeyondBenchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond
ย 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
ย 
Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)Dual Embedding Space Model (DESM)
Dual Embedding Space Model (DESM)
ย 
Adversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrievalAdversarial and reinforcement learning-based approaches to information retrieval
Adversarial and reinforcement learning-based approaches to information retrieval
ย 
5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval5 Lessons Learned from Designing Neural Models for Information Retrieval
5 Lessons Learned from Designing Neural Models for Information Retrieval
ย 
A Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information RetrievalA Simple Introduction to Neural Information Retrieval
A Simple Introduction to Neural Information Retrieval
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
ย 
Neural Models for Document Ranking
Neural Models for Document RankingNeural Models for Document Ranking
Neural Models for Document Ranking
ย 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
ย 
Neu-IR 2017: welcome
Neu-IR 2017: welcomeNeu-IR 2017: welcome
Neu-IR 2017: welcome
ย 
The Duet model
The Duet modelThe Duet model
The Duet model
ย 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
ย 
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
Query Expansion with Locally-Trained Word Embeddings (ACL 2016)
ย 

Kรผrzlich hochgeladen

Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedDelhi Call girls
ย 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
ย 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxBhagirath Gogikar
ย 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
ย 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
ย 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
ย 
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘Damini Dixit
ย 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryAlex Henderson
ย 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .Poonam Aher Patil
ย 
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….Nitya salvi
ย 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
ย 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
ย 
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRL
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRLKochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRL
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRLkantirani197
ย 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Silpa
ย 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
ย 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
ย 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)AkefAfaneh2
ย 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformationAreesha Ahmad
ย 
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...chandars293
ย 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
ย 

Kรผrzlich hochgeladen (20)

Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verifiedConnaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
Connaught Place, Delhi Call girls :8448380779 Model Escorts | 100% verified
ย 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
ย 
Introduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptxIntroduction,importance and scope of horticulture.pptx
Introduction,importance and scope of horticulture.pptx
ย 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
ย 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
ย 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
ย 
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘
High Profile ๐Ÿ” 8250077686 ๐Ÿ“ž Call Girls Service in GTB Nagar๐Ÿ‘
ย 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
ย 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
ย 
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….
โคJammu Kashmir Call Girls 8617697112 Personal Whatsapp Number ๐Ÿ’ฆโœ….
ย 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ย 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
ย 
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRL
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRLKochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRL
Kochi โคCALL GIRL 84099*07087 โคCALL GIRLS IN Kochi ESCORT SERVICEโคCALL GIRL
ย 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
ย 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
ย 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
ย 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
ย 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
ย 
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad โ‚น7.5k Pick Up & Drop With Cash Payment 969456...
ย 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
ย 

Neural Learning to Rank

  • 1. Neural Learning to Rank Bhaskar Mitra Principal Applied Scientist, Microsoft PhD candidate, University College London @UnderdogGeek
  • 2. Topics A quick recap of neural networks The fundamentals of learning to rank
  • 3. Reading material An Introduction to Neural Information Retrieval Foundations and Trendsยฎ in Information Retrieval (December 2018) Download PDF: http://bit.ly/fntir-neural
  • 4. Most information retrieval (IR) systems present a ranked list of retrieved artifacts
  • 5. Learning to Rank (LTR) โ€... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.โ€ - Liu [2009] Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009. Image source: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45530.pdf
  • 6.
  • 7. A quick recap of neural networks
  • 8. Vectors, matrices, and tensors Image source: https://dev.to/mmithrakumar/scalars-vectors-matrices-and-tensors-with-tensorflow-2-0-1f66 Image source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/ matrix transpose matrix addition dot product matrix multiplication
  • 9.
  • 10. Supervised learning Image source: https://www.intechopen.com/books/artificial-neural-networks-architectures-and-applications/applying-artificial-neural-network-hadron-hadron-collisions-at-lhc
  • 11. Neural networks Chains of parameterized linear transforms (e.g., multiply weight, add bias) followed by non-linear functions (ฯƒ) Popular choices for ฯƒ: Parameters trained using backpropagation E2E training over millions of samples in batched mode Many choices of architecture and hyper-parameters Non-linearity Input Linear transform Non-linearity Linear transform Predicted output forwardpass backwardpass Expected output loss Tanh ReLU
  • 13. Squared loss The squared loss is a popular loss function for regression tasks
  • 14. The softmax function In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes
  • 15. Cross entropy The cross entropy between two probability distributions ๐‘ and ๐‘ž over a discrete set of events is given by, If ๐‘ ๐‘๐‘œ๐‘Ÿ๐‘Ÿ๐‘’๐‘๐‘ก = 1and ๐‘๐‘– = 0 for all other values of ๐‘– then,
  • 16. Cross entropy with softmax loss Cross entropy with softmax is a popular loss function for classification
  • 17. We are given training data: < ๐‘ฅ, ๐‘ฆ > pairs, where ๐‘ฅ is input and ๐‘ฆ is expected output Step 1: Define model and randomly initialize learnable model parameters Step 2: Given ๐‘ฅ, compute model output Step 3: Given model output and ๐‘ฆ, compute loss ๐‘™ Step 4: Compute gradient ๐œ•๐‘™ ๐œ•๐‘ค of loss ๐‘™ w.r.t. each parameter ๐‘ค Step 5: Update parameter as ๐‘ค ๐‘›๐‘’๐‘ค = ๐‘ค ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค , where ๐œ‚ is learning rate Step 6: Go back to step 2 and repeat till convergence Gradient Descent
  • 18. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = ๐œ•๐‘™ ๐œ•๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 19. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = ๐œ• ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐œ•๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 20. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— ๐œ•๐‘ฆ2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 21. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— ๐œ•๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 ๐œ•๐‘ฆ1 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 22. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค2. ๐‘ฅ + ๐‘2 ร— ๐‘ค2 ร— ๐œ•๐‘ฆ1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 23. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค2. ๐‘ฅ + ๐‘2 ร— ๐‘ค2 ร— ๐œ•๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐œ•๐‘ค1 Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
  • 24. Goal: iteratively update the learnable parameters such that the loss ๐‘™ is minimized Compute the gradient of the loss ๐‘™ w.r.t. each parameter (e.g., ๐‘ค1) ๐œ•๐‘™ ๐œ•๐‘ค1 = โˆ’2 ร— ๐‘ฆ โˆ’ ๐‘ฆ2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค2. ๐‘ฅ + ๐‘2 ร— ๐‘ค2 ร— 1 โˆ’ ๐‘ก๐‘Ž๐‘›โ„Ž2 ๐‘ค1. ๐‘ฅ + ๐‘1 ร— ๐‘ฅ Update the parameter value based on the gradient with ๐œ‚ as the learning rate ๐‘ค1 ๐‘›๐‘’๐‘ค = ๐‘ค1 ๐‘œ๐‘™๐‘‘ โˆ’ ๐œ‚ ร— ๐œ•๐‘™ ๐œ•๐‘ค1 Gradient Descent Task: regression Training data: ๐‘ฅ, ๐‘ฆ pairs Model: NN (1 feature, 1 hidden layer, 1 hidden node) Learnable parameters: ๐‘ค1, ๐‘1, ๐‘ค2, ๐‘2 ๐‘ฅ ๐‘ฆ1 ๐‘ฆ2 ๐‘™ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค1. ๐‘ฅ + ๐‘1 ๐‘ฆ โˆ’ ๐‘ฆ2 2 ๐‘ฆ ๐‘ก๐‘Ž๐‘›โ„Ž ๐‘ค2. ๐‘ฆ1 + ๐‘2 โ€ฆand repeat
• 25. Exercise: Simple Neural Network from Scratch. Implement a simple multi-layer neural network with a single input feature, a single output, and a single neuron per layer using (i) PyTorch and (ii) from scratch, and demonstrate that both approaches produce identical outcomes. https://github.com/spacemanidol/AFIRMDeepLearning2020/blob/master/NNPrimer.ipynb
• 26. Computation Networks. The "Lego" approach to specifying neural architectures: a library of neural layers, where each layer defines logic for:
1. Forward pass: compute layer output given layer input
2. Backward pass: (a) compute the gradient of the layer output w.r.t. the layer inputs, and (b) compute the gradient of the layer output w.r.t. the layer parameters (if any)
Chain nodes to create bigger and more complex networks (a sketch follows below).
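A minimal sketch of this idea (hypothetical layer classes in plain Python/NumPy): each node exposes a forward and a backward method, and nodes are chained to form the same 1-1-1 network used above.

```python
import numpy as np

class Linear:
    """Scalar w*x + b; backward returns dl/dx and stores dl/dw, dl/db."""
    def __init__(self, w, b):
        self.w, self.b = w, b
    def forward(self, x):
        self.x = x                       # cache input for the backward pass
        return self.w * x + self.b
    def backward(self, dl_dout):
        self.dl_dw = dl_dout * self.x    # gradient w.r.t. layer parameters
        self.dl_db = dl_dout
        return dl_dout * self.w          # gradient w.r.t. layer input

class Tanh:
    def forward(self, x):
        self.out = np.tanh(x)
        return self.out
    def backward(self, dl_dout):
        return dl_dout * (1 - self.out ** 2)

# Chain the nodes and backpropagate through the chain
layers = [Linear(0.5, 0.1), Tanh(), Linear(-0.4, 0.2), Tanh()]
x, y = 0.7, 0.3
out = x
for layer in layers:
    out = layer.forward(out)
grad = -2 * (y - out)                    # dl/dy2 for the squared loss
for layer in reversed(layers):
    grad = layer.backward(grad)
```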
  • 27. Why adding depth helps http://playground.tensorflow.org
• 29. Bias-variance trade-off in the deep learning era. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. In PNAS, 2019.
  • 30. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What's Hidden in a Randomly Weighted Neural Network? In ArXiv, 2019. The lottery ticket hypothesis
• 33. Problem formulation. LTR models represent a rankable item (e.g., a document, a movie, or a song), given some context (e.g., a user-issued query or the user's historical interactions with other items), as a numerical vector x ∈ ℝ^n. The ranking model f: x → ℝ is trained to map the vector to a real-valued score such that relevant items are scored higher. A minimal scorer of this shape is sketched below.
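As a concrete (hypothetical) instance, f can be any differentiable model that maps the feature vector to a score; a minimal PyTorch sketch, which the later loss sketches implicitly assume:

```python
import torch
import torch.nn as nn

class Scorer(nn.Module):
    """Hypothetical ranking model f: R^n -> R (a small feed-forward scorer)."""
    def __init__(self, n_features, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):                # x: [batch, n_features]
        return self.net(x).squeeze(-1)   # one real-valued score per item

f = Scorer(n_features=10)
scores = f(torch.randn(5, 10))           # scores for 5 candidate items
```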
• 34. Why is ranking challenging? Examples of ranking metrics: Discounted Cumulative Gain (DCG): DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1). Reciprocal Rank (RR): RR@k = max_{1≤i≤k} rel_i / i. Rank-based metrics, such as DCG and MRR, are non-smooth / non-differentiable.
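A small sketch of these metrics (NumPy, with made-up relevance grades) makes the difficulty concrete: both depend on the sorted order of the documents rather than smoothly on the scores.

```python
import numpy as np

def dcg_at_k(rels, k):
    """DCG@k = sum_{i=1..k} (2^rel_i - 1) / log2(i + 1), rels given in ranked order."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))   # log2(i + 1) for i = 1..k
    return float(np.sum((2.0 ** rels - 1) / discounts))

def rr_at_k(rels, k):
    """Reciprocal rank of the first relevant item within the top k (0 if none)."""
    for i, rel in enumerate(rels[:k], start=1):
        if rel > 0:
            return 1.0 / i
    return 0.0

ranked_rels = [3, 0, 2, 0, 1]           # relevance of documents at ranks 1..5
print(dcg_at_k(ranked_rels, k=5), rr_at_k(ranked_rels, k=5))
```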
• 35. Features. Traditional L2R models employ hand-crafted features that encode IR insights. They can often be categorized as:
• Query-independent or static features, e.g., incoming link count and document length
• Query-dependent or dynamic features, e.g., BM25
• Query-level features, e.g., query length
  • 36. Features Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval, Information Retrieval Journal, 2010
• 37. Approaches. Liu [2009] categorizes different LTR approaches based on their training objectives:
• Pointwise approach: the relevance label y_{q,d} is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}.
• Pairwise approach: a pairwise preference between documents for a query (d_i ≻ d_j w.r.t. q) is the label. Reduces to binary classification to predict the more relevant document.
• Listwise approach: directly optimize for a rank-based metric, such as NDCG; difficult because these metrics are often not differentiable w.r.t. model parameters.
Tie-Yan Liu. Learning to rank for information retrieval. Foundation and Trends in Information Retrieval, 2009.
• 38. Pointwise objectives. Regression loss: given ⟨q, d⟩, predict the value of y_{q,d}, e.g., squared loss ‖y_{q,d} − f(x_{q,d})‖² for binary or categorical labels, where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label. Norbert Fuhr. Optimum polynomial retrieval functions based on the probability ranking principle. ACM TOIS, 1989. David Cossock and Tong Zhang. Subset ranking using regression. In COLT, 2006.
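A minimal sketch of a pointwise regression objective (PyTorch; the scores stand in for f(x_{q,d}) and the labels for graded judgments, both made up):

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.7, 0.3, 1.1], requires_grad=True)  # f(x_{q,d}) for three (q, d) pairs
labels = torch.tensor([3.0, 0.0, 1.0])                      # graded relevance labels y_{q,d}
loss = F.mse_loss(scores, labels)                           # squared loss, averaged over pairs
loss.backward()
```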
• 39. Pointwise objectives. Classification loss: given ⟨q, d⟩, predict the class y_{q,d}, e.g., cross-entropy with softmax over categorical labels Y [Li et al., 2008]: L = −log( e^{s_{y_{q,d}}} / Σ_{y∈Y} e^{s_y} ), where s_{y_{q,d}} is the model's score for label y_{q,d}. Ping Li, Qiang Wu, and Christopher J Burges. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2008.
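And a pointwise classification sketch (PyTorch): each relevance grade is treated as a class, and cross-entropy with softmax is computed over the per-class scores (values are made up):

```python
import torch
import torch.nn.functional as F

class_scores = torch.tensor([[0.2, 1.5, 3.0],    # s_y for grades {0, 1, 2}, one row per (q, d)
                             [2.0, 0.1, -1.0]], requires_grad=True)
labels = torch.tensor([2, 0])                    # true grade y_{q,d} for each pair
loss = F.cross_entropy(class_scores, labels)     # -log softmax(class_scores)[label]
loss.backward()
```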
• 40. Pairwise objectives. Given ⟨q, d_i, d_j⟩, predict the more relevant document. For ⟨q, d_i⟩ and ⟨q, d_j⟩: feature vectors x_i and x_j; model scores s_i = f(x_i) and s_j = f(x_j). Pairwise loss minimizes the average number of inversions in ranking, i.e., cases where d_i ≻ d_j w.r.t. q but d_j is ranked higher than d_i. Pairwise loss generally has the form φ(s_i − s_j) [Chen et al., 2009], where φ can be:
• Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
• Exponential function φ(z) = e^{−z} [Freund et al., 2003]
• Logistic function φ(z) = log(1 + e^{−z}) [Burges et al., 2005]
• Others…
(A sketch follows the references below.) Wei Chen, Tie-Yan Liu, Yanyan Lan, Zhi-Ming Ma, and Hang Li. Ranking measures and loss functions in learning to rank. In NIPS, 2009. Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. 2000. Yoav Freund, Raj Iyer, Robert E Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. In JMLR, 2003. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005.
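A sketch of the three φ choices applied to the score difference z = s_i − s_j, where d_i is the preferred document (values made up):

```python
import torch

def pairwise_loss(s_i, s_j, phi="logistic"):
    """phi(s_i - s_j) for a preference d_i > d_j."""
    z = s_i - s_j
    if phi == "hinge":          # Herbrich et al. [2000]
        return torch.clamp(1 - z, min=0)
    if phi == "exponential":    # Freund et al. [2003]
        return torch.exp(-z)
    if phi == "logistic":       # Burges et al. [2005]
        return torch.log1p(torch.exp(-z))
    raise ValueError(phi)

s_i, s_j = torch.tensor(1.2), torch.tensor(0.4)
print({name: round(pairwise_loss(s_i, s_j, name).item(), 4)
       for name in ("hinge", "exponential", "logistic")})
```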
• 41. Pairwise objectives. RankNet loss: pairwise loss function proposed by Burges et al. [2005], an industry favourite [Burges, 2015]. Predicted probabilities: p_ij = p(s_i > s_j) ≡ e^{γ·s_i} / (e^{γ·s_i} + e^{γ·s_j}) = 1 / (1 + e^{−γ(s_i − s_j)}). Desired probabilities: p̄_ij = 1 and p̄_ji = 0. Computing the cross entropy between p̄ and p: L_RankNet = −p̄_ij·log(p_ij) − p̄_ji·log(p_ji) = −log(p_ij) = log(1 + e^{−γ(s_i − s_j)}). Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In ICML, 2005. Chris Burges. RankNet: A ranking retrospective. https://www.microsoft.com/en-us/research/blog/ranknet-a-ranking-retrospective/. 2015.
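A minimal RankNet-loss sketch (PyTorch), using softplus for a numerically stable log(1 + e^{−z}); s_i are the scores of the preferred documents and the batch values are made up:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(s_i, s_j, gamma=1.0):
    """log(1 + exp(-gamma * (s_i - s_j))): cross-entropy with desired p_ij = 1."""
    return F.softplus(-gamma * (s_i - s_j))

s_i = torch.tensor([2.0, 0.5], requires_grad=True)   # scores of the more relevant docs
s_j = torch.tensor([1.0, 1.5], requires_grad=True)   # scores of the less relevant docs
loss = ranknet_loss(s_i, s_j).mean()
loss.backward()                                      # gradients push s_i up and s_j down
```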
• 42. A generalized cross-entropy loss. An alternative loss function assumes a single relevant document d+ and compares it against the full collection D. Predicted probability: p(d+|q) = e^{γ·s(q,d+)} / Σ_{d∈D} e^{γ·s(q,d)}. The cross-entropy loss is then given by: L_CE(q, d+, D) = −log p(d+|q) = −log( e^{γ·s(q,d+)} / Σ_{d∈D} e^{γ·s(q,d)} ). Computing the softmax over the full collection is prohibitively expensive, so LTR models typically consider only a few negative candidates [Huang et al., 2013, Shen et al., 2014, Mitra et al., 2017]. Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013. Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Gregoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In CIKM, 2014. Bhaskar Mitra, Fernando Diaz, and Nick Craswell. Learning to match using local and distributed representations of text for web search. In WWW, 2017.
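A sketch of the sampled variant (PyTorch): the softmax runs over the relevant document plus a few sampled negatives rather than the full collection; all scores are made up.

```python
import torch
import torch.nn.functional as F

def ce_loss_with_negatives(s_pos, s_negs, gamma=1.0):
    """-log softmax over {d+} and sampled negatives; s_pos: [batch], s_negs: [batch, n_neg]."""
    scores = gamma * torch.cat([s_pos.unsqueeze(1), s_negs], dim=1)
    target = torch.zeros(scores.size(0), dtype=torch.long)   # the positive sits at index 0
    return F.cross_entropy(scores, target)

s_pos = torch.tensor([3.1, 1.7], requires_grad=True)         # s(q, d+) for two queries
s_negs = torch.tensor([[1.0, 0.2, -0.5],                      # s(q, d-) for sampled negatives
                       [2.0, 1.5, 0.3]], requires_grad=True)
loss = ce_loss_with_negatives(s_pos, s_negs)
loss.backward()
```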
• 43. Listwise objectives [Burges, 2010]. [Figure: two example rankings; blue = relevant, gray = non-relevant. NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors.] Due to strong position-based discounting in IR measures, errors at higher ranks are much more problematic than at lower ranks. But listwise metrics are non-continuous and non-differentiable. Christopher JC Burges. From RankNet to LambdaRank to LambdaMART: An overview. Learning, 2010.
• 44. Listwise objectives. Burges et al. [2006] make two observations: 1. To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. model scores). 2. It is desirable that the gradient be bigger for pairs of documents whose swap produces a bigger change in NDCG. LambdaRank loss: multiply the actual (RankNet) gradients by the change in NDCG obtained by swapping the rank positions of the two documents (a sketch follows the reference). Christopher JC Burges, Robert Ragno, and Quoc Viet Le. Learning to rank with nonsmooth cost functions. In NIPS, 2006.
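A sketch of the resulting pair weight (PyTorch; this shows only the per-pair λ computation under made-up scores and grades, not the full training loop): the RankNet gradient magnitude for a pair is scaled by |ΔNDCG| of swapping the pair's positions.

```python
import torch

rels   = torch.tensor([2.0, 0.0, 1.0])     # relevance grades in current ranked order
scores = torch.tensor([1.5, 2.0, 0.3])     # model scores in the same order
ranks  = torch.arange(1.0, 4.0)            # rank positions 1..n

gains     = 2.0 ** rels - 1
discounts = torch.log2(ranks + 1)
ideal_dcg = (gains.sort(descending=True).values / discounts).sum()

def lambda_weight(hi, lo):
    """Pair weight where position `hi` holds the more relevant document."""
    ranknet_grad = torch.sigmoid(-(scores[hi] - scores[lo]))           # |dL_RankNet/ds|
    delta_ndcg = torch.abs((gains[hi] - gains[lo])
                           * (1 / discounts[hi] - 1 / discounts[lo])) / ideal_dcg
    return ranknet_grad * delta_ndcg

print(lambda_weight(0, 1))   # the doc at rank 1 is more relevant than the doc at rank 2
```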
• 45. Listwise objectives. According to the Luce model [Luce, 1959], given four items {d1, d2, d3, d4}, the probability of observing a particular rank-order, say ⟨d2, d1, d4, d3⟩, is given by: P(π|s) = Π_i φ(s_{π(i)}) / Σ_{j≥i} φ(s_{π(j)}), where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i. ListNet loss: Cao et al. [2007] propose to compute the probability distribution over all possible permutations based on model scores and ground-truth labels. The loss is then given by the KL divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it somewhat less prohibitive. ListMLE loss: Xia et al. [2008] propose to compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one ideal permutation may be possible. (A ListMLE sketch follows the references.) R Duncan Luce. Individual choice behavior. 1959. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. In ICML, 2007. Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. Listwise approach to learning to rank: theory and algorithm. In ICML, 2008.
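A ListMLE sketch (PyTorch) under the Luce model with φ = exp: the loss is the negative log-probability of the permutation induced by the ground-truth labels (ties broken arbitrarily by the sort); scores and labels are made up.

```python
import torch

def listmle_loss(scores, labels):
    """-log P(ideal permutation | scores) with phi = exp:
    P(pi) = prod_i exp(s_pi(i)) / sum_{j >= i} exp(s_pi(j))."""
    order = torch.argsort(labels, descending=True)                # ideal permutation from labels
    s = scores[order]
    log_tail_sums = torch.logcumsumexp(s.flip(0), dim=0).flip(0)  # log sum_{j >= i} exp(s_j)
    return (log_tail_sums - s).sum()

scores = torch.tensor([0.3, 1.2, -0.4], requires_grad=True)   # f(x) for three documents
labels = torch.tensor([1.0, 2.0, 0.0])                        # graded relevance
loss = listmle_loss(scores, labels)
loss.backward()
```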
• 46. Listwise objectives. Smooth DCG: Wu et al. [2009] compute a "smooth" rank of documents as a function of their scores. This "smooth" rank can be plugged into a ranking metric, such as MRR or DCG, to produce a smooth ranking loss. Mingrui Wu, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. Smoothing DCG for learning to rank: A novel approach using smoothed hinge functions. In CIKM, 2009.
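One common way to obtain such a smooth rank (a sketch of the general soft-rank idea, not necessarily the exact smoothing used by Wu et al.) is to replace the indicator "document j is scored above document i" with a sigmoid, and plug the resulting soft ranks into DCG:

```python
import torch

def smooth_rank(scores, temperature=1.0):
    """rank_i ~ 1 + sum_{j != i} sigmoid((s_j - s_i) / T), differentiable w.r.t. scores."""
    n = scores.numel()
    pairwise = (scores.unsqueeze(0) - scores.unsqueeze(1)) / temperature  # [i, j] = s_j - s_i
    probs = torch.sigmoid(pairwise) * (1 - torch.eye(n))                  # drop i == j terms
    return 1 + probs.sum(dim=1)

scores = torch.tensor([2.0, 0.5, 1.0], requires_grad=True)
rels   = torch.tensor([2.0, 0.0, 1.0])
ranks  = smooth_rank(scores, temperature=0.1)            # ~ [1, 3, 2] for well-separated scores
smooth_dcg = ((2.0 ** rels - 1) / torch.log2(ranks + 1)).sum()
(-smooth_dcg).backward()                                 # maximizing DCG = minimizing its negative
```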