2. Joint work with Katja Hofmann
and Shimon Whiteson
3. Growing complexity of search engines
Current methods for optimizing rankers mostly work offline
4. Online learning to rank
No distinction between training and operating
Search engine observes users’ natural interactions
with the search interface, infers information from
them, and improves its ranking function
automatically
Expensive data collection not required; the collected
data matches target users and target setting
5. Users’ natural interactions with the search interface
Behavior category (refers to the purpose of the observed behavior) by minimum scope (refers to the smallest possible scope of the item being acted upon: segment, object, class):
Examine: View, Listen, Scroll, Find, Query (segment); Select (object); Browse (class)
Retain: Print (segment); Bookmark, Save, Delete, Purchase, Email (object); Subscribe (class)
Reference: Copy-and-paste, Quote (segment); Forward, Reply, Link, Cite (object)
Annotate: Mark up (segment); Rate, Publish (object); Organize (class)
Create: Type, Edit (segment); Author (object)
Oard and Kim, 2001; Kelly and Teevan, 2004
6. Users’ interactions
Relevance feedback
History goes back close to forty years
Typically used for query expansion, user profiling
Explicit feedback
Users explicitly give feedback
Keywords, selecting or marking documents,
answering questions
Natural explicit feedback can be difficult to obtain
“Unnatural” explicit feedback through TREC
assessors and crowd sourcing
7. Users’ interactions (2)
Implicit feedback for learning, query expansion and
user profiling
Observe users’ natural interactions with system
Reading time, saving, printing, bookmarking,
selecting, clicking, …
Thought to be less accurate than explicit
measures
Available in very large quantities at no cost
8. Learning to rank online
Using online learning to rank approaches, retrieval
systems can learn directly from implicit feedback,
while they are running
Algorithms need to explore new solutions to obtain
feedback for effective learning and exploit what has
been learned to produce results acceptable to users
Interleaved comparison methods can use implicit
feedback to detect small differences between
rankers and can be used to learn ranking functions
online
9. Agenda
Balancing exploration and exploitation
Inferring preferences from clicks
10. Recent work
Balancing Exploitation
and Exploration
K. Hofmann et al. (2011), Balancing exploration and exploitation. In:
ECIR ’11.
11. Challenges
Generalize over queries and documents
Learn from implicit feedback that is …
noisy
relative
rank-biased
Keep users happy while learning
12. Learning document pair-wise preferences
Insight: infer preferences from clicks
(Figure: example search result list for the query “Vienna”)
Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD '02, pages 133-142.
13. Learning document pair-wise preferences
Input: feature vectors constructed from document pairs: (x(q, di), x(q, dj)) ∈ R^n × R^n
Output: y ∈ {−1, +1} correct / incorrect order
Learning method: supervised learning, e.g., SVM
Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD '02, pages 133-142.
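To make the set-up concrete, here is a small sketch of the pairwise idea, reduced to training a linear classifier on feature-difference vectors (a common simplification of the Ranking SVM formulation). The toy features, the inferred preferences, and the use of scikit-learn's LinearSVC are illustrative assumptions, not the original pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_examples(features, prefs):
    """Turn click-inferred preferences (doc i preferred over doc j) into
    labelled difference vectors, as in the pairwise / Ranking SVM reduction."""
    X, y = [], []
    for i, j in prefs:
        X.append(features[i] - features[j]); y.append(+1)   # correct order
        X.append(features[j] - features[i]); y.append(-1)   # incorrect order
    return np.array(X), np.array(y)

# toy data: 4 documents for one query, 3 features each; clicks suggest d1, d2 beat d0
features = np.random.rand(4, 3)
prefs = [(1, 0), (2, 0)]
X, y = pairwise_examples(features, prefs)
model = LinearSVC().fit(X, y)                 # learns a weight vector w
scores = features @ model.coef_.ravel()       # rank documents by w . x(q, d)
print(np.argsort(-scores))                    # document indices, best first
```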
14. Challenges
Generalize over queries and documents
Learn from implicit feedback that is …
noisy
relative
rank-biased
Keep users happy while learning
15. Dueling bandit gradient descent
Learns a ranking function, a weight vector w for a linear combination of feature vectors, from feedback about the relative quality of rankings
Outcome: ranking score S = w · x(q, d)
Approach
Maintain a current “best” ranking function w
On each incoming query:
Generate a new candidate ranking function w′
Compare the candidate to the current “best”
If the candidate is better, update the “best” ranking function
(Figure: current best w and candidate w′ as points in weight space, axes x1 and x2)
Yue, Y. and Joachims, T. (2009). Interactively optimizing information
retrieval systems as a dueling bandits problem. In ICML '09.
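A minimal sketch of one DBGD update may help make the loop concrete. The `compare` callback stands in for the click-based interleaved comparison of the two rankings, and the step sizes `delta` and `alpha` as well as the toy usage are illustrative, not the values used in the experiments.

```python
import numpy as np

def dbgd_step(w, compare, delta=1.0, alpha=0.01):
    """One Dueling Bandit Gradient Descent update on the incoming query.

    w       -- weight vector of the current "best" linear ranker, S = w . x(q, d)
    compare -- callback that interleaves the two rankings for the current query
               and returns True if the candidate wins the click-based comparison
    """
    u = np.random.randn(len(w))
    u /= np.linalg.norm(u)            # random direction in weight space
    w_candidate = w + delta * u       # exploratory candidate ranker
    if compare(w, w_candidate):       # inferred from clicks on an interleaved list
        w = w + alpha * u             # take a small step towards the winner
    return w

# toy usage: pretend the candidate wins whenever it is closer to a hidden optimum
optimum = np.array([0.3, -0.2, 0.9])
w = np.zeros(3)
for _ in range(500):
    w = dbgd_step(w, lambda a, b: np.linalg.norm(b - optimum) < np.linalg.norm(a - optimum))
print(w)
```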
16. Challenges
Generalize over queries and documents
Learn from implicit feedback that is …
noisy
relative
rank-biased
Keep users happy while learning
17. Exploration and exploitation
Exploration: need to learn effectively from rank-biased feedback
Exploitation: need to present high-quality results while learning
Previous approaches are either purely exploratory or purely exploitative
18. Questions
Can we improve online performance by balancing
exploration and exploitation?
How much exploration is needed for effective
learning?
19. Problem formulation
Reinforcement learning
No explicit labels
Learn from feedback from the environment in
response to actions (document lists)
Contextual bandit problem
(Figure: the retrieval system tries something by presenting documents to the environment, i.e., the user, and gets feedback in the form of clicks)
20. Our method
Learning based on Dueling Bandit Gradient Descent
Relative evaluations of quality of two document
lists
Infers such comparisons from implicit feedback
Balance exploration and exploitation with k-greedy
comparison of document lists
21. k-greedy exploration
To compare document lists, interleave them
An exploration rate k influences the relative number of documents from each list
(Figure: an interleaved result list with exploration rate k = 0.5; the blue ranker wins the comparison)
22. k-greedy exploration
(Figure: interleaved result lists for exploration rate k = 0.5 and k = 0.2)
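A rough sketch of how an exploration rate k can steer interleaving: at each rank the exploratory list contributes the next document with probability k, otherwise the exploitative list does. This is an illustrative simplification, not the paper's exact procedure, and the list contents are toy data.

```python
import random

def k_greedy_interleave(exploit, explore, k=0.5, length=10):
    """Interleave two rankings; lower k shows users more documents from the
    current best (exploitative) ranker. Duplicates are skipped."""
    interleaved, origins = [], []
    i = j = 0
    while len(interleaved) < length and (i < len(exploit) or j < len(explore)):
        use_explore = (random.random() < k and j < len(explore)) or i >= len(exploit)
        doc = explore[j] if use_explore else exploit[i]
        if use_explore:
            j += 1
        else:
            i += 1
        if doc not in interleaved:
            interleaved.append(doc)
            origins.append("explore" if use_explore else "exploit")
    return interleaved, origins

ranking_a = ["d1", "d2", "d3", "d4"]      # exploitative (current best)
ranking_b = ["d2", "d3", "d4", "d1"]      # exploratory (candidate)
print(k_greedy_interleave(ranking_a, ranking_b, k=0.2, length=4))
```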
23. Evaluation
Simulated interactions
We need to
observe clicks on arbitrary result lists
measure online performance
Simulate clicks and measure online performance
Probabilistic click model: assume a dependent click model and define click and stop probabilities based on standard learning to rank data sets
Measure cumulative reward of the rankings
displayed to the user
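The click simulation can be sketched as a simple top-down scan of the result list; the probabilities below are the "navigational" instantiation shown later in the talk, and the function and data are illustrative.

```python
import random

# P(click | relevant) / P(click | non-relevant), and P(stop | clicked relevant / non-relevant)
P_CLICK = {True: 0.95, False: 0.05}
P_STOP  = {True: 0.90, False: 0.20}

def simulate_clicks(result_list, relevant):
    """Dependent-click-model style simulation: examine documents top-down,
    click depending on relevance, and possibly stop after a click."""
    clicks = []
    for doc in result_list:
        rel = doc in relevant
        if random.random() < P_CLICK[rel]:
            clicks.append(doc)
            if random.random() < P_STOP[rel]:
                break
    return clicks

print(simulate_clicks(["d1", "d2", "d3", "d4"], relevant={"d2", "d4"}))
```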
24. Experiments
Vary exploration rate k
Three click models
“perfect”
“navigational”
“informational”
Evaluate on nine data sets (LETOR 3.0 and 4.0)
25. “Perfect” click model
Click model: P(c|R) = 1.0, P(c|NR) = 0.0, P(s|R) = 0.0, P(s|NR) = 0.0
Provides an upper bound
(Figure: final performance over time, 0 to 1000 queries, for data set NP2003 and the perfect click model)
26. “Perfect” online performance

          k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003     119.91    125.71    129.99    130.55    128.50
HP2004     109.21    111.57    118.54    119.86    116.46
NP2003     108.74    113.61    117.44    120.46    119.06
NP2004     112.33    119.34    124.47    126.20    123.70
TD2003      82.00     84.24     88.20     89.36     86.20
TD2004      85.67     90.23     91.71     91.00     88.98
OHSUMED    128.12    130.40    131.16    133.37    131.93
MQ2007      96.02     97.48     98.54    100.28     98.32
MQ2008      90.97     92.99     94.03     95.59     95.14

Best performance with only two exploratory documents for top-10 results
Darker shades indicate higher performance; dark borders indicate significant improvements over the k = 0.5 baseline
27. “Navigational” click model
Click model: P(c|R) = 0.95, P(c|NR) = 0.05, P(s|R) = 0.9, P(s|NR) = 0.2
Simulate realistic but reliable interaction
(Figure: final performance over time, 0 to 1000 queries, for data set NP2003 and the navigational click model)
28. “Navigational” online performance

          k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003     102.58    109.78    118.84    116.38    117.52
HP2004      89.61     97.08     99.03    103.36    105.69
NP2003      90.32    100.94    105.03    108.15    110.12
NP2004      99.14    104.34    110.16    112.05    116.00
TD2003      70.93     75.20     77.64     77.54     75.70
TD2004      78.83     80.17     82.40     83.54     80.98
OHSUMED    125.35    126.92    127.37    127.94    127.21
MQ2007      95.50     94.99     95.70     96.02     94.94
MQ2008      89.39     90.55     91.24     92.36     92.25

Best performance with little exploration and lots of exploitation
Darker shades indicate higher performance; dark borders indicate significant improvements over the k = 0.5 baseline
29. “Informational” click model
Click model: P(c|R) = 0.9, P(c|NR) = 0.4, P(s|R) = 0.5, P(s|NR) = 0.1
Simulate very noisy interaction
(Figure: final performance over time, 0 to 1000 queries, for data set NP2003 and the informational click model, shown for k = 0.5, k = 0.2, and k = 0.1)
30. “Informational” online performance

          k = 0.5   k = 0.4   k = 0.3   k = 0.2   k = 0.1
HP2003      59.53     63.91     61.43     70.11     71.19
HP2004      41.12     52.88     55.88     58.40     63.23
NP2003      53.63     53.64     57.60     69.90     63.38
NP2004      60.59     64.17     69.96     48.54     55.16
TD2003      52.78     52.95     57.30     59.75     55.76
TD2004      58.49     61.43     62.88     63.37     51.58
OHSUMED    121.39    123.26    124.01    125.40    126.76
MQ2007      91.57     92.00     91.66     90.79     90.19
MQ2008      86.06     87.26     85.83     87.62     86.29

Highest improvements with low exploration rates: interaction between noise and data set
Darker shades indicate higher performance; dark borders indicate significant improvements over the k = 0.5 baseline
31. Summary
What?
Developed first method for balancing exploration and
exploitation in online learning to rank
Devised experimental framework for simulating user
interactions and measuring online performance
And so?
Balancing exploration and exploitation improves online
performance for all click models and all data sets
Best results are achieved with two exploratory documents per result list
32. What’s next here?
Validate simulation assumptions
Evaluate on click logs
Develop new algorithms for online learning to rank
for IR that can balance exploration and exploitation
33. Ongoing work
Inferring Preferences from Clicks
34. Interleaved ranker comparison methods
Use implicit feedback (“clicks”), not to infer absolute
judgments, but to compare two rankers by observing
clicks on an interleaved result list
Interleave two ranked lists (“outputs of two rankers”)
Use click data to detect even very small differences
between rankers
Examine three existing methods for interleaving,
identify issues with them and propose a new one
35. Three methods (1)
Balanced interleave method
Interleaved list is generated for each query based
on the two rankers
User’s clicks on interleaved list are attributed to
each ranker based on how they ranked the clicked
docs
Ranker that obtains more clicks is deemed
superior
Joachims, Evaluating retrieval performance using clickthrough data. In: Text Mining, 2003
36. 1) Interleaving, 2) Comparison
List l1: d1, d2, d3, d4. List l2: d2, d3, d4, d1.
Two possible interleaved lists l: (d1, d2, d3, d4) and (d2, d1, d3, d4)
Observed clicks c: on d2 and d4 in the first interleaved list; on d1 and d4 in the second
First comparison: k = min(4, 3) = 3; click counts c1 = 1, c2 = 2
Second comparison: k = min(4, 4) = 4; click counts c1 = 2, c2 = 2
l2 wins the first comparison, and the lists tie for the second. In expectation l2 wins.
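The comparison in this example can be sketched as follows; the sketch assumes every clicked document occurs in both original lists and uses 0-based ranks internally.

```python
def balanced_compare(l1, l2, clicks):
    """Balanced-interleave scoring as in the example: k is the smallest cut-off
    such that one of the original lists covers all clicked documents in its
    top k; each list is then credited with the clicked documents in its top k."""
    k = min(max(l1.index(d) for d in clicks),
            max(l2.index(d) for d in clicks)) + 1
    c1 = sum(1 for d in clicks if d in l1[:k])
    c2 = sum(1 for d in clicks if d in l2[:k])
    return "l1 wins" if c1 > c2 else "l2 wins" if c2 > c1 else "tie"

l1 = ["d1", "d2", "d3", "d4"]
l2 = ["d2", "d3", "d4", "d1"]
print(balanced_compare(l1, l2, clicks=["d2", "d4"]))   # first interleaved list:  l2 wins (k = 3)
print(balanced_compare(l1, l2, clicks=["d1", "d4"]))   # second interleaved list: tie     (k = 4)
```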
37. Three methods (2)
Team draft method
Create an interleaved list following the model of
“team captains” selecting their team from a set of
players
For each pair of documents to be placed in the
interleaved list, a coin flip determines which list
gets to select a document first
Record which list contributed which document
Radlinski et al., How does click-through data reflect retrieval quality? 2008
38. 1) Interleaving, 2) Comparison
List l1: d1, d2, d3, d4. List l2: d2, d3, d4, d1.
Four possible interleaved lists l, with different assignments a (click marked with x):
a) d1 (1), d2 (2), d3 (1) x, d4 (2)
b) d2 (2), d1 (1), d3 (1) x, d4 (2)
c) d2 (2), d1 (1), d3 (2) x, d4 (1)
d) d1 (1), d2 (2), d3 (2) x, d4 (1)
For the interleaved lists a) and b) l1 wins the comparison. l2 wins in the other two cases.
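A sketch of team-draft interleaving and its comparison; the coin-flip-per-pair scheme follows the description above, but the names and details are illustrative, and the sketch assumes both rankings contain the same documents.

```python
import random

def team_draft_interleave(l1, l2, length=4):
    """Each round a coin flip decides which ranker picks first; each ranker then
    contributes its highest-ranked document not yet in the interleaved list."""
    interleaved, assignment = [], []
    while len(interleaved) < length:
        first = random.randint(1, 2)
        for team in (first, 3 - first):
            src = l1 if team == 1 else l2
            doc = next(d for d in src if d not in interleaved)
            interleaved.append(doc)
            assignment.append(team)
            if len(interleaved) == length:
                break
    return interleaved, assignment

def team_draft_compare(interleaved, assignment, clicks):
    """Credit each click to the ranker that contributed the clicked document."""
    credit = [assignment[interleaved.index(d)] for d in clicks]
    c1, c2 = credit.count(1), credit.count(2)
    return "l1 wins" if c1 > c2 else "l2 wins" if c2 > c1 else "tie"

l1 = ["d1", "d2", "d3", "d4"]
l2 = ["d2", "d3", "d4", "d1"]
lst, a = team_draft_interleave(l1, l2)
print(lst, a, team_draft_compare(lst, a, clicks=["d3"]))
```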
39. Three methods (3)
Document-constraint method
Result lists are interleaved and clicks observed as for the balanced interleave method
Infer constraints on pairs of individual documents
based on clicks and ranks
For each pair of a clicked document and a higher-ranked non-
clicked document, a constraint is inferred that requires the
former to be ranked higher than the latter
The original list that violates fewer constraints is deemed
superior
He et al., Evaluation of methods for relative comparison of retrieval systems based on clickthroughs, 2009
40. 1) Interleaving, 2) Comparison
List l1: d1, d2, d3, d4. List l2: d2, d3, d4, d1.
Two possible interleaved lists l: (d1, d2, d3, d4) with clicks on d2 and d3, and (d2, d1, d3, d4) with clicks on d1 and d3
Inferred constraints for the first list: d2 ≻ d1 (violated by l1), d3 ≻ d1 (violated by l1)
Inferred constraints for the second list: d1 ≻ d2 (violated by l2), d3 ≻ d2 (violated by l1 and l2)
l2 wins the first comparison, and loses the second. In expectation l2 wins.
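A sketch of the constraint-based comparison from this example; it assumes every interleaved document occurs in both original lists.

```python
def document_constraint_compare(l1, l2, interleaved, clicks):
    """Infer, for each clicked document, a constraint against every non-clicked
    document ranked above it; the original list violating fewer constraints wins."""
    clicked = set(clicks)
    constraints = []                           # (preferred, dispreferred) pairs
    for rank, doc in enumerate(interleaved):
        if doc in clicked:
            constraints += [(doc, above) for above in interleaved[:rank]
                            if above not in clicked]
    def violations(lst):
        return sum(lst.index(pref) > lst.index(dispref) for pref, dispref in constraints)
    v1, v2 = violations(l1), violations(l2)
    return "l1 wins" if v1 < v2 else "l2 wins" if v2 < v1 else "tie"

l1 = ["d1", "d2", "d3", "d4"]
l2 = ["d2", "d3", "d4", "d1"]
print(document_constraint_compare(l1, l2, ["d1", "d2", "d3", "d4"], ["d2", "d3"]))  # l2 wins
print(document_constraint_compare(l1, l2, ["d2", "d1", "d3", "d4"], ["d1", "d3"]))  # l1 wins
```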
41. Assessing comparison methods
Bias
Don’t prefer either ranker when clicks are random
Sensitivity
The ability of a comparison method to detect
differences in the quality of rankings
Balanced interleave and document constraint are
biased
Team draft may suffer from insensitivity
42. A new proposal
Briefly
Based on team draft
Instead of interleaving deterministically, model the
interleaving process as random sampling from
softmax functions that define probability
distributions over documents
Derive an estimator that is unbiased and sensitive
to small ranking changes
Marginalize over all possible assignments to make
estimates more reliable
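A sketch of the sampling step only: each ranking is turned into a softmax-like distribution whose mass decays with rank, and at every position one of the two distributions is drawn and a remaining document is sampled from it. The decay exponent tau = 3 roughly reproduces the probabilities shown on the next slide; the comparison step, which marginalizes over assignments, is not included, and the helper names are illustrative.

```python
import numpy as np

def softmax_over_ranks(ranking, tau=3.0):
    """Turn a ranking into a probability distribution over its documents,
    with probability decaying with rank (tau controls the decay)."""
    ranks = np.arange(1, len(ranking) + 1, dtype=float)
    weights = ranks ** -tau
    return dict(zip(ranking, weights / weights.sum()))

def probabilistic_interleave(l1, l2, rng=None):
    """At every rank of the interleaved list, pick one of the two distributions
    uniformly at random and sample a document that was not used yet.
    Assumes both rankings contain the same documents."""
    rng = rng or np.random.default_rng()
    dists = [softmax_over_ranks(l1), softmax_over_ranks(l2)]
    docs = list(dict.fromkeys(l1 + l2))
    interleaved, assignment = [], []
    while len(interleaved) < len(docs):
        s = int(rng.integers(2))                       # choose s1 or s2
        remaining = {d: p for d, p in dists[s].items() if d not in interleaved}
        names = list(remaining)
        probs = np.array(list(remaining.values()))
        probs /= probs.sum()                           # renormalise over remaining docs
        interleaved.append(str(rng.choice(names, p=probs)))
        assignment.append(s + 1)
    return interleaved, assignment

l1 = ["d1", "d2", "d3", "d4"]
l2 = ["d2", "d3", "d4", "d1"]
print(probabilistic_interleave(l1, l2))
```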
43. 1) Probabilistic interleaving, 2) Probabilistic comparison
Each ranking defines a softmax function: l1 → softmax s1, l2 → softmax s2, with probabilities decaying over ranks (e.g., P(d at rank 1) = 0.85, rank 2 = 0.10, rank 3 = 0.03, rank 4 = 0.02)
For each rank of the interleaved list l, draw one of {s1, s2} and sample a document; all permutations of the documents in D are possible
For an incoming query, the system generates an interleaved list and observes clicks (in the example, clicks on d2 and d3 of the interleaved list d1, d2, d3, d4)
For the comparison, marginalize over all possible assignments a: each assignment is generated, its outcome o(ci, a) and probability P(a|li, qi) are computed, and from these the probability of each possible comparison outcome
This is expensive, but only needs to be done down to the rank of the lowest observed click
In the example the outcome probabilities are 0.108 and 0.144: s2 (based on l2) wins the comparison, while s1 and s2 tie in expectation
44. Question
Do analytical differences between the methods
translate into performance differences?
45. Evaluation
Set-up
Simulation based on dependent click model
Perfect and realistic instantiations
Not binary, but with relevance levels
MSLR-WEB30k Microsoft learning to rank data set
136 doc features (i.e., rankers)
Three experiments
Exhaustive comparison of all distinct ranker pairs
9,180 distinct pairs
Selection of small subsets for detailed analysis
Add noise
46. Results (1)
Experiment 1
Accuracy
Percentage of pairs of rankers for which a comparison
method identified the better ranker after 1000 queries
Method Accuracy
balanced interleave 0.881
team draft 0.898
document constraint 0.857
new 0.914
47. Results (2): overview
“Problematic” pairs
Pairs of rankers for which all methods correctly identified the better one
Three achieved perfect accuracy within 1000 queries
For each method, incorrectly judged pair with highest difference in NDCG
50. Summary
What?
Methods for evaluating rankers using implicit
feedback
Analysis of interleaved comparison methods in
terms of bias and sensitivity
And so?
Introduced a new probabilistic interleaved
comparison method, unbiased and sensitive
Experimental analysis: more accurate, with
substantially fewer observed queries, more robust
51. What’s next here?
Evaluate in a real-life setting in the future
With more reliable and faster convergence, our
approach can pave the way for online learning to
rank methods that require many comparisons
53. Online learning to rank
Emphasis on implicit feedback collected during
normal operation of the search engine
Balancing exploration and exploitation
Probabilistic method for inferring preferences from
clicks
54. Information retrieval observatory
Academic experiments on online learning and
implicit feedback used simulators
Need to validate the simulators
What’s really needed
Move away from artificial explicit feedback to
natural implicit feedback
Shared experimental environment for observing
users in the wild as they interact with systems
57. Bias
1) Interleaving, 2) Comparison
List l1: d1, d2, d3, d4. List l2: d2, d3, d4, d1.
Two possible interleaved lists l: (d1, d2, d3, d4) and (d2, d1, d3, d4)
Observed clicks c: on d2 and d4 in the first interleaved list; on d1 and d4 in the second
First comparison: k = min(4, 3) = 3; click counts c1 = 1, c2 = 2
Second comparison: k = min(4, 4) = 4; click counts c1 = 2, c2 = 2
l2 wins the first comparison, and the lists tie for the second. In expectation l2 wins.
58. Sensitivity
1) Interleaving, 2) Comparison
List l1: d1, d2, d3, d4. List l2: d2, d3, d4, d1.
Four possible interleaved lists l, with different assignments a (click marked with x):
a) d1 (1), d2 (2), d3 (1) x, d4 (2)
b) d2 (2), d1 (1), d3 (1) x, d4 (2)
c) d2 (2), d1 (1), d3 (2) x, d4 (1)
d) d1 (1), d2 (2), d3 (2) x, d4 (1)
For the interleaved lists a) and b) l1 wins the comparison. l2 wins in the other two cases.