Which Diversity Evaluation
Measures Are “Good”?
Preprint: http://waseda.box.com/sigir2019
This slide deck: http://www.slideshare.net/tetsuyasakai/sigir2019
Tetsuya Sakai and Zhaohao Zeng
Waseda University, Japan
tetsuyasakai@acm.org
zhaohao@fuji.waseda.jp
24th July @ SIGIR 2019, Paris.
1
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
2
System improvements that don’t
matter to the user
User’s perception
of SERP quality
Evaluation measure score
System System System
3
Measures that can’t detect
differences that matter to the user
User’s perception
of SERP quality
Evaluation measure score
System
System
System
4
We need good evaluation measures
User’s perception
of SERP quality
Evaluation measure score
System
System
System
System
“good” =
align well with
user perception
5
Which of them are “good”?
Adhoc IR
• AP
• nDCG
• ERR
• RBP
• Q-measure
• TBG
• U-measure …
Diversified IR
• α-nDCG
• Intent-Aware (IA)
measures
• D(#)-measures
• RBU …
Diversity measures
tend to be complex.
6
Measures derived from axioms (1)
An axiom from [Amigo+18] :
Given the above SERP, which of the two new reldocs
should be appended to it?
#reldocs for Intent i
>
#reldocs for Intent i’
reldoc for i reldoc for i’
SERP
7
Measures derived from axioms (2)
An axiom from [Amigo+18] :
Given the above SERP, which of the two new reldocs
should be appended to it?
reldoc for i
reldoc for i’
SERP
This one because the
SERP will be more
balanced
Axioms are useful for understanding
the properties of measures. But…
8
Measures derived from axioms (3)
An axiom from [Amigo+18] :
Assumptions:
(1) binary intentwise relevance
(2) flat intent probabilities
(3) a document is never relevant to multiple intents
reldoc for i’
SERP
How practical are they?
9
How much do axioms matter to
real users?
For designing a “good” measure,
• Is each axiom necessary/practical?
• Is the set of axioms sufficient?
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
10
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
11
Measures considered
Adhoc IR
• Q
• ERR
• EBR (new!)
• RBP
• iRBU (new!)
• nDCG
Diversified IR
• D-{Q, ERR, EBR, RBP,
nDCG}
• D#-{Q, ERR, EBR, RBP,
nDCG}
• {Q, ERR, EBR, RBP,
nDCG}-IA
• RBU
Normalised
Cumulative
Utility family
12
Normalised Cumulative Utility (1)
[Sakai+08EVIA]
[Diagram: a ranked list, ranks r = 1, 2, 3, …, scanned by a population of users]
13
Normalised Cumulative Utility (2)
[Diagram: stopping probability at rank r; some users abandon the list at r=1, others at r=3, and so on]
14
Normalised Cumulative Utility (3)
[Diagram: utility at rank r = the utility of the documents down to r for the user group that stops there]
NCU is “expected utility”
15
Q-measure is an NCU (1)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram] Ranked list: Nonrelevant, Highly rel (3), Nonrelevant, Nonrelevant, Partially rel (1). The third reldoc, Partially rel (1), is not retrieved.
Stopping probability distribution: uniform over the relevant docs, i.e. 33% of users stop at each of the three known reldocs.
16
Q-measure is an NCU (2)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram] Same ranked list: reldocs retrieved at ranks 2 and 5; the third reldoc is not retrieved.
BR(2) = 4/6
BR(5) = 6/10
Q = ( BR(2) + BR(5) + 0 ) / 3 = 0.422
Q generalizes AP by
using the Blended Ratio
instead of Prec as Utility
17
BR combines Prec and Normalised
Cumulative Gain (1)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram] Ranked list: Nonrelevant, Highly rel (3), Nonrelevant, Nonrelevant, Partially rel (1). Ideal list: Highly rel (3), Partially rel (1), Partially rel (1).
Prec(2) = 1/2
Cumulative gains:
r      : 1 2 3 4 5
cg(r)  : 0 3 3 3 4
cg*(r) : 3 4 5 5 5
BR(2) = (1+3)/(2+4) = 4/6 with β=1
18
BR combines Prec and Normalised
Cumulative Gain (2)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram] Same ranked list, ideal list, and cumulative gain table as above.
Prec(5) = 2/5
BR(5) = (2+4)/(5+5) = 6/10 with β=1
19
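As a concrete illustration, here is a minimal Python sketch (ours, not the authors' official NTCIREVAL tool) of the Blended Ratio and Q-measure; it reproduces the worked example above with gains [0, 3, 0, 0, 1], ideal gains [3, 1, 1], and β = 1.

```python
# Minimal sketch of the Blended Ratio and Q-measure (illustrative only).
def q_measure(gains, ideal_gains, beta=1.0):
    """gains[r-1] = gain of the doc at rank r;
    ideal_gains = gains of the R known reldocs, sorted in decreasing order."""
    R = len(ideal_gains)
    cg_star, total = [], 0.0          # cumulative ideal gain cg*(r)
    for r in range(len(gains)):
        total += ideal_gains[r] if r < R else 0.0
        cg_star.append(total)
    cg, count, q_sum = 0.0, 0, 0.0
    for r, g in enumerate(gains, start=1):
        cg += g                       # cumulative gain cg(r)
        if g > 0:                     # reldoc retrieved at rank r
            count += 1                # C(r): #reldocs down to rank r
            q_sum += (count + beta * cg) / (r + beta * cg_star[r - 1])
    return q_sum / R                  # unretrieved reldocs contribute 0

print(q_measure([0, 3, 0, 0, 1], [3, 1, 1]))  # 0.4222..., as on slide 17
```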
Patience parameter β of BR
(binary relevance environment)
[Plot: BR(r1) as a function of r1 = 1…20 for β = 0.1, 1, 10, with R = 5]
r1 <= R ⇒ BR(r1) = (1+β)/(r1+β·r1) = 1/r1
r1 > R ⇒ BR(r1) = (1+β)/(r1+β·R)
r1 : rank of the 1st relevant doc
Large β ⇒ more tolerance to relevant docs at low ranks
20
ERR is an NCU (1) [Chapelle+09]
[Diagram] Ranked list: Nonrelevant, Highly rel (stopping probability 3/4), Nonrelevant, Nonrelevant, Partially rel (stopping probability 1/4).
Stopping probability distribution: accommodates diminishing return.
21
ERR is an NCU (2)
[Diagram] Users reach rank 2 with probability 1 and stop at the highly relevant doc with probability 3/4:
P_ERR(2) = 3/4
22
ERR is an NCU (3)
[Diagram] Users pass rank 2 with probability 1/4, reach rank 5, and stop there with probability 1/4:
P_ERR(5) = 1/4 * 1/4 = 1/16
23
ERR is an NCU (4)
[Diagram] Ranked list: Nonrelevant, Highly rel (3), Nonrelevant, Nonrelevant, Partially rel (1).
RR(2) = 1/2
RR(5) = 1/5
P_ERR(2) = 3/4
P_ERR(5) = 1/16
ERR = 3/4 * 1/2 + 1/16 * 1/5 = 0.388
Only the final doc is considered useful (binary relevance).
24
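The ERR computation above fits in a few lines; in this sketch the grade-to-stopping-probability mapping (2^g - 1)/2^gmax with gmax = 2 is an assumption inferred from the 3/4 and 1/4 values on the slides.

```python
# Minimal sketch of ERR [Chapelle+09]: diminishing-return stopping, RR utility.
def err(grades, g_max=2):
    """grades[r-1] = relevance grade of the doc at rank r (0 = nonrelevant)."""
    reach, score = 1.0, 0.0                # prob. of reaching rank r; ERR so far
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** g_max   # stopping probability at this doc
        score += reach * stop * (1.0 / r)  # utility = reciprocal rank
        reach *= 1.0 - stop                # diminishing return
    return score

print(err([0, 2, 0, 0, 1]))  # 0.3875, i.e. the slide's 0.388
```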
New measure:
EBR (Expected Blended Ratio)
Measure | Stopping probabilities | Utility
Q | Uniform | BR
ERR | Diminishing return | RR (only the final doc is considered useful; binary relevance)
EBR | Diminishing return | BR
EBR utilises graded relevance for both.
25
RBP [Moffat+08] is an NCU
[Diagram] Ranked list: Nonrelevant, Highly rel (3/3), Nonrelevant, Nonrelevant, Partially rel (1/3).
Stopping probability distribution: does not consider document relevance; users proceed from each rank with probability p and stop with probability 1-p.
P_RBP(2) = p^(2-1) * (1-p)
P_RBP(5) = p^(5-1) * (1-p)
26
New measure: intentwise
Rank-Biased Utility (iRBU)
[Diagram] Ranked list: Nonrelevant, Highly rel (3), Nonrelevant, Nonrelevant, Partially rel (1).
P_ERR(2) = 3/4
P_ERR(5) = 1/16
iRBU = 3/4 * p^2 + 1/16 * p^5
The utility p^r ignores relevance; it represents inverse effort rather than utility.
27
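A sketch of iRBU under the same assumed grade mapping; compared with ERR, only the utility changes, from RR to p^r.

```python
# Minimal sketch of iRBU: ERR's stopping distribution, utility p^r.
def irbu(grades, p=0.99, g_max=2):
    reach, score = 1.0, 0.0
    for r, g in enumerate(grades, start=1):
        stop = (2 ** g - 1) / 2 ** g_max
        score += reach * stop * p ** r     # p^r: inverse effort, not relevance
        reach *= 1.0 - stop
    return score

print(irbu([0, 2, 0, 0, 1]))  # 0.795 with p=0.99, as on slide 75
```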
Summary of adhoc measures
(nDCG omitted)
Measure | Stopping probabilities | Utility
Q | Uniform | BR
ERR | Diminishing return | RR
EBR | Diminishing return | BR
iRBU | Diminishing return | p^r
iRBU is a component of RBU [Amigo+18], a diversity measure.
28
Measures considered
Adhoc IR
• Q
• ERR
• EBR (new!)
• RBP
• iRBU (new!)
• nDCG
Diversified IR
• D-{Q, ERR, EBR, RBP,
nDCG}
• D#-{Q, ERR, EBR, RBP,
nDCG}
• {Q, ERR, EBR, RBP,
nDCG}-IA
• RBU
29
Diversified search
• Given an ambiguous/underspecified query, produce a
single Search Engine Result Page that satisfies
different user intents!
• Challenge: balancing relevance and diversity
SERP (Search Engine Result Page)
Highly relevant
near the top
Give more
space to
popular intents?
Give more space
to informational
intents?
Cover many
intents
30
Two different approaches to
evaluating diversified search
• Intent-Aware measures [Agrawal+09]
(1) Compute a measure for each intent
(2) Combine the measures using intent probabilities as
weights
• D(#)-measures [Sakai+11SIGIR]
(1) Combine intentwise graded relevance with intent
probabilities to compute the gain of each document
(2) Construct an ideal list based on the gain, and then
compute a graded relevance measure based on it
31
RBU [Amigo+18]
[Formula: RBU = an intent-aware version of iRBU with an effort penalty e]
We let e=0.01 throughout this study
32
D-measures (1)
Intent i: “harry potter books”, Pr(i|q) = 0.7
Intent j: “pottermore.com”, Pr(j|q) = 0.3
R = 3 relevant documents, 2 intents.
Per-intent gain values:
 | gi | gj
Reldoc1 | Partially rel: 1 | Perfect: 7
Reldoc2 | Partially rel: 1 | Partially rel: 1
Reldoc3 | Highly rel: 3 | Nonrel: 0
33
D-measures (2)
Global gain of each reldoc: Pr(i|q) gi + Pr(j|q) gj
Reldoc1: 0.7*1+0.3*7 = 2.8
Reldoc2: 0.7*1+0.3*1 = 1.0
Reldoc3: 0.7*3+0.3*0 = 2.1
Ideal list based on global gains: Reldoc1 (2.8), Reldoc3 (2.1), Reldoc2 (1.0).
D-DCG* = 2.8 + 2.1/log2(2+1) + 1.0/log2(3+1) = 4.62
34
D-measures (3)
SERP to be evaluated: nonrel, nonrel, Reldoc3 (global gain 2.1), nonrel.
D-DCG = 2.1/log2(3+1) = 1.05
D-DCG* = 4.62 (from the ideal list Reldoc1, Reldoc3, Reldoc2)
D-nDCG = D-DCG/D-DCG* = 0.23
35
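The D-nDCG arithmetic above, as a minimal sketch (illustrative; the log2(r+1) discount follows the slide).

```python
# Minimal sketch of D-nDCG for the 2-intent running example.
from math import log2

def d_dcg(global_gains):
    return sum(g / log2(r + 1) for r, g in enumerate(global_gains, start=1))

probs = [0.7, 0.3]                                 # Pr(i|q), Pr(j|q)
per_intent = {"Reldoc1": [1, 7], "Reldoc2": [1, 1], "Reldoc3": [3, 0]}
gg = {d: sum(p * g for p, g in zip(probs, gains))  # global gains: 2.8, 1.0, 2.1
      for d, gains in per_intent.items()}
ideal = sorted(gg.values(), reverse=True)          # Reldoc1, Reldoc3, Reldoc2
serp = [0.0, 0.0, gg["Reldoc3"], 0.0]              # nonrel, nonrel, Reldoc3, nonrel
print(d_dcg(serp) / d_dcg(ideal))                  # D-nDCG = 1.05/4.62 = 0.23
```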
Intent recall (aka subtopic recall [Zhai+03])
SERP to be evaluated: nonrel, nonrel, Reldoc3, nonrel. Only Intent i is covered by the SERP (Reldoc3 is relevant to i only).
I-rec = #intents covered by SERP / #intents = 1/2
36
D#-measure = γ I-rec + (1-γ) D-measure
[Scatter plot: official results from the NTCIR-10 INTENT-2 task [Sakai+13IRJ], with D#-nDCG contour lines; axes: pure diversity (I-rec) vs. overall relevance (D-nDCG)]
37
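Continuing the running example: a one-line sketch of the D#-measure. γ = 0.5 is an assumption here (the slide leaves γ unspecified).

```python
# D#-nDCG for the running example; gamma = 0.5 is assumed.
i_rec, d_ndcg, gamma = 0.5, 0.227, 0.5
print(gamma * i_rec + (1 - gamma) * d_ndcg)   # D#-nDCG ≈ 0.36
```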
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
38
NTCIR-9 INTENT-1 data [Song+11]
• Corpus: ClueWeb09 Japanese
• 100 topics
• 10.91 intents/topic
• 15 runs
• 5-point relevance grades, exponential gain value setting (gain = 2^level - 1)
Example topic with two intents, Pr = 0.7 / 0.3:
Intent 1 (0.7) | Intent 2 (0.3)
L4 (15) | L3 (7)
L3 (7) | L0 (0)
L1 (1) | L1 (1)
39
Creating graded adhoc qrels from the diversity qrels
Diversity qrels (used for computing diversity measures for the INTENT-1 runs): per-intent levels as above, e.g. Intent 1 (0.7): L4 (15), L3 (7), L1 (1); Intent 2 (0.3): L3 (7), L0 (0), L1 (1).
Adhoc qrels (used for computing adhoc measures for the INTENT-1 runs), obtained by collapsing the per-intent levels:
log2( 4+3 +1 ) = 3 ⇒ L3 (7)
log2( 3+0 +1 ) = 2 ⇒ L2 (3)
log2( 1+1 +1 ) = 1 ⇒ L1 (1)
40
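A sketch of the collapsing rule; flooring the log is an assumption inferred from the displayed values (log2(1+1+1) ≈ 1.58 is shown as L1).

```python
# Collapsing per-intent levels into a single adhoc level (floor assumed),
# with the exponential gain setting gain = 2^level - 1.
from math import floor, log2

def adhoc_level(per_intent_levels):
    return floor(log2(sum(per_intent_levels) + 1))

for levels in ([4, 3], [3, 0], [1, 1]):
    L = adhoc_level(levels)
    print(levels, "-> L%d (gain %d)" % (L, 2 ** L - 1))
# [4, 3] -> L3 (gain 7); [3, 0] -> L2 (gain 3); [1, 1] -> L1 (gain 1)
```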
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
41
Rank correlation
System ranking by Measure X System ranking by Measure Y
System A
System B
System O
:
System B
System A
System O
:
swap
swap
Quantify consistency with
Kendall’s tau rank correlation
(Does not tell us which measure is better)
42
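For reference, a minimal sketch of Kendall's tau over two system score lists (scipy.stats.kendalltau gives the same value when there are no ties).

```python
# Minimal Kendall's tau between the system rankings induced by two measures.
from itertools import combinations

def kendall_tau(scores_x, scores_y):
    """scores_x[s], scores_y[s] = mean scores of system s under measures X, Y."""
    pairs = list(combinations(range(len(scores_x)), 2))
    conc = disc = 0
    for a, b in pairs:
        s = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        conc += s > 0                 # pair ordered the same way
        disc += s < 0                 # pair swapped between the rankings
    return (conc - disc) / len(pairs)
```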
Prec is very different
from other adhoc measures
(highest tau: .572)
43
ERR is slightly different
from other adhoc measures
44
EBR, nDCG, Q, RBP, iRBU
are very similar
45
I-rec is very different
from D-measures
(highest tau: .752)
46
D-ERR is slightly different
from other D-measures
47
D-{nDCG, Q, RBP}
are very similar
48
RBU, {EBR, ERR}-IA
are very similar
since they all use
intentwise
diminishing return
49
Rank correlation summary
• Adhoc
- Prec is different (so is ERR)
- EBR, nDCG, Q, RBP, iRBU are similar
• Diversity
- I-rec is different (so is D-ERR)
- D-{nDCG, Q, RBP} are similar
- RBU, {EBR, ERR}-IA are similar
So which measures are “good”? We can’t tell.
50
Discriminative power
[Sakai06SIGIR, Sakai12WWW]
• Compute a Randomised Tukey HSD test p-value for
every system pair (15*14/2=105 pairs)
• Quantifies how often statistically significant results
can be obtained
Significance level: 5%
51
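A hedged sketch of the randomised Tukey HSD test behind discriminative power (simplified from the description in [Sakai12WWW]; not the authors' exact tool).

```python
# Randomised Tukey HSD: the null distribution is the max range of system
# means over random within-topic permutations of the system labels.
import numpy as np

def randomised_tukey_hsd(scores, trials=1000, seed=0):
    """scores: (n_topics, n_systems) per-topic effectiveness matrix.
    Returns {(i, j): p-value} for every system pair."""
    rng = np.random.default_rng(seed)
    observed = scores.mean(axis=0)
    max_range = np.empty(trials)
    for t in range(trials):
        permuted = np.apply_along_axis(rng.permutation, 1, scores)
        means = permuted.mean(axis=0)
        max_range[t] = means.max() - means.min()
    n = scores.shape[1]
    return {(i, j): float((max_range >= abs(observed[i] - observed[j])).mean())
            for i in range(n) for j in range(i + 1, n)}
```

Discriminative power is then the fraction of the 105 system pairs with p < 0.05.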
Discriminative power: adhoc
RBP, nDCG, iRBU, Q > EBR >> ERR >> Prec
52
Discriminative power: diversity
D#-measures >> IA measures.
RBU is also discriminative.
53
Discriminative power summary
• Adhoc
RBP, nDCG, iRBU, Q (39-42 sig. differences)
> EBR >> ERR >> Prec (only 14 sig. differences)
• Diversity
D#-measures >> IA measures
e.g. D#-EBR (48 sig. differences)
>> EBR-IA (only 30 sig. differences)
RBU is also discriminative
But discriminative does not necessarily mean correct
54
Unanimity (1) [Amigo+18]
Preference matrix over N = 3 topic-SERP1-SERP2 triplets:
triplet | M1 | M2 | M3
A | > | > | >
B | = | < | >
C | < | < | >
A high-unanimity measure = one that agrees with many other measures re: SERP preferences.
55
Unanimity (2)
(Same preference matrix as above.)
U({M1}) = {<A, GT>, <B, EQ>, <C, LT>}
size[U({M1})] = 1 + 0.5 + 1 = 2.5 (an EQ preference counts 0.5)
56
Unanimity (3)
(Same preference matrix as above.)
U({M1}) = {<A, GT>, <B, EQ>, <C, LT>}, size[U({M1})] = 2.5
M2 and M3 agree only on triplet A:
U({M2, M3}) = {<A, GT>}, size[U({M2, M3})] = 1
57
Unanimity (4)
(Same preference matrix as above.)
U({M1}) = {<A, GT>, <B, EQ>, <C, LT>}, size[U({M1})] = 2.5
U({M2, M3}) = {<A, GT>}, size[U({M2, M3})] = 1
Intersection, i.e. where all measures agree:
U({M1}) ∩ U({M2, M3}) = {<A, GT>}, size[U({M1}) ∩ U({M2, M3})] = 1
Unanimity of M1 = log2 [ (1/3) / { (2.5/3) * (1/3) } ] = 0.263
58
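A hedged sketch of the Unanimity computation as illustrated above (helper names are ours; EQ preferences count 0.5 toward set sizes, as implied by size[U({M1})] = 2.5).

```python
# Unanimity on the 3-triplet example above (illustrative helper names).
from math import log2

def agreed(prefs):                    # prefs: list of {triplet: '>'|'='|'<'}
    """Triplets on which all given measures state the same preference."""
    return {t: prefs[0][t] for t in prefs[0]
            if len({m[t] for m in prefs}) == 1}

def size(pref_set):                   # EQ ('=') counts 0.5; GT/LT count 1
    return sum(0.5 if lab == "=" else 1.0 for lab in pref_set.values())

def unanimity(target, others, n):
    u_m, u_rest = agreed([target]), agreed(others)
    inter = {t: lab for t, lab in u_m.items() if u_rest.get(t) == lab}
    return log2((size(inter) / n) /
                ((size(u_m) / n) * (size(u_rest) / n)))

m1 = {"A": ">", "B": "=", "C": "<"}
m2 = {"A": ">", "B": "<", "C": "<"}
m3 = {"A": ">", "B": ">", "C": ">"}
print(round(unanimity(m1, [m2, m3], 3), 3))   # 0.263, as on the slide
```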
Unanimity results
Unlike [Amigo+18], RBU’s
results are unremarkable…
59
Unanimity summary
• RBP, RBP-IA, D#-RBP (all with p=0.99) have high
unanimity scores.
• RBU’s scores are unremarkable.
• Unanimity results depend heavily on the set of
measures considered.
• Is a measure that agrees with many other measures
“good”?
• Is a person that agrees with many other people
“good”?
60
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
61
Offline results don’t really tell us
which measures are “good”
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
Kendall’s tau for Measure i =
(#agreements_w_gold - #disagreements_w_gold)/#pairs
62
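A minimal sketch of this gold-agreement tau (how '=' verdicts are scored is our simplification).

```python
# Agreement of a measure's SERP preferences with the gold preferences.
def gold_tau(measure_prefs, gold_prefs):
    """Each list holds '>', '<', or '=' per topic-SERP-SERP triplet."""
    agreements = sum(m == g for m, g in zip(measure_prefs, gold_prefs))
    disagreements = len(gold_prefs) - agreements
    return (agreements - disagreements) / len(gold_prefs)
```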
SERP pairs judged
• 100 topics * 105 system pairs
= 10,500 topic-SERP-SERP triplets
⇒ filter by ΔPrec>0.1 AND ΔI-rec>0.1 (SERP pairs
should be different): 1,258 triplets
⇒ removed incomplete SERPs and problematic
HTML files: 1,127 triplets were used.
63
Judgement interface
[Screenshot of the SERP preference judgement interface]
64
Judges
• 15 Japanese-course computer science students at
Waseda University
• The presentation order of triplets was randomized
• The presentation order of the relevance and diversity questions was also randomized
• The instructions said judges were expected to process one triplet in 3 minutes on average; the actual average time spent was 50.1 seconds.
65
A note on [Sanderson+10]
• They also collected SERP preference data, but a
TREC web subtopic (not the entire topic) was given
to each judge.
• The subtopic-based preferences were aggregated
to form the gold preference for the entire topic.
• Hence their preference data is not about which
SERP is more diversified.
• Our diversity question “Which SERP is more likely
to satisfy a higher number of users?” is new!
66
Judgement reliability
Relevance preferences: 15 judges × 1,127 SERP pairs; each cell is LEFT, irrelevant, or RIGHT; Krippendorff’s alpha = 0.406.
Diversity preferences: 15 judges × 1,127 SERP pairs; each cell is LEFT, irrelevant, or RIGHT; Krippendorff’s alpha = 0.356.
Substantial agreement beyond chance
67
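A hedged sketch of Krippendorff's alpha for nominal labels with a complete judge × pair matrix (the general definition also handles missing cells; none are assumed here).

```python
# Krippendorff's alpha, nominal labels, no missing cells (simplified).
from collections import Counter

def krippendorff_alpha(matrix):
    """matrix[j][u] = label of judge j on SERP pair u (LEFT/irrelevant/RIGHT)."""
    units = list(zip(*matrix))            # the judges' labels per SERP pair
    m = len(matrix)                       # judges per pair
    # observed disagreement: fraction of within-unit label pairs that differ
    d_o = sum(1 - sum(c * (c - 1) for c in Counter(u).values()) / (m * (m - 1))
              for u in units) / len(units)
    # expected disagreement from the pooled label distribution
    pooled = Counter(lab for u in units for lab in u)
    n = sum(pooled.values())
    d_e = 1 - sum(c * (c - 1) for c in pooled.values()) / (n * (n - 1))
    return 1 - d_o / d_e
```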
Did every judge do the job
properly? YES
Leave-one-assessor-out: recompute alpha on a 14 × 1,127 matrix, removing one judge at a time.
Alpha does not change much even when a judge (e.g. Judge 01) is removed.
68
Full gold data
Starting from the 15 × 1,127 preference matrices (relevance: alpha = 0.406; diversity: alpha = 0.356), the final label of each SERP pair is decided by majority vote.
Full gold data: 1,115 SERP pairs (relevance), 1,119 SERP pairs (diversity).
69
High agreement gold data
Starting from the same 15 × 1,127 preference matrices, keep only the columns (SERP pairs) where at least 9/15 judges agreed; final label again by majority vote.
High-agreement gold data: 894 SERP pairs (relevance), 897 SERP pairs (diversity).
Krippendorff’s alpha: 0.406 → 0.518 (relevance), 0.356 → 0.453 (diversity).
70
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
71
Agreement with gold relevance (1)
72
Agreement with gold relevance (2)
iRBU (p=0.99) and nDCG perform best;
RBP (p=0.99, 0.85), Q, and EBR also do well.
73
Agreement with gold relevance (3)
Prec does not perform well
⇒ users care about ranking;
ERR performs worst
⇒ users care about the number of relevant docs.
74
iRBU (p=0.99) performs surprisingly well!
Suggests that it may be good to ignore
relevance in the Utility function!
[Diagram] Ranked list: Nonrelevant, Highly rel (3), Nonrelevant, Nonrelevant, Partially rel (1).
P_ERR(2) = 3/4
P_ERR(5) = 1/16
iRBU = 3/4 * 0.99^2 + 1/16 * 0.99^5 = 3/4 * 0.980 + 1/16 * 0.951 = 0.795
(Recall: the utility p^r ignores relevance; it represents inverse effort rather than utility.)
75
Agreement with gold diversity (1)
76
Agreement with gold diversity (2)
D#-{nDCG, RBP (p=0.85)} perform best;
Other D#-measures and RBU also do well.
But since D#-{nDCG, RBP (p=0.85)} outperform RBU
even though they satisfy fewer axioms [Amigo+18],
measures that satisfy many axioms are not necessarily
the ones that align most with user perception.
77
Agreement with gold diversity (3)
D#-measures >> IA measures.
Use D# for diversity evaluation!
78
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
79
Conclusions: agreement with user preferences
• Adhoc measures (relevance preferences):
- iRBU (p=0.99) and nDCG perform best;
- RBP (p=0.99, 0.85), Q, and EBR also do well.
• Diversity measures (diversity preferences):
- D#-{nDCG, RBP (p=0.85)} perform best;
- Other D#-measures and RBU also do well.
- D# >> IA-measures.
- D#-{nDCG, RBP (p=0.85)} > RBU. Satisfying more
axioms may not mean better alignment with users.
Also, D# are more discriminative than IA!
80
Future work: close the gap
The present study: those who judged the documents
≠ those who performed SERP preference judgements.
Future work: let the same group of people perform both
doc preference AND SERP preference judgements.
81
TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
82
Giving it all away at
http://waseda.box.com/SIGIR2019PACK
(a) topic-by-run matrices for all 30 evaluation
measures
(b) 1,127 topic-SERP-SERP triplets from INTENT-1
(c) two 15×1,127 user preference matrices
You can evaluate YOUR favourite adhoc and diversity
measures using our data with the NTCIR-9 INTENT-1
test collection + runs!
http://research.nii.ac.jp/ntcir/data/data-en.html
83
References (1)
[Agrawal+09] Diversifying Search Results. WSDM 2009.
[Amigo+18] An Axiomatic Analysis of Diversity Evaluation
Metrics: Introducing the Rank-Biased Utility Metric.
SIGIR 2018.
[Chapelle+09] Expected Reciprocal Rank for Graded
Relevance. CIKM 2009.
[Moffat+08] Rank-Biased Precision for Measurement of
Retrieval Effectiveness. ACM TOIS 27, 1.
[Sakai06SIGIR] Evaluating Evaluation Metrics based on
the Bootstrap, SIGIR 2006.
[Sakai+08EVIA] Modelling A User Population for
Designing Information Retrieval Metrics. EVIA 2008.
84
References (2)
[Sakai+11SIGIR] Evaluating Diversified Search Results
Using Per-Intent Graded Relevance. SIGIR 2011.
[Sakai12WWW] Evaluation with Informational and
Navigational Intents, WWW 2012.
[Sakai+13IRJ] Diversified Search Evaluation: Lessons from
the NTCIR-9 INTENT Task. Information Retrieval 16, 4.
[Sanderson+10] Do User Preferences and Evaluation
Measures Line Up? SIGIR 2010.
[Song+11] Overview of the NTCIR-9 INTENT Task. NTCIR-9.
[Zhai+03] Beyond Independent Relevance: Methods and
Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.
85