This slide deck discusses how to evaluate evaluation measures for diversified search results. It introduces several existing measures for ad-hoc and diversified retrieval and proposes some new ones, describes an offline comparison of these measures using data from a past NTCIR diversified-search evaluation (NTCIR-9 INTENT-1), and then asks which measures best align with users' preferences over search engine result pages (SERPs), as determined by collecting direct preference judgements from users.
1. Which Diversity Evaluation Measures Are “Good”?
Preprint: http://waseda.box.com/sigir2019
This slide deck: http://www.slideshare.net/tetsuyasakai/sigir2019
Tetsuya Sakai and Zhaohao Zeng
Waseda University, Japan
tetsuyasakai@acm.org
zhaohao@fuji.waseda.jp
24th July @ SIGIR 2019, Paris.
2. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
3. System improvements that don’t
matter to the user
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
three systems with different scores but the same perceived quality.]
4. Measures that can’t detect
differences that matter to the user
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
three systems with the same score but different perceived quality.]
5. We need good evaluation measures
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
the systems lie on an increasing line, so scores align with perception.]
“Good” = aligns well with user perception.
6. Which of them are “good”?
Adhoc IR
• AP
• nDCG
• ERR
• RBP
• Q-measure
• TBG
• U-measure …
Diversified IR
• α-nDCG
• Intent-Aware (IA)
measures
• D(#)-measures
• RBU …
Diversity measures tend to be complex.
7. Measures derived from axioms (1)
An axiom from [Amigo+18]:
[Diagram: a SERP that already contains more relevant documents for intent i than for intent i’
(#reldocs for intent i > #reldocs for intent i’), plus two candidate new reldocs: one for i, one for i’.]
Given the above SERP, which of the two new reldocs should be appended to it?
8. Measures derived from axioms (2)
An axiom from [Amigo+18]:
[Same diagram as before.]
Answer: the reldoc for intent i’, because the SERP will then be more balanced across intents.
Axioms are useful for understanding the properties of measures. But…
9. Measures derived from axioms (3)
An axiom from [Amigo+18]:
Assumptions:
(1) binary intentwise relevance
(2) flat intent probabilities
(3) a document is never relevant to multiple intents
[Diagram: a SERP and a reldoc for intent i’.]
How practical are these assumptions?
10. How much do axioms matter to
real users?
For designing a “good” measure,
• Is each axiom necessary/practical?
• Is the set of axioms sufficient?
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
11. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
14. Normalised Cumulative Utility (2)
[Diagram: a ranked list (r = 1, 2, 3, …) with a stopping probability attached to each rank r;
e.g. some users abandon the list at r = 1, others at r = 3.]
15. Normalised Cumulative Utility (3)
[Diagram: for the user group that stops at rank r, measure the utility of the document at r,
or of all the documents ranked 1..r — the “utility at r”.]
NCU is “expected utility”: the sum over ranks of (stopping probability at r) × (utility at r).
16. Q-measure is an NCU (1)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram: the ranked list — r1 nonrelevant, r2 highly relevant (gain 3), r3 nonrelevant, r4 nonrelevant,
r5 partially relevant (gain 1) — plus one partially relevant doc (gain 1) that is not retrieved.
Stopping probability distribution: uniform over the relevant docs, i.e. 33% of users at each of the three.]
17. Q-measure is an NCU (2)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Same ranked list as above.]
BR(2) = 4/6, BR(5) = 6/10
Q = ( BR(2) + BR(5) + 0 ) / 3 = 0.422
Q generalizes AP by using the Blended Ratio instead of Prec as the Utility.
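For concreteness, a short Python sketch that reproduces the arithmetic above (Blended Ratio with beta = 1, uniform stopping probabilities over the R = 3 known relevant docs; the list and gain values are the ones in the example):

    # Q-measure for the example above.
    # Known relevant docs: gains 3, 1, 1 (R = 3); the system ranks them as
    # [0, 3, 0, 0, 1] (gain at each rank), and one gain-1 doc is not retrieved.
    # Blended Ratio at rank r (beta = 1): BR(r) = (C(r) + cg(r)) / (r + cg*(r)).

    ranked_gains = [0, 3, 0, 0, 1]
    ideal_gains = [3, 1, 1]                       # all known relevant docs, best first
    R = len(ideal_gains)

    def blended_ratio(r):
        top = ranked_gains[:r]
        count_rel = sum(1 for g in top if g > 0)  # C(r): relevant docs in top r
        cg = sum(top)                             # cumulative gain at r
        cg_star = sum(ideal_gains[:r])            # ideal cumulative gain at r
        return (count_rel + cg) / (r + cg_star)

    # Uniform stopping probabilities (1/R) over the relevant docs;
    # the unretrieved relevant doc contributes 0.
    q = sum(blended_ratio(r) for r, g in enumerate(ranked_gains, 1) if g > 0) / R
    print(q)                                      # (4/6 + 6/10 + 0) / 3 = 0.422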
24. ERR is an NCU (2)
[Same ranked list as above.]
RR(2) = 1/2, RR(5) = 1/5
P_ERR(2) = 3/4, P_ERR(5) = 1/16
ERR = 3/4 * 1/2 + 1/16 * 1/5 = 0.388
Only the final doc (the one the user stops at) is considered useful (binary relevance).
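And the corresponding sketch for ERR [Chapelle+09] on the same list. The per-document stopping parameter gain/(1 + max gain) is an assumption inferred from the slide’s numbers (it gives 3/4 for the highly relevant doc and 1/4 for the partially relevant one), not something stated on the slide:

    # ERR for the same ranked list. The user scans top-down, stopping at a
    # relevant document with probability prob_stop(gain); only the stopping
    # rank r is credited, with utility RR(r) = 1/r (binary relevance).

    ranked_gains = [0, 3, 0, 0, 1]
    MAX_GAIN = 3

    def prob_stop(gain):
        # Assumed mapping: gives 3/4 for the highly relevant doc and 1/4 for
        # the partially relevant one, matching the numbers on the slide.
        return gain / (1 + MAX_GAIN)

    err, p_continue = 0.0, 1.0
    for r, g in enumerate(ranked_gains, 1):
        s = prob_stop(g)
        err += p_continue * s * (1.0 / r)        # P_ERR(r) * RR(r)
        p_continue *= 1.0 - s
    print(err)                                   # 3/4 * 1/2 + 1/16 * 1/5 = 0.3875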
25. New measure:
EBR (Expected Blended Ratio)
Measure   Stopping probabilities   Utility
Q         Uniform                  BR
ERR       Diminishing return       RR
EBR       Diminishing return       BR
(With RR as the Utility, only the final doc is considered useful — binary relevance.)
EBR utilises graded relevance for both the stopping probabilities and the Utility.
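A sketch of EBR as implied by the table: ERR-style (diminishing-return) stopping probabilities combined with the Blended Ratio as the Utility. The stopping-probability mapping is the same assumption as in the ERR sketch:

    # EBR (Expected Blended Ratio) as implied by the table:
    # diminishing-return stopping probabilities (as in ERR) + Blended Ratio utility,
    # so graded relevance is used both for stopping and for utility.

    ranked_gains = [0, 3, 0, 0, 1]
    ideal_gains = [3, 1, 1]
    MAX_GAIN = 3

    def prob_stop(gain):                          # same assumed mapping as the ERR sketch
        return gain / (1 + MAX_GAIN)

    def blended_ratio(r):                         # as in the Q-measure sketch
        top = ranked_gains[:r]
        return (sum(1 for g in top if g > 0) + sum(top)) / (r + sum(ideal_gains[:r]))

    ebr, p_continue = 0.0, 1.0
    for r, g in enumerate(ranked_gains, 1):
        s = prob_stop(g)
        ebr += p_continue * s * blended_ratio(r)
        p_continue *= 1.0 - s
    print(ebr)                                    # 3/4 * 4/6 + 1/16 * 6/10 = 0.5375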
26. RBP [Moffat+08] is an NCU
[Same ranked list, with normalised gains as the Utility: highly relevant = 3/3, partially relevant = 1/3.]
Stopping probability distribution: does not consider document relevance.
P_RBP(r) = p^(r-1) * (1-p), e.g. P_RBP(2) = p^(2-1) * (1-p), P_RBP(5) = p^(5-1) * (1-p).
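A sketch of RBP [Moffat+08] in the same stopping-probability × utility view, with the normalised gains shown above (3/3 and 1/3) as the utility; p is RBP’s persistence parameter:

    # RBP viewed as an NCU: the stopping probabilities ignore relevance,
    # P_RBP(r) = p^(r-1) * (1-p), and the utility is the normalised gain of the
    # document at the stopping rank (3/3 for highly rel, 1/3 for partially rel).

    def rbp(ranked_gains, p, max_gain=3):
        return sum((p ** (r - 1)) * (1 - p) * (g / max_gain)
                   for r, g in enumerate(ranked_gains, 1))

    print(rbp([0, 3, 0, 0, 1], p=0.85))           # the running example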
27. New measure: intentwise
Rank-Biased Utility (iRBU)
[Same ranked list as above.]
P_ERR(2) = 3/4, P_ERR(5) = 1/16 (ERR’s stopping probabilities)
Utility at the stopping rank r: p^r — it ignores relevance and models inverse effort rather than utility.
iRBU = 3/4 * p^2 + 1/16 * p^5
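And a sketch of iRBU on the running example: ERR’s stopping probabilities combined with p^r (using the same assumed stopping-probability mapping as before):

    # iRBU = sum over r of  P_ERR(r) * p^r : ERR's stopping probabilities, but the
    # "utility" p^r ignores the stopped-at document's relevance (inverse effort).

    ranked_gains = [0, 3, 0, 0, 1]
    MAX_GAIN = 3

    def prob_stop(gain):                          # same assumed mapping as before
        return gain / (1 + MAX_GAIN)

    def irbu(gains, p):
        score, p_continue = 0.0, 1.0
        for r, g in enumerate(gains, 1):
            s = prob_stop(g)
            score += p_continue * s * (p ** r)
            p_continue *= 1.0 - s
        return score

    print(irbu(ranked_gains, p=0.99))             # 3/4 * 0.99^2 + 1/16 * 0.99^5 = 0.795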
28. Summary of adhoc measures
(nDCG omitted)
Measure   Stopping probabilities   Utility
Q         Uniform                  BR
ERR       Diminishing return       RR
EBR       Diminishing return       BR
iRBU      Diminishing return       p^r
iRBU is a component of RBU [Amigo+18], a diversity measure.
30. Diversified search
• Given an ambiguous/underspecified query, produce a
single Search Engine Result Page that satisfies
different user intents!
• Challenge: balancing relevance and diversity
SERP (Search Engine Result Page)
[Diagram of a SERP with annotations: highly relevant docs near the top; cover many intents;
give more space to popular intents? give more space to informational intents?]
31. Two different approaches to
evaluating diversified search
• Intent-Aware measures [Agrawal+09]
(1) Compute a measure for each intent
(2) Combine the measures using intent probabilities as
weights
• D(#)-measures [Sakai+11SIGIR]
(1) Combine intentwise graded relevance with intent
probabilities to compute the gain of each document
(2) Construct an ideal list based on the gain, and then
compute a graded relevance measure based on it
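A rough sketch contrasting the two recipes, with nDCG as the underlying graded measure. The toy gains, intent probabilities, and the simplifying assumption that the ranked list contains all known relevant documents are illustrative; this shows the structure of the two approaches, not the official implementations:

    import math

    # Toy example: 3 ranked documents, 2 intents with probabilities 0.7 / 0.3.
    # gains[k][i] = gain of the document at rank k+1 for intent i.
    intent_prob = {"i1": 0.7, "i2": 0.3}
    gains = [{"i1": 3, "i2": 0},
             {"i1": 0, "i2": 1},
             {"i1": 1, "i2": 1}]

    def dcg(gs):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs, 1))

    def ndcg(run_gains, ideal_gains):
        return dcg(run_gains) / dcg(ideal_gains)

    # (1) Intent-Aware: compute nDCG per intent, then combine with intent probabilities.
    ndcg_ia = sum(p * ndcg([g[i] for g in gains],
                           sorted((g[i] for g in gains), reverse=True))
                  for i, p in intent_prob.items())

    # (2) D(#)-measures: collapse each document to a global gain
    #     GG = sum_i Pr(i) * gain_i, build the ideal list from the global gains,
    #     and compute one graded-relevance measure on top of that.
    global_gains = [sum(intent_prob[i] * g[i] for i in intent_prob) for g in gains]
    d_ndcg = ndcg(global_gains, sorted(global_gains, reverse=True))

    # D#-nDCG additionally mixes in intent recall (I-rec):
    #   D#-nDCG = gamma * I-rec + (1 - gamma) * D-nDCG.
    print(ndcg_ia, d_ndcg)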
40. Creating graded adhoc qrels
from the diversity qrels
Diversity qrels for a topic with two intents (intent probabilities 0.3 and 0.7), per-intent levels with gains in parentheses:
          Intent 1   Intent 2
doc 1:    L4 (15)    L3 (7)
doc 2:    L3 (7)     L0 (0)
doc 3:    L1 (1)     L1 (1)
Used for computing diversity measures for the INTENT-1 runs.

Graded adhoc qrels derived from the same data, with level = floor( log2( Σ per-intent levels + 1 ) ):
doc 1:    floor(log2( 4+3 +1)) = 3 ⇒ L3 (7)
doc 2:    floor(log2( 3+0 +1)) = 2 ⇒ L2 (3)
doc 3:    floor(log2( 1+1 +1)) = 1 ⇒ L1 (1)
Used for computing adhoc measures for the INTENT-1 runs.
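The mapping illustrated above, in a few lines of Python (adhoc level = floor(log2(sum of per-intent levels + 1)), gain of level Lx = 2^x − 1):

    import math

    def adhoc_level(per_intent_levels):
        """Collapse per-intent relevance levels (e.g. [4, 3]) into one adhoc level."""
        return int(math.log2(sum(per_intent_levels) + 1))   # floor via int()

    def gain(level):
        return 2 ** level - 1                    # L1 -> 1, L2 -> 3, L3 -> 7, L4 -> 15

    for levels in ([4, 3], [3, 0], [1, 1]):
        lv = adhoc_level(levels)
        print(levels, "=> L%d (%d)" % (lv, gain(lv)))
    # [4, 3] => L3 (7); [3, 0] => L2 (3); [1, 1] => L1 (1)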
41. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
42. Rank correlation
[Diagram: the system ranking by Measure X (System A, System B, …, System O) next to the
system ranking by Measure Y (System B, System A, …, System O) — some systems are swapped.]
Quantify consistency with Kendall’s tau rank correlation.
(This does not tell us which measure is better.)
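For example, with SciPy available, comparing two measures’ system rankings might look like this (the scores are made up):

    from scipy.stats import kendalltau

    # Mean scores of the same systems under two measures (toy numbers).
    systems = ["A", "B", "C", "D", "E"]
    measure_x = [0.41, 0.38, 0.35, 0.30, 0.22]
    measure_y = [0.52, 0.55, 0.40, 0.33, 0.29]   # A and B swapped in the ranking

    tau, _ = kendalltau(measure_x, measure_y)
    print(tau)   # 1.0 = identical rankings, -1.0 = reversed; one swap lowers tau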
43. Prec is very different
from other adhoc measures
(highest tau: .572)
50. Rank correlation summary
• Adhoc
- Prec is different (so is ERR)
- EBR, nDCG, Q, RBP, iRBU are similar
• Diversity
- I-rec is different (so is D-ERR)
- D-{nDCG, Q, RBP} are similar
- RBU, {EBR, ERR}-IA are similar
So which measures are “good”? We can’t tell.
51. Discriminative power
[Sakai06SIGIR, Sakai12WWW]
• Compute a Randomised Tukey HSD test p-value for
every system pair (15*14/2=105 pairs)
• Quantifies how often statistically significant results
can be obtained
Significance level: 5%
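A rough sketch of the randomisation behind this (the randomised Tukey HSD procedure as used in [Sakai12WWW]; tie handling and efficiency details are simplified). Discriminative power is then the number of system pairs with p < 0.05:

    import numpy as np

    def randomised_tukey_hsd_pvalues(scores, trials=10000, seed=0):
        """scores: topics x systems matrix of per-topic scores for one measure.
        Returns a systems x systems matrix of p-values, one per system pair."""
        rng = np.random.default_rng(seed)
        observed = scores.mean(axis=0)
        obs_diff = np.abs(observed[:, None] - observed[None, :])
        count = np.zeros_like(obs_diff)
        for _ in range(trials):
            # Under the null hypothesis the system labels are exchangeable,
            # so permute the scores within each topic and record the largest
            # difference between any two system means in the permuted data.
            permuted = np.array([rng.permutation(row) for row in scores])
            means = permuted.mean(axis=0)
            count += (means.max() - means.min()) >= obs_diff
        return count / trials

    # Usage (illustrative): pvals = randomised_tukey_hsd_pvalues(score_matrix)
    #                       discriminative_power = (pvals < 0.05).sum() // 2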
54. Discriminative power summary
• Adhoc
RBP, nDCG, iRBU, Q (39-42 sig. differences)
> EBR >> ERR >> Prec (only 14 sig. differences)
• Diversity
D#-measures >> IA measures
e.g. D#-EBR (48 sig. differences)
>> EBR-IA (only 30 sig. differences)
RBU is also discriminative
But discriminative does not necessarily mean correct
55. Unanimity (1) [Amigo+18]
[Table: the preferences (>, =, <) of N = 3 measures M1, M2, M3 over three topic-SERP1-SERP2 triplets A, B, C.]
A high-unanimity measure = one that agrees with many other measures re: SERP preferences.
60. Unanimity summary
• RBP, RBP-IA, D#-RBP (all with p=0.99) have high
unanimity scores.
• RBU’s scores are unremarkable.
• Unanimity results depend heavily on the set of
measures considered.
• Is a measure that agrees with many other measures
“good”?
• Is a person that agrees with many other people
“good”?
61. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
62. Offline results don’t really tell us
which measures are “good”
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
Kendall’s tau for Measure i =
(#agreements_w_gold - #disagreements_w_gold) / #pairs
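In code, this agreement statistic might look as follows (a sketch; how ties ('=') are counted is a design choice the slide does not spell out):

    def preference_tau(gold, measure):
        """gold, measure: one preference per SERP pair, each '>', '<' or '='."""
        agree = sum(1 for g, m in zip(gold, measure) if g == m and g != "=")
        disagree = sum(1 for g, m in zip(gold, measure)
                       if g != m and "=" not in (g, m))
        return (agree - disagree) / len(gold)

    print(preference_tau([">", ">", "<"], [">", "<", "<"]))   # (2 - 1) / 3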
63. SERP pairs judged
• 100 topics * 105 system pairs
= 10,500 topic-SERP-SERP triplets
⇒ filter by ΔPrec>0.1 AND ΔI-rec>0.1 (SERP pairs
should be different): 1,258 triplets
⇒ removed incomplete SERPs and problematic
HTML files: 1,127 triplets were used.
65. Judges
• 15 Japanese-course computer science students at
Waseda University
• Presentation order of triplets was randomized
• Presentation order of the relevance and diversity questions was also randomized
• The instructions said judges were expected to process one triplet in 3 minutes on average; the actual time spent was 50.1 seconds.
66. A note on [Sanderson+10]
• They also collected SERP preference data, but a
TREC web subtopic (not the entire topic) was given
to each judge.
• The subtopic-based preferences were aggregated
to form the gold preference for the entire topic.
• Hence their preference data is not about which
SERP is more diversified.
• Our diversity question “Which SERP is more likely
to satisfy a higher number of users?” is new!
67. Judgement reliability
Relevance preferences: a matrix of 15 judges × 1,127 SERP pairs, where each cell is LEFT, irrelevant, or RIGHT.
Krippendorff’s alpha = 0.406.
Diversity preferences: a matrix of 15 judges × 1,127 SERP pairs, where each cell is LEFT, irrelevant, or RIGHT.
Krippendorff’s alpha = 0.356.
Substantial agreement beyond chance.
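For reference, nominal-scale Krippendorff’s alpha on such a judges × SERP-pairs matrix can be computed with the third-party krippendorff Python package; the label encoding below and the exact API are assumptions, so check the package documentation:

    import numpy as np
    import krippendorff          # third-party package: pip install krippendorff

    # Toy matrix: rows = judges, columns = SERP pairs.
    # Illustrative encoding: 0 = LEFT, 1 = irrelevant, 2 = RIGHT.
    judgements = np.array([[0, 2, 2, 1, 0],
                           [0, 2, 1, 1, 0],
                           [2, 2, 2, 1, 0]])

    alpha = krippendorff.alpha(reliability_data=judgements,
                               level_of_measurement="nominal")
    print(alpha)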
68. Did every judge do the job
properly? YES
LEAVE-ONE-OUT ASSESSOR check: recompute alpha on the remaining 14 judges × 1,127 SERP pairs after removing one judge.
Alpha does not change much even if we remove Judge 01.
69. Full gold data
Relevance preferences (15 judges × 1,127 SERP pairs, Krippendorff’s alpha = 0.406):
final label by majority vote ⇒ full gold data: 1,115 SERP pairs.
Diversity preferences (15 judges × 1,127 SERP pairs, Krippendorff’s alpha = 0.356):
final label by majority vote ⇒ full gold data: 1,119 SERP pairs.
70. High agreement gold data
Keep only the SERP pairs (columns) where at least 9 of the 15 judges agreed; final label by majority vote.
Relevance preferences: high-agreement gold data = 894 SERP pairs; Krippendorff’s alpha = 0.406 → 0.518.
Diversity preferences: high-agreement gold data = 897 SERP pairs; Krippendorff’s alpha = 0.356 → 0.453.
71. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
73. Agreement with gold relevance (2)
iRBU (p=0.99) and nDCG perform best;
RBP (p=0.99, 0.85), Q, and EBR also do well.
74. Agreement with gold relevance (3)
Prec does not perform well
⇒ users care about ranking;
ERR performs worst
⇒ users care about the number of relevant docs.
75. iRBU (p=0.99) performs surprisingly well!
Suggests that it may be good to ignore relevance in the Utility function!
[Same ranked list as before.]
P_ERR(2) = 3/4, P_ERR(5) = 1/16
0.99^2 = 0.980, 0.99^5 = 0.951
iRBU = 3/4 * 0.99^2 + 1/16 * 0.99^5 = 0.795
(The p^r term ignores relevance; it models inverse effort rather than utility.)
77. Agreement with gold diversity (2)
D#-{nDCG, RBP (p=0.85)} perform best;
Other D#-measures and RBU also do well.
But since D#-{nDCG, RBP (p=0.85)} outperform RBU
even though they satisfy fewer axioms [Amigo+18],
measures that satisfy many axioms are not necessarily
the ones that align most with user perception.
78. Agreement with gold diversity (3)
D#-measures >> IA measures.
Use D# for diversity evaluation!
79. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
80. Conclusions: agreement with user preferences
• Adhoc measures (relevance preferences):
- iRBU (p=0.99) and nDCG perform best;
- RBP (p=0.99, 0.85), Q, and EBR also do well.
• Diversity measures (diversity preferences):
- D#-{nDCG, RBP (p=0.85)} perform best;
- Other D#-measures and RBU also do well.
- D# >> IA-measures.
- D#-{nDCG, RBP (p=0.85)} > RBU. Satisfying more
axioms may not mean better alignment with users.
Also, D#-measures are more discriminative than IA measures!
81. Future work: close the gap
The present study: those who judged the documents
≠ those who performed SERP preference judgements.
Future work: let the same group of people perform both
doc preference AND SERP preference judgements
82. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
83. Giving it all away at
http://waseda.box.com/SIGIR2019PACK
(a) topic-by-run matrices for all 30 evaluation
measures
(b) 1,127 topic-SERP-SERP triplets from INTENT-1
(c) two 15×1,127 user preference matrices
You can evaluate YOUR favourite adhoc and diversity
measures using our data with the NTCIR-9 INTENT-1
test collection + runs!
http://research.nii.ac.jp/ntcir/data/data-en.html
84. References (1)
[Agrawal+09] Diversifying Search Results. WSDM 2009.
[Amigo+18] An Axiomatic Analysis of Diversity Evaluation
Metrics: Introducing the Rank-Biased Utility Metric.
SIGIR 2018.
[Chapelle+09] Expected Reciprocal Rank for Graded
Relevance. CIKM 2009.
[Moffat+08] Rank-Biased Precision for Measurement of
Retrieval Effectiveness. ACM TOIS 27, 1.
[Sakai06SIGIR] Evaluating Evaluation Metrics based on
the Bootstrap, SIGIR 2006.
[Sakai+08EVIA] Modelling A User Population for
Designing Information Retrieval Metrics. EVIA 2008.
85. References (2)
[Sakai+11SIGIR] Evaluating Diversified Search Results
Using Per-Intent Graded Relevance. SIGIR 2011.
[Sakai12WWW] Evaluation with Informational and
Navigational Intents, WWW 2012.
[Sakai+13IRJ] Diversified Search Evaluation: Lessons from
the NTCIR-9 INTENT Task. Information Retrieval 16, 4.
[Sanderson+10] Do user preferences and evaluation
measures line up?. SIGIR 2010.
[Song+11] Overview of the NTCIR-9 INTENT Task. NTCIR-9.
[Zhai+03] Beyond Independent Relevance: Methods and
Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.