This slide deck discusses how to evaluate evaluation measures for diversified search results. It introduces several existing measures for ad-hoc and diversified retrieval and proposes some new ones, describes an offline comparison of these measures using data from a past NTCIR diversified-search evaluation (NTCIR-9 INTENT-1), and then asks which measures best align with users' preferences over search engine result pages (SERPs), as determined by collecting direct preference judgements from users.
1. Which Diversity Evaluation Measures Are “Good”?
Preprint: http://waseda.box.com/sigir2019
This slide deck: http://www.slideshare.net/tetsuyasakai/sigir2019
Tetsuya Sakai and Zhaohao Zeng
Waseda University, Japan
tetsuyasakai@acm.org
zhaohao@fuji.waseda.jp
24th July @ SIGIR 2019, Paris.
2. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
3. System improvements that don’t
matter to the user
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
three systems with different scores but the same perceived quality.]
4. Measures that can’t detect
differences that matter to the user
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
three systems with the same score but different perceived quality.]
5. We need good evaluation measures
[Scatter plot: evaluation measure score (x-axis) vs. user’s perception of SERP quality (y-axis);
the systems lie on an increasing line, so scores align with perception.]
“Good” = aligns well with user perception.
6. Which of them are “good”?
Adhoc IR
• AP
• nDCG
• ERR
• RBP
• Q-measure
• TBG
• U-measure …
Diversified IR
• α-nDCG
• Intent-Aware (IA)
measures
• D(#)-measures
• RBU …
Diversity measures tend to be complex.
7. Measures derived from axioms (1)
An axiom from [Amigo+18]:
[Diagram: a SERP that already contains more relevant documents for intent i than for intent i’
(#reldocs for intent i > #reldocs for intent i’), plus two candidate new reldocs: one for i, one for i’.]
Given the above SERP, which of the two new reldocs should be appended to it?
8. Measures derived from axioms (2)
An axiom from [Amigo+18]:
[Same diagram as before.]
Answer: the reldoc for intent i’, because the SERP will then be more balanced across intents.
Axioms are useful for understanding the properties of measures. But…
9. Measures derived from axioms (3)
An axiom from [Amigo+18]:
Assumptions:
(1) binary intentwise relevance
(2) flat intent probabilities
(3) a document is never relevant to multiple intents
[Diagram: a SERP and a reldoc for intent i’.]
How practical are these assumptions?
10. How much do axioms matter to
real users?
For designing a “good” measure,
• Is each axiom necessary/practical?
• Is the set of axioms sufficient?
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
11. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
14. Normalised Cumulative Utility (2)
[Diagram: a ranked list (r = 1, 2, 3, …) with a stopping probability attached to each rank r;
e.g. some users abandon the list at r = 1, others at r = 3.]
15. Normalised Cumulative Utility (3)
[Diagram: for the user group that stops at rank r, measure the utility of the document at r,
or of all the documents ranked 1..r — the “utility at r”.]
NCU is “expected utility”: the sum over ranks of (stopping probability at r) × (utility at r).
16. Q-measure is an NCU (1)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Diagram: the ranked list — r1 nonrelevant, r2 highly relevant (gain 3), r3 nonrelevant, r4 nonrelevant,
r5 partially relevant (gain 1) — plus one partially relevant doc (gain 1) that is not retrieved.
Stopping probability distribution: uniform over the relevant docs, i.e. 33% of users at each of the three.]
17. Q-measure is an NCU (2)
• Suppose R=3 relevant (1 highly rel, 2 partially rel)
docs are known.
[Same ranked list as above.]
BR(2) = 4/6, BR(5) = 6/10
Q = ( BR(2) + BR(5) + 0 ) / 3 = 0.422
Q generalizes AP by using the Blended Ratio instead of Prec as the Utility.
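For concreteness, a short Python sketch that reproduces the arithmetic above (Blended Ratio with beta = 1, uniform stopping probabilities over the R = 3 known relevant docs; the list and gain values are the ones in the example):

    # Q-measure for the example above.
    # Known relevant docs: gains 3, 1, 1 (R = 3); the system ranks them as
    # [0, 3, 0, 0, 1] (gain at each rank), and one gain-1 doc is not retrieved.
    # Blended Ratio at rank r (beta = 1): BR(r) = (C(r) + cg(r)) / (r + cg*(r)).

    ranked_gains = [0, 3, 0, 0, 1]
    ideal_gains = [3, 1, 1]                       # all known relevant docs, best first
    R = len(ideal_gains)

    def blended_ratio(r):
        top = ranked_gains[:r]
        count_rel = sum(1 for g in top if g > 0)  # C(r): relevant docs in top r
        cg = sum(top)                             # cumulative gain at r
        cg_star = sum(ideal_gains[:r])            # ideal cumulative gain at r
        return (count_rel + cg) / (r + cg_star)

    # Uniform stopping probabilities (1/R) over the relevant docs;
    # the unretrieved relevant doc contributes 0.
    q = sum(blended_ratio(r) for r, g in enumerate(ranked_gains, 1) if g > 0) / R
    print(q)                                      # (4/6 + 6/10 + 0) / 3 = 0.422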
24. ERR is an NCU (2)
[Same ranked list as above.]
RR(2) = 1/2, RR(5) = 1/5
P_ERR(2) = 3/4, P_ERR(5) = 1/16
ERR = 3/4 * 1/2 + 1/16 * 1/5 = 0.388
Only the final doc (the one the user stops at) is considered useful (binary relevance).
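And the corresponding sketch for ERR [Chapelle+09] on the same list. The per-document stopping parameter gain/(1 + max gain) is an assumption inferred from the slide’s numbers (it gives 3/4 for the highly relevant doc and 1/4 for the partially relevant one), not something stated on the slide:

    # ERR for the same ranked list. The user scans top-down, stopping at a
    # relevant document with probability prob_stop(gain); only the stopping
    # rank r is credited, with utility RR(r) = 1/r (binary relevance).

    ranked_gains = [0, 3, 0, 0, 1]
    MAX_GAIN = 3

    def prob_stop(gain):
        # Assumed mapping: gives 3/4 for the highly relevant doc and 1/4 for
        # the partially relevant one, matching the numbers on the slide.
        return gain / (1 + MAX_GAIN)

    err, p_continue = 0.0, 1.0
    for r, g in enumerate(ranked_gains, 1):
        s = prob_stop(g)
        err += p_continue * s * (1.0 / r)        # P_ERR(r) * RR(r)
        p_continue *= 1.0 - s
    print(err)                                   # 3/4 * 1/2 + 1/16 * 1/5 = 0.3875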
25. New measure:
EBR (Expected Blended Ratio)
Measure   Stopping probabilities   Utility
Q         Uniform                  BR
ERR       Diminishing return       RR
EBR       Diminishing return       BR
(With RR as the Utility, only the final doc is considered useful — binary relevance.)
EBR utilises graded relevance for both the stopping probabilities and the Utility.
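A sketch of EBR as implied by the table: ERR-style (diminishing-return) stopping probabilities combined with the Blended Ratio as the Utility. The stopping-probability mapping is the same assumption as in the ERR sketch:

    # EBR (Expected Blended Ratio) as implied by the table:
    # diminishing-return stopping probabilities (as in ERR) + Blended Ratio utility,
    # so graded relevance is used both for stopping and for utility.

    ranked_gains = [0, 3, 0, 0, 1]
    ideal_gains = [3, 1, 1]
    MAX_GAIN = 3

    def prob_stop(gain):                          # same assumed mapping as the ERR sketch
        return gain / (1 + MAX_GAIN)

    def blended_ratio(r):                         # as in the Q-measure sketch
        top = ranked_gains[:r]
        return (sum(1 for g in top if g > 0) + sum(top)) / (r + sum(ideal_gains[:r]))

    ebr, p_continue = 0.0, 1.0
    for r, g in enumerate(ranked_gains, 1):
        s = prob_stop(g)
        ebr += p_continue * s * blended_ratio(r)
        p_continue *= 1.0 - s
    print(ebr)                                    # 3/4 * 4/6 + 1/16 * 6/10 = 0.5375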
26. RBP [Moffat+08] is an NCU
[Same ranked list, with normalised gains as the Utility: highly relevant = 3/3, partially relevant = 1/3.]
Stopping probability distribution: does not consider document relevance.
P_RBP(r) = p^(r-1) * (1-p), e.g. P_RBP(2) = p^(2-1) * (1-p), P_RBP(5) = p^(5-1) * (1-p).
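A sketch of RBP [Moffat+08] in the same stopping-probability × utility view, with the normalised gains shown above (3/3 and 1/3) as the utility; p is RBP’s persistence parameter:

    # RBP viewed as an NCU: the stopping probabilities ignore relevance,
    # P_RBP(r) = p^(r-1) * (1-p), and the utility is the normalised gain of the
    # document at the stopping rank (3/3 for highly rel, 1/3 for partially rel).

    def rbp(ranked_gains, p, max_gain=3):
        return sum((p ** (r - 1)) * (1 - p) * (g / max_gain)
                   for r, g in enumerate(ranked_gains, 1))

    print(rbp([0, 3, 0, 0, 1], p=0.85))           # the running example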
27. New measure: intentwise
Rank-Biased Utility (iRBU)
[Same ranked list as above.]
P_ERR(2) = 3/4, P_ERR(5) = 1/16 (ERR’s stopping probabilities)
Utility at the stopping rank r: p^r — it ignores relevance and models inverse effort rather than utility.
iRBU = 3/4 * p^2 + 1/16 * p^5
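And a sketch of iRBU on the running example: ERR’s stopping probabilities combined with p^r (using the same assumed stopping-probability mapping as before):

    # iRBU = sum over r of  P_ERR(r) * p^r : ERR's stopping probabilities, but the
    # "utility" p^r ignores the stopped-at document's relevance (inverse effort).

    ranked_gains = [0, 3, 0, 0, 1]
    MAX_GAIN = 3

    def prob_stop(gain):                          # same assumed mapping as before
        return gain / (1 + MAX_GAIN)

    def irbu(gains, p):
        score, p_continue = 0.0, 1.0
        for r, g in enumerate(gains, 1):
            s = prob_stop(g)
            score += p_continue * s * (p ** r)
            p_continue *= 1.0 - s
        return score

    print(irbu(ranked_gains, p=0.99))             # 3/4 * 0.99^2 + 1/16 * 0.99^5 = 0.795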
28. Summary of adhoc measures
(nDCG omitted)
Measure   Stopping probabilities   Utility
Q         Uniform                  BR
ERR       Diminishing return       RR
EBR       Diminishing return       BR
iRBU      Diminishing return       p^r
iRBU is a component of RBU [Amigo+18], a diversity measure.
30. Diversified search
• Given an ambiguous/underspecified query, produce a
single Search Engine Result Page that satisfies
different user intents!
• Challenge: balancing relevance and diversity
SERP (Search Engine Result Page)
[Diagram of a SERP with annotations: highly relevant docs near the top; cover many intents;
give more space to popular intents? give more space to informational intents?]
31. Two different approaches to
evaluating diversified search
• Intent-Aware measures [Agrawal+09]
(1) Compute a measure for each intent
(2) Combine the measures using intent probabilities as
weights
• D(#)-measures [Sakai+11SIGIR]
(1) Combine intentwise graded relevance with intent
probabilities to compute the gain of each document
(2) Construct an ideal list based on the gain, and then
compute a graded relevance measure based on it
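A rough sketch contrasting the two recipes, with nDCG as the underlying graded measure. The toy gains, intent probabilities, and the simplifying assumption that the ranked list contains all known relevant documents are illustrative; this shows the structure of the two approaches, not the official implementations:

    import math

    # Toy example: 3 ranked documents, 2 intents with probabilities 0.7 / 0.3.
    # gains[k][i] = gain of the document at rank k+1 for intent i.
    intent_prob = {"i1": 0.7, "i2": 0.3}
    gains = [{"i1": 3, "i2": 0},
             {"i1": 0, "i2": 1},
             {"i1": 1, "i2": 1}]

    def dcg(gs):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs, 1))

    def ndcg(run_gains, ideal_gains):
        return dcg(run_gains) / dcg(ideal_gains)

    # (1) Intent-Aware: compute nDCG per intent, then combine with intent probabilities.
    ndcg_ia = sum(p * ndcg([g[i] for g in gains],
                           sorted((g[i] for g in gains), reverse=True))
                  for i, p in intent_prob.items())

    # (2) D(#)-measures: collapse each document to a global gain
    #     GG = sum_i Pr(i) * gain_i, build the ideal list from the global gains,
    #     and compute one graded-relevance measure on top of that.
    global_gains = [sum(intent_prob[i] * g[i] for i in intent_prob) for g in gains]
    d_ndcg = ndcg(global_gains, sorted(global_gains, reverse=True))

    # D#-nDCG additionally mixes in intent recall (I-rec):
    #   D#-nDCG = gamma * I-rec + (1 - gamma) * D-nDCG.
    print(ndcg_ia, d_ndcg)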
40. Creating graded adhoc qrels
from the diversity qrels
Diversity qrels for a topic with two intents (intent probabilities 0.3 and 0.7), per-intent levels with gains in parentheses:
          Intent 1   Intent 2
doc 1:    L4 (15)    L3 (7)
doc 2:    L3 (7)     L0 (0)
doc 3:    L1 (1)     L1 (1)
Used for computing diversity measures for the INTENT-1 runs.

Graded adhoc qrels derived from the same data, with level = floor( log2( Σ per-intent levels + 1 ) ):
doc 1:    floor(log2( 4+3 +1)) = 3 ⇒ L3 (7)
doc 2:    floor(log2( 3+0 +1)) = 2 ⇒ L2 (3)
doc 3:    floor(log2( 1+1 +1)) = 1 ⇒ L1 (1)
Used for computing adhoc measures for the INTENT-1 runs.
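The mapping illustrated above, in a few lines of Python (adhoc level = floor(log2(sum of per-intent levels + 1)), gain of level Lx = 2^x − 1):

    import math

    def adhoc_level(per_intent_levels):
        """Collapse per-intent relevance levels (e.g. [4, 3]) into one adhoc level."""
        return int(math.log2(sum(per_intent_levels) + 1))   # floor via int()

    def gain(level):
        return 2 ** level - 1                    # L1 -> 1, L2 -> 3, L3 -> 7, L4 -> 15

    for levels in ([4, 3], [3, 0], [1, 1]):
        lv = adhoc_level(levels)
        print(levels, "=> L%d (%d)" % (lv, gain(lv)))
    # [4, 3] => L3 (7); [3, 0] => L2 (3); [1, 1] => L1 (1)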
41. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
42. Rank correlation
[Diagram: the system ranking by Measure X (System A, System B, …, System O) next to the
system ranking by Measure Y (System B, System A, …, System O) — some systems are swapped.]
Quantify consistency with Kendall’s tau rank correlation.
(This does not tell us which measure is better.)
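For example, with SciPy available, comparing two measures’ system rankings might look like this (the scores are made up):

    from scipy.stats import kendalltau

    # Mean scores of the same systems under two measures (toy numbers).
    systems = ["A", "B", "C", "D", "E"]
    measure_x = [0.41, 0.38, 0.35, 0.30, 0.22]
    measure_y = [0.52, 0.55, 0.40, 0.33, 0.29]   # A and B swapped in the ranking

    tau, _ = kendalltau(measure_x, measure_y)
    print(tau)   # 1.0 = identical rankings, -1.0 = reversed; one swap lowers tau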
43. Prec is very different
from other adhoc measures
(highest tau: .572)
50. Rank correlation summary
• Adhoc
- Prec is different (so is ERR)
- EBR, nDCG, Q, RBP, iRBU are similar
• Diversity
- I-rec is different (so is D-ERR)
- D-{nDCG, Q, RBP} are similar
- RBU, {EBR, ERR}-IA are similar
So which measures are “good”? We can’t tell.
51. Discriminative power
[Sakai06SIGIR, Sakai12WWW]
• Compute a Randomised Tukey HSD test p-value for
every system pair (15*14/2=105 pairs)
• Quantifies how often statistically significant results
can be obtained
Significance level: 5%
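A rough sketch of the randomisation behind this (the randomised Tukey HSD procedure as used in [Sakai12WWW]; tie handling and efficiency details are simplified). Discriminative power is then the number of system pairs with p < 0.05:

    import numpy as np

    def randomised_tukey_hsd_pvalues(scores, trials=10000, seed=0):
        """scores: topics x systems matrix of per-topic scores for one measure.
        Returns a systems x systems matrix of p-values, one per system pair."""
        rng = np.random.default_rng(seed)
        observed = scores.mean(axis=0)
        obs_diff = np.abs(observed[:, None] - observed[None, :])
        count = np.zeros_like(obs_diff)
        for _ in range(trials):
            # Under the null hypothesis the system labels are exchangeable,
            # so permute the scores within each topic and record the largest
            # difference between any two system means in the permuted data.
            permuted = np.array([rng.permutation(row) for row in scores])
            means = permuted.mean(axis=0)
            count += (means.max() - means.min()) >= obs_diff
        return count / trials

    # Usage (illustrative): pvals = randomised_tukey_hsd_pvalues(score_matrix)
    #                       discriminative_power = (pvals < 0.05).sum() // 2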
54. Discriminative power summary
• Adhoc
RBP, nDCG, iRBU, Q (39-42 sig. differences)
> EBR >> ERR >> Prec (only 14 sig. differences)
• Diversity
D#-measures >> IA measures
e.g. D#-EBR (48 sig. differences)
>> EBR-IA (only 30 sig. differences)
RBU is also discriminative
But discriminative does not necessarily mean correct
55. Unanimity (1) [Amigo+18]
[Table: the preferences (>, =, <) of N = 3 measures M1, M2, M3 over three topic-SERP1-SERP2 triplets A, B, C.]
A high-unanimity measure = one that agrees with many other measures re: SERP preferences.
60. Unanimity summary
• RBP, RBP-IA, D#-RBP (all with p=0.99) have high
unanimity scores.
• RBU’s scores are unremarkable.
• Unanimity results depend heavily on the set of
measures considered.
• Is a measure that agrees with many other measures
“good”?
• Is a person that agrees with many other people
“good”?
61. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
62. Offline results don’t really tell us
which measures are “good”
Let’s just collect lots of user SERP preferences
and see which measures actually align with them!
SERP A SERP B
users (gold): >
Measure 1: >
Measure 2: <
Kendall’s tau for Measure i =
(#agreements_w_gold - #disagreements_w_gold) / #pairs
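In code, this agreement statistic might look as follows (a sketch; how ties ('=') are counted is a design choice the slide does not spell out):

    def preference_tau(gold, measure):
        """gold, measure: one preference per SERP pair, each '>', '<' or '='."""
        agree = sum(1 for g, m in zip(gold, measure) if g == m and g != "=")
        disagree = sum(1 for g, m in zip(gold, measure)
                       if g != m and "=" not in (g, m))
        return (agree - disagree) / len(gold)

    print(preference_tau([">", ">", "<"], [">", "<", "<"]))   # (2 - 1) / 3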
63. SERP pairs judged
• 100 topics * 105 system pairs
= 10,500 topic-SERP-SERP triplets
⇒ filter by ΔPrec>0.1 AND ΔI-rec>0.1 (SERP pairs
should be different): 1,258 triplets
⇒ removed incomplete SERPs and problematic
HTML files: 1,127 triplets were used.
65. Judges
• 15 Japanese-course computer science students at
Waseda University
• Presentation order of triplets was randomized
• Presentation order of the relevance and diversity questions was also randomized
• The instructions said judges were expected to process one triplet in 3 minutes on average; the actual time spent was 50.1 seconds.
66. A note on [Sanderson+10]
• They also collected SERP preference data, but a
TREC web subtopic (not the entire topic) was given
to each judge.
• The subtopic-based preferences were aggregated
to form the gold preference for the entire topic.
• Hence their preference data is not about which
SERP is more diversified.
• Our diversity question “Which SERP is more likely
to satisfy a higher number of users?” is new!
67. Judgement reliability
Relevance preferences: a matrix of 15 judges × 1,127 SERP pairs, where each cell is LEFT, irrelevant, or RIGHT.
Krippendorff’s alpha = 0.406.
Diversity preferences: a matrix of 15 judges × 1,127 SERP pairs, where each cell is LEFT, irrelevant, or RIGHT.
Krippendorff’s alpha = 0.356.
Substantial agreement beyond chance.
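For reference, nominal-scale Krippendorff’s alpha on such a judges × SERP-pairs matrix can be computed with the third-party krippendorff Python package; the label encoding below and the exact API are assumptions, so check the package documentation:

    import numpy as np
    import krippendorff          # third-party package: pip install krippendorff

    # Toy matrix: rows = judges, columns = SERP pairs.
    # Illustrative encoding: 0 = LEFT, 1 = irrelevant, 2 = RIGHT.
    judgements = np.array([[0, 2, 2, 1, 0],
                           [0, 2, 1, 1, 0],
                           [2, 2, 2, 1, 0]])

    alpha = krippendorff.alpha(reliability_data=judgements,
                               level_of_measurement="nominal")
    print(alpha)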
68. Did every judge do the job
properly? YES
LEAVE-ONE-OUT ASSESSOR check: recompute alpha on the remaining 14 judges × 1,127 SERP pairs after removing one judge.
Alpha does not change much even if we remove Judge 01.
69. Full gold data
Relevance preferences (15 judges × 1,127 SERP pairs, Krippendorff’s alpha = 0.406):
final label by majority vote ⇒ full gold data: 1,115 SERP pairs.
Diversity preferences (15 judges × 1,127 SERP pairs, Krippendorff’s alpha = 0.356):
final label by majority vote ⇒ full gold data: 1,119 SERP pairs.
70. High agreement gold data
Keep only the SERP pairs (columns) where at least 9 of the 15 judges agreed; final label by majority vote.
Relevance preferences: high-agreement gold data = 894 SERP pairs; Krippendorff’s alpha = 0.406 → 0.518.
Diversity preferences: high-agreement gold data = 897 SERP pairs; Krippendorff’s alpha = 0.356 → 0.453.
71. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
73. Agreement with gold relevance (2)
iRBU (p=0.99) and nDCG perform best;
RBP (p=0.99, 0.85), Q, and EBR also do well.
74. Agreement with gold relevance (3)
Prec does not perform well
⇒ users care about ranking;
ERR performs worst
⇒ users care about the number of relevant docs.
75. iRBU (p=0.99) performs surprisingly well!
Suggests that it may be good to ignore relevance in the Utility function!
[Same ranked list as before.]
P_ERR(2) = 3/4, P_ERR(5) = 1/16
0.99^2 = 0.980, 0.99^5 = 0.951
iRBU = 3/4 * 0.99^2 + 1/16 * 0.99^5 = 0.795
(The p^r term ignores relevance; it models inverse effort rather than utility.)
77. Agreement with gold diversity (2)
D#-{nDCG, RBP (p=0.85)} perform best;
Other D#-measures and RBU also do well.
But since D#-{nDCG, RBP (p=0.85)} outperform RBU
even though they satisfy fewer axioms [Amigo+18],
measures that satisfy many axioms are not necessarily
the ones that align most with user perception.
78. Agreement with gold diversity (3)
D#-measures >> IA measures.
Use D# for diversity evaluation!
79. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
80. Conclusions: agreement with user preferences
• Adhoc measures (relevance preferences):
- iRBU (p=0.99) and nDCG perform best;
- RBP (p=0.99, 0.85), Q, and EBR also do well.
• Diversity measures (diversity preferences):
- D#-{nDCG, RBP (p=0.85)} perform best;
- Other D#-measures and RBU also do well.
- D# >> IA-measures.
- D#-{nDCG, RBP (p=0.85)} > RBU. Satisfying more
axioms may not mean better alignment with users.
Also, D#-measures are more discriminative than IA measures!
81. Future work: close the gap
The present study: those who judged the documents
≠ those who performed SERP preference judgements.
Future work: let the same group of people perform both
doc preference AND SERP preference judgements
82. TALK OUTLINE
1. Motivation
2. Adhoc and Diversity measures
3. Test collection and runs
4. Offline comparisons
5. Collecting users’ SERP preferences
6. Main results
7. Conclusions and future work
8. Resources
83. Giving it all away at
http://waseda.box.com/SIGIR2019PACK
(a) topic-by-run matrices for all 30 evaluation
measures
(b) 1,127 topic-SERP-SERP triplets from INTENT-1
(c) two 15×1,127 user preference matrices
You can evaluate YOUR favourite adhoc and diversity
measures using our data with the NTCIR-9 INTENT-1
test collection + runs!
http://research.nii.ac.jp/ntcir/data/data-en.html
84. References (1)
[Agrawal+09] Diversifying Search Results. WSDM 2009.
[Amigo+18] An Axiomatic Analysis of Diversity Evaluation
Metrics: Introducing the Rank-Biased Utility Metric.
SIGIR 2018.
[Chapelle+09] Expected Reciprocal Rank for Graded
Relevance. CIKM 2009.
[Moffat+08] Rank-Biased Precision for Measurement of
Retrieval Effectiveness. ACM TOIS 27, 1.
[Sakai06SIGIR] Evaluating Evaluation Metrics based on
the Bootstrap, SIGIR 2006.
[Sakai+08EVIA] Modelling A User Population for
Designing Information Retrieval Metrics. EVIA 2008.
85. References (2)
[Sakai+11SIGIR] Evaluating Diversified Search Results
Using Per-Intent Graded Relevance. SIGIR 2011.
[Sakai12WWW] Evaluation with Informational and
Navigational Intents, WWW 2012.
[Sakai+13IRJ] Diversified Search Evaluation: Lessons from
the NTCIR-9 INTENT Task. Information Retrieval 16, 4.
[Sanderson+10] Do user preferences and evaluation
measures line up?. SIGIR 2010.
[Song+11] Overview of the NTCIR-9 INTENT Task. NTCIR-9.
[Zhai+03] Beyond Independent Relevance: Methods and
Evaluation Metrics for Subtopic Retrieval. SIGIR 2003.