1. The Effect of Score Standardisation on Topic Set Size Design
@tetsuyasakai
Waseda University, Japan
http://www.f.waseda.jp/tetsuya/sakai.html
November 30, 2016 @ AIRS 2016, Beijing.
3. Hard topics, easy topics
[Figure: per-topic scores of Systems 1-5 on two topics (y-axis: 0 to 1). Topic 1 mean = 0.12 (a hard topic); Topic 2 mean = 0.70 (an easy topic).]
4. Low-variance topics, high-variance topics
[Figure: per-topic scores of Systems 1-5 on two topics (y-axis: 0 to 1). Topic 1 standard deviation = 0.08; Topic 2 standard deviation = 0.29.]
5. Score standardisation [Webber+08]
Given a raw topic-by-run score matrix (Topics x Systems), the standardised score for the i-th system on the j-th topic is obtained by subtracting the topic's mean and dividing by the topic's standard deviation:
z_ij = (raw_ij - mean_j) / sd_j
That is: how good is system i compared to the "average" system, in standard deviation units? The per-topic means and standard deviations are the standardising factors.
6. Now for every topic, mean = 0, variance = 1.
[Figure: standardised scores of Systems 1-5 on Topics 1 and 2 (y-axis: -2 to 2).]
Comparisons across different topic sets and test collections are possible!
7. Standardised scores have the (-inf, +inf) range and are not very convenient.
[Figure: standardised scores of Systems 1-5 on Topics 1 and 2 (y-axis: -2 to 2).]
Transform them back into the [0,1] range!
8. std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]
[Figure: std-CDF nDCG (y-axis) vs. raw nDCG (x-axis) on TREC04. Each curve is a topic, with 110 runs represented as dots.]
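The std-CDF mapping above can be sketched in a few lines. This is a minimal illustration, not the original tool; the topic-by-run matrix below is invented, not TREC04 data.

```python
import numpy as np
from scipy.stats import norm

# Illustrative topic-by-run matrix (rows = topics, columns = systems);
# the values are made up for this sketch.
raw = np.array([
    [0.10, 0.05, 0.20, 0.15, 0.10],   # a hard topic
    [0.60, 0.75, 0.80, 0.65, 0.70],   # an easy topic
])

# Standardise per topic: subtract the topic mean, divide by the topic SD.
z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, ddof=1, keepdims=True)

# std-CDF [Webber+08]: map z through the standard normal CDF, back into [0, 1].
std_cdf = norm.cdf(z)
print(std_cdf.round(3))
```

Because the CDF is steepest around z = 0, this mapping stretches out the moderately high and moderately low performers, which is exactly the behaviour questioned on the next slide.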
9. std-CDF: emphasises moderately high and moderately low performers - is this a good thing?
[Figure: std-CDF nDCG (y-axis) vs. raw nDCG (x-axis) on TREC04, with the moderately high and moderately low regions highlighted.]
10. std-AB: how about a simple linear transformation? [Sakai16ICTIR]
[Figure: std-CDF nDCG, std-AB nDCG (A=0.10), and std-AB nDCG (A=0.15) (y-axis) vs. raw nDCG (x-axis) on TREC04.]
11. std-AB with clipping, with the range [0,1]
std-AB score: A x (standardised score) + B, clipped to [0,1] for EXTREMELY good/bad systems.
Let B = 0.5 ("average" system).
Let A = 0.15 so that at least about 89% of scores fall within [0.05, 0.95]: by Chebyshev's inequality, at most 1/3^2 of the scores lie more than 3 standard deviations (i.e. 3A) from B.
This formula with (A, B) is used in educational research: A=100, B=500 for SAT and GRE [Lodico+10]; A=10, B=50 for Japanese hensachi "standard scores".
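The std-AB transform above is simple enough to write down directly. The sketch below uses an invented toy matrix; with A = 0.15 and B = 0.5, scores with |z| <= 3 land inside [0.05, 0.95], and only extreme outliers are clipped.

```python
import numpy as np

def std_ab(raw, A=0.15, B=0.5):
    """std-AB [Sakai16ICTIR]: linearly rescale per-topic standardised
    scores as A*z + B, then clip to [0, 1] for extremely good/bad systems."""
    z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, ddof=1, keepdims=True)
    return np.clip(A * z + B, 0.0, 1.0)

# Toy topic-by-run matrix (rows = topics, columns = systems); values invented.
raw = np.array([
    [0.10, 0.05, 0.20, 0.15, 0.10],
    [0.60, 0.75, 0.80, 0.65, 0.70],
])
s = std_ab(raw)
# By Chebyshev's inequality, at least 1 - 1/3**2 (about 89%) of scores fall
# in [B - 3A, B + 3A] = [0.05, 0.95] when A = 0.15 and B = 0.5.
print(s.round(3))
```

When no clipping is triggered, each topic's std-AB scores average exactly B = 0.5, mirroring the "average system" interpretation.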
13. [Sakai16ICTIR] bottom line
• Advantages of score standardisation:
- removes topic hardness effects, enabling comparison across test collections
- normalisation becomes unnecessary
• Advantages of std-AB over std-CDF: low within-system variances, and therefore
- substantially lower swap rates (higher consistency across different data)
- realistic topic set sizes can be considered in topic set size design
Swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation. std-AB is a good alternative to std-CDF.
15. Topic set size design (1) [Sakai16IRJ]
• Provides answers to the following question: "I'm building a new test collection. How many topics should I create?"
• A prerequisite: a small topic-by-run score matrix based on pilot data, for estimating within-system variances.
• Three approaches (with easy-to-use Excel tools), based on [Nagata03]:
(1) paired t-test power
(2) one-way ANOVA power
(3) confidence interval width upper bound.
16. Topic set size design (2) [Sakai16IRJ]
Method: paired t-test. Input required: α (Type I error probability); β (Type II error probability); minDt (minimum detectable difference: whenever the difference between two systems is this much or larger, we want to guarantee 100(1-β)% power); a variance estimate for the score delta.
Method: one-way ANOVA. Input required: α; β; m (number of systems); minD (minimum detectable range: whenever the difference between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power); an estimate of the within-system variance under the homoscedasticity assumption.
Method: confidence intervals. Input required: α; δ (CI width upper bound: you want the CI for the difference between any system pair to be this much or smaller); a variance estimate for the score delta.
17. Topic set size design (3) [Sakai16IRJ]
Test collection designs should evolve based on past data.
[Diagram: start from a topic-by-run score matrix with pilot data (n0 topics, m runs); about 25 topics with runs from a few teams is probably sufficient [Sakai+16EVIA]. Estimate n1, the topic set size for TREC 201X, from the within-system variance estimate. The TREC 201X matrix (n1 topics, m runs) then yields a more accurate variance estimate, from which n2, the topic set size for TREC 201(X+1), is estimated; and so on.]
18. Topic set size design (4) [Sakai16IRJ]
In practice, you can deduce t-test-based and CI-based results from ANOVA-based results:
- ANOVA-based results for m=2 can be used instead of t-test-based results.
- ANOVA-based results for m=10 can be used instead of CI-based results.
Caveat: the ANOVA-based tool can only handle (α, β) = (0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).
19. Topic set size design with one-way ANOVA (1)
Method: one-way ANOVA. Input required: α (Type I error probability); β (Type II error probability); m (number of systems); minD (minimum detectable range: whenever the difference between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power); an estimate of the within-system variance under the homoscedasticity assumption.
Example situation: You plan to compare m systems with one-way ANOVA at α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the difference D between the best and the worst of the m systems satisfies D >= minD. You know from pilot data an estimate of the within-system variance for nDCG. What is the required number of topics n?
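The example situation above can be answered numerically by searching for the smallest n whose one-way ANOVA power reaches the target. This is a sketch, not the Excel tool from [Sakai16IRJ]: it assumes the worst-case mean configuration for a given range minD (two extreme means, the rest in the middle), which gives noncentrality λ = n*minD^2/(2σ^2) in the style of [Nagata03]; the variance value 0.05 in the usage line is invented.

```python
import numpy as np
from scipy.stats import f as f_dist, ncf

def topic_set_size(m, minD, var, alpha=0.05, power=0.80, n_max=2000):
    """Smallest n such that one-way ANOVA over m systems and n topics
    detects a best-minus-worst difference of minD with the given power.
    Worst case for a given range: two extreme means, the rest centred,
    so the noncentrality parameter is lam = n * minD**2 / (2 * var)."""
    df1 = m - 1
    for n in range(2, n_max + 1):
        df2 = m * (n - 1)                       # one-way ANOVA error df
        fcrit = f_dist.ppf(1 - alpha, df1, df2) # critical value under H0
        lam = n * minD**2 / (2.0 * var)
        if ncf.sf(fcrit, df1, df2, lam) >= power:  # power under H1
            return n
    raise ValueError("n_max exceeded")

# E.g. m = 2 systems, minD = 0.10, and a pilot within-system variance
# estimate of 0.05 (an invented value for this sketch):
n = topic_set_size(m=2, minD=0.10, var=0.05)
print(n)
```

Larger variances or smaller minD values push n up quickly, which is exactly why the low variances obtained with std-AB matter later in this deck.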
21. Estimating the variance (1)
We need an estimate of the within-system variance for topic set size design based on one-way ANOVA, and an estimate of the variance of the score delta for that based on the paired t-test or CI.
From a pilot topic-by-run score matrix, obtain the within-system variance estimate as a by-product of one-way ANOVA (use two-way ANOVA without replication for tighter estimates). Then, if possible, pool multiple estimates to enhance accuracy. (Multiple data sets were not available in this study.)
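The variance estimation step above can be sketched as follows. The residual mean square of a two-way ANOVA without replication removes the topic effect, which is why the slide recommends it for tighter estimates; the pooling rule weights by degrees of freedom. The pilot matrix here is randomly generated, purely for illustration.

```python
import numpy as np

def pilot_variance(matrix):
    """Within-system variance estimate from a pilot topic-by-run matrix
    (rows = topics, columns = systems): the residual mean square of a
    two-way ANOVA without replication, which removes topic effects."""
    x = np.asarray(matrix, dtype=float)
    n, m = x.shape                       # n topics, m systems
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    return (resid ** 2).sum() / ((n - 1) * (m - 1))

def pooled_variance(estimates, dofs):
    """Pool several variance estimates, weighting by degrees of freedom."""
    estimates, dofs = np.asarray(estimates), np.asarray(dofs)
    return (estimates * dofs).sum() / dofs.sum()

# Toy pilot data: 25 topics x 4 systems (values invented).
rng = np.random.default_rng(0)
pilot = rng.uniform(0, 1, size=(25, 4))
print(pilot_variance(pilot))
```

Note that if every system scored identically on each topic, the residual (and hence the estimate) would be exactly zero: topic hardness alone contributes nothing.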
23. Variances obtained from NTCIR-12 tasks
[Table: variance estimates per task and measure, omitted.]
Variances are substantially smaller after applying std-AB. Unnormalised measures can be handled without any problems.
24. Why the variances are smaller after applying std-AB
The initial estimate of n with the one-way ANOVA topic set size design is given by [Nagata03]
n ≈ 2λσ²/minD²
where λ is the required noncentrality parameter of a noncentral chi-square distribution (for (α, β)=(0.05, 0.20), λ is a tabulated constant) and σ² is the within-system variance estimate.
So n will be small if σ² is small. With std-AB, σ² is indeed small because A is small (e.g. 0.15): each topic's standardised scores have unit variance, so it can be shown that the std-AB variances scale with A².
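The scaling claim above is easy to verify numerically: per topic, standardised scores have variance 1, so after z' = A*z + B the per-topic variance is exactly A^2, regardless of how the raw scores were spread. The matrix below is randomly generated for this check only.

```python
import numpy as np

# Numeric check: std-AB per-topic variances equal A**2 (0.0225 for A = 0.15).
rng = np.random.default_rng(42)
raw = rng.uniform(0, 1, size=(50, 8))   # 50 topics x 8 systems, invented data
z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, ddof=1, keepdims=True)
A, B = 0.15, 0.5
std_ab = A * z + B                      # clipping omitted: never triggered here
per_topic_var = std_ab.var(axis=1, ddof=1)
print(per_topic_var.mean())
```

Plugging a variance of A^2 = 0.0225 instead of a typical raw-score variance into the n formula above is what makes the required topic set sizes shrink.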
25. System rankings before and after applying std-AB
[Table: rank correlations per task and measure, omitted.]
System rankings before and after applying std-AB are statistically equivalent. std-AB enables cross-collection comparisons without affecting within-collection comparisons!
28. MobileClick-2 iUnit ranking (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: iUnits ranked by relevance
• MEASURES:
nDCG [Jarvelin+02] = ( Σ_{r=1}^{l} g(r)/log(r+1) ) / ( Σ_{r=1}^{l} g*(r)/log(r+1) ), where g*(r) is the gain at rank r in an ideal list.
Q-measure [Sakai05AIRS04] = (1/R) Σ_{r=1}^{l} I(r) BR(r), where BR(r) = ( Σ_{k=1}^{r} I(k) + β Σ_{k=1}^{r} g(k) ) / ( r + β Σ_{k=1}^{r} g*(k) ) and I(r) = 1 if the item at rank r is relevant, 0 otherwise.
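The two measures above can be implemented directly from their definitions. This is an illustrative sketch: the gain values are invented, the log base (2) and β = 1 are common choices rather than values stated on this slide, and relevance I(r) is derived here as g(r) > 0.

```python
import numpy as np

def ndcg(gains, ideal_gains, l=None):
    """nDCG [Jarvelin+02]: sum_{r=1..l} g(r)/log2(r+1), normalised by
    the same sum over an ideal ranking g*(r). Base-2 log is a common choice."""
    l = l or len(gains)
    r = np.arange(1, l + 1)
    disc = np.log2(r + 1)
    return (gains[:l] / disc).sum() / (ideal_gains[:l] / disc).sum()

def q_measure(gains, ideal_gains, R, beta=1.0):
    """Q-measure [Sakai05AIRS04]: (1/R) * sum_r I(r) * BR(r), where
    BR(r) = (sum_{k<=r} I(k) + beta * sum_{k<=r} g(k))
          / (r + beta * sum_{k<=r} g*(k))."""
    gains = np.asarray(gains, dtype=float)
    I = (gains > 0).astype(float)        # 1 if relevant, 0 otherwise
    cum_I, cum_g = I.cumsum(), gains.cumsum()
    cum_gstar = np.asarray(ideal_gains, dtype=float).cumsum()
    r = np.arange(1, len(gains) + 1)
    br = (cum_I + beta * cum_g) / (r + beta * cum_gstar)
    return (I * br).sum() / R

# Toy ranked list with graded gains (values invented); R = 3 relevant iUnits.
gains = np.array([3.0, 0.0, 1.0, 2.0, 0.0])
ideal = np.array([3.0, 2.0, 1.0, 0.0, 0.0])
print(ndcg(gains, ideal), q_measure(gains, ideal, R=3))
```

Both measures return 1 for a perfect ranking and penalise relevant items placed low in the list, which is what makes them suitable for the iUnit ranking subtask.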
29. MobileClick-2 iUnit ranking (2) [Kato+16]
http://mobileclick.org/
[Histograms of per-topic scores:]
Raw nDCG: hard topics, easy topics.
std-AB nDCG: topics look more comparable to one another.
30. MobileClick-2 iUnit summarisation (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: two-layered textual summary
• MEASURES: M-measure, a variant of the intent-aware U-measure [Sakai+13SIGIR] [Kato+16].
M-measure is an unnormalised measure: it does not have the [0,1] range. (Intent-aware measures are difficult to normalise.)
31. MobileClick-2 iUnit summarisation (2) [Kato+16]
http://mobileclick.org/
[Histograms of per-topic scores; note the different scales of the y axes.]
Raw M-measure: unnormalised, unbounded, extremely large variances; topics definitely not comparable; clearly violates i.i.d.
std-AB M-measure: no problem!
32. STC (short text conversation) (1) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
• INPUT: a Weibo post (Chinese tweet)
• OUTPUT: a ranked list of Weibo posts from a repository that serve as valid responses to the input
• MEASURES:
nG@1 (normalised gain at 1, a.k.a. "nDCG@1")
nERR@10 [Chapelle+11]
P+ [Sakai06AIRS], a variant of Q-measure
34. STC (short text conversation) (3) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
[Histograms of per-topic scores:]
Raw nG@1: every score is 0, 1/3, or 1!
std-AB nG@1: looks like a continuous measure! Fewer 1's, no 0's.
35. QALab-2 (1) [Shibuki+16]
http://research.nii.ac.jp/qalab/
• INPUT: a multiple-choice Japanese National Center Test (university entrance exam) question on world history
• OUTPUT: the choice deemed correct by the system
• MEASURES: Boolean: 1 (correct) or 0 (incorrect)
36. QALab-2 (2) [Shibuki+16]
http://research.nii.ac.jp/qalab/
[Histograms of per-topic scores over 36 topics:]
Raw Boolean: every score is 0 or 1!
std-AB Boolean: two distinct ranges of values, [0.2999, 0.4460] and [0.6091, 0.9047]. (The QALab-2 organisers sorted the topics by the number of systems that answered correctly before providing the matrices to the present author.)
The normality assumption is still clearly violated: our topic set size design results should be interpreted as those for normally distributed measures that happen to have variances similar to raw/std-AB Boolean.
38. A few recommendations for MedNLPDoc (1)
With raw recall:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
39. A few recommendations for MedNLPDoc (2)
With std-AB recall:
create 80 topics to guarantee 80% power for
- minD=0.05 for m=2 systems
- minD=0.10 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
40. A few recommendations for MobileClick-2 (1)
MobileClick-2 had 100 topics at NTCIR-12.
Topic set size needs to be set by considering both
subtasks, but raw M-measure cannot be handled
due to extremely large variance. If we only
consider iUnit ranking raw nDCG@3:
create 90 topics to guarantee 80% power for
- minD=0.10 for m=10 English systems
- minD=0.10 for m=2 Japanese systems
41. A few recommendations for MobileClick-2 (2)
MobileClick-2 had 100 topics at NTCIR-12.
With std-AB nDCG@3 and std-AB M-measure:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=20 English and m=30 Japanese
iUnit ranking systems
- minD=0.05 for m=10 English and m=10 Japanese
iUnit summarisation systems
42. A few recommendations for STC (1)
With (a normally distributed measure whose variance is similar to that of) raw nG@1:
create 120 topics to guarantee 80% power for
- minD=0.20 for m=20 systems
STC had
100 topics at NTCIR-12.
43. A few recommendations for STC (2)
STC had
100 topics at NTCIR-12.
With std-AB nG@1:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=30 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
44. A few recommendations for QALab-2 (1)
QALab-2 had
36-41 topics at NTCIR-12:
not sufficient from the
viewpoint of power
With (a normally distributed measure whose variance is similar to that of) raw Boolean:
create 90 topics to guarantee 80% power for
- minD=0.20 for m=2 systems
45. A few recommendations for QALab-2 (2)
QALab-2 had
36-41 topics at NTCIR-12.
With (a normally distributed measure whose variance is similar to that of) std-AB Boolean:
create 40 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
47. Conclusions
• std-AB suppresses score variances and thereby enables test collection builders to consider realistic choices of topic set sizes.
• Topic set size design with std-AB can handle even unnormalised measures such as M-measure (U-measure, TBG, alpha-nDCG, ERR-IA, etc.).
• Even discrete measures such as nG@1 (0 or 1/3 or 1) look more continuous after applying std-AB, which makes the topic set size design results (based on normality and i.i.d. assumptions) perhaps a little more believable.
• Test collection designs should evolve based on experiences (i.e. variances pooled from past data).
49. How long will the standardisation factors for each topic remain valid?
Recall score standardisation: for the i-th system and j-th topic, subtract the per-topic mean from the raw score and divide by the per-topic standard deviation, so that the standardised score says how good system i is compared to the "average" system in standard deviation units. The per-topic means and standard deviations are the standardising factors.
The systems used to compute these factors will eventually become outdated, right?
50. We Want Web@NTCIR-13 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: at NTCIR-13 (Dec 2017), there is a frozen topic set and an NTCIR-13 fresh topic set. New runs from the NTCIR-13 systems are pooled for the frozen + fresh topics.]
51. We Want Web@NTCIR-13 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, continued: official NTCIR-13 results are discussed with the fresh topics. For the frozen topics, the qrels and standardising factors based on the NTCIR-13 systems are NOT released; for the fresh topics, they are released.]
52. We Want Web@NTCIR-14 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: at NTCIR-14 (Jun 2019), the frozen topic set is reused and an NTCIR-14 fresh topic set is added. New runs from the NTCIR-14 systems are pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics.]
53. We Want Web@NTCIR-14 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, continued: official NTCIR-14 results are discussed with the fresh topics. For the frozen topics, the qrels and standardising factors based on the NTCIR-13+14 systems are NOT released; for the fresh topics, those based on the NTCIR-(13+)14 systems are released.]
Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.
54. We Want Web@NTCIR-15 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: at NTCIR-15 (Dec 2020), the frozen topic set is reused again and an NTCIR-15 fresh topic set is added. New runs from the NTCIR-15 systems are pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics.]
55. We Want Web@NTCIR-15 (2)
http://www.thuir.cn/ntcirwww/
[Diagram, continued: official NTCIR-15 results are discussed with the fresh topics; the qrels and standardising factors based on the NTCIR-(13+14+)15 systems are released.]
Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.
56. We Want Web@NTCIR-15 (3)
http://www.thuir.cn/ntcirwww/
[Diagram: for the frozen topic set, qrels and standardising factors based on the NTCIR-13 systems, the NTCIR-13+14 systems, and the NTCIR-13+14+15 systems are all released by the end of NTCIR-15; for the fresh topics, those based on the NTCIR-(13+14+)15 systems are released.]
How do the standardisation factors for each frozen topic differ across the 3 rounds?
57. We Want Web@NTCIR-15 (4)
http://www.thuir.cn/ntcirwww/
[Diagram: the NTCIR-15 systems can be ranked using the frozen-topic standardising factors from each of the 3 rounds, i.e. those based on the NTCIR-13, NTCIR-13+14, and NTCIR-13+14+15 systems.]
How do the NTCIR-15 system rankings differ across the 3 rounds, with and without standardisation?
58. See you all in Tokyo, in August/December 2017!
59. Selected references (1)
[Aramaki+16] Aramaki et al.: Overview of the NTCIR-12 MedNLPDoc task, NTCIR-12
Proceedings, 2016.
[Carterette+08] Carterette et al.: Evaluation over Thousands of Queries, SIGIR 2008.
[Chapelle+11] Chapelle et al.: Intent-based Diversification of Web Search Results: Metrics
and Algorithms, Information Retrieval 14(6), 2011.
[Jarvelin+02] Jarvelin and Kekalainen: Cumulated Gain-based Evaluation of IR Techniques, ACM TOIS 20(4), 2002.
[Gilbert+79] Gilbert and Sparck Jones: Statistical Bases of Relevance Assessment for the 'Ideal' Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1979.
[Kato+16] Kato et al.: Overview of the NTCIR-12 MobileClick task, NTCIR-12 Proceedings,
2016.
[Nagata03] Nagata: How to Design the Sample Size (in Japanese), Asakura Shoten, 2003.
60. Selected references (2)
[Sakai05AIRS04] Sakai: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS
2004 (LNCS 3411), 2005.
[Sakai06AIRS] Sakai: Bootstrap-based Comparisons of IR Metrics for Finding One Relevant
Document, AIRS 2006 (LNCS 4182).
[Sakai+13SIGIR] Sakai and Dou: Summaries, Ranked Retrieval and Sessions: A Unified
Framework for Information Access Evaluation, SIGIR 2013.
[Sakai16ICTIR] Sakai: A Simple and Effective Approach to Score Standardisation, ICTIR 2016.
[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice (tutorial),
ICTIR 2016.
[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval, 19(3), 2016. OPEN ACCESS:
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai+16EVIA] Sakai and Shang: On Estimating Variances for Topic Set Size Design, EVIA
2016. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/evia/02-
EVIA2016-SakaiT.pdf
61. Selected references (3)
[Shang+16] Shang et al.: Overview of the NTCIR-12 short text conversation task, NTCIR-12
Proceedings, 2016.
[Shibuki+16] Shibuki et al.: Overview of the NTCIR-12 QA Lab-2 task, NTCIR-12 Proceedings,
2016.
[SparckJones+75] Sparck Jones and Van Rijsbergen: Report on the Need for and Provision of an 'Ideal' Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1975.
[Voorhees+05] Voorhees and Harman: TREC: Experiment and Evaluation in Information
Retrieval, The MIT Press, 2005.
[Voorhees09] Voorhees: Topic Set Size Redux, SIGIR 2009.
[Webber+08SIGIR] Webber, Moffat, Zobel: Score standardisation for inter-collection
comparison of retrieval systems, SIGIR 2008.
[Webber+08CIKM] Webber, Moffat, Zobel: Statistical power in retrieval experimentation,
CIKM 2008.