The Effect of
Score Standardisation on
Topic Set Size Design
@tetsuyasakai
Waseda University, Japan
http://www.f.waseda.jp/tetsuya/sakai.html
November 30, 2016@AIRS 2016, Beijing.
TALK OUTLINE
1. Score standardisation
2. Topic set size design
3. NTCIR-12 tasks
4. Results
5. Conclusions
6. Future work: NTCIR WWW
Hard topics, easy topics
[Figure: bar chart of raw scores (0 to 1) for Systems 1-5 on Topic 1 and Topic 2; one topic is hard (mean = 0.12) and the other is easy (mean = 0.70)]
Low-variance topics, high-variance topics
[Figure: bar chart of raw scores (0 to 1) for Systems 1-5 on Topic 1 and Topic 2; one topic has standard deviation = 0.08 and the other has standard deviation = 0.29]
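Score standardisation [Webber+08]
Standardised score for the i-th system and the j-th topic:
$$\mathit{std}_{ij} = \frac{\mathit{raw}_{ij} - m_{\cdot j}}{s_{\cdot j}}$$
where $m_{\cdot j}$ and $s_{\cdot j}$ are the mean and the standard deviation of the raw scores for topic j over the systems (the standardising factors).
Subtract the mean; divide by the standard deviation.
How good is system i compared to "average", in standard deviation units?
[Figure: the systems-by-topics raw score matrix is mapped to a std score matrix of the same shape]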
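The per-topic standardisation described above can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `standardise`, the toy score matrix, and the use of the sample standard deviation (rather than the population one) are all assumptions, and a topic on which every system scores the same would need separate handling because its standard deviation is zero.

```python
from statistics import mean, stdev

def standardise(raw):
    """raw[i][j] is the raw score of system i for topic j (hypothetical layout)."""
    n_topics = len(raw[0])
    std = [[0.0] * n_topics for _ in raw]
    for j in range(n_topics):
        column = [row[j] for row in raw]        # all systems' scores for topic j
        m_j, s_j = mean(column), stdev(column)  # standardising factors for topic j
        for i, score in enumerate(column):
            std[i][j] = (score - m_j) / s_j     # subtract mean; divide by std. dev.
    return std

# Toy data: 5 systems, a hard topic (column 0) and an easy topic (column 1).
raw = [
    [0.05, 0.30],
    [0.10, 0.90],
    [0.15, 0.60],
    [0.20, 0.95],
    [0.10, 0.75],
]
std = standardise(raw)
```

After this step every topic column of `std` has mean 0 and standard deviation 1, regardless of how hard or how variable the topic was.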
Now for every topic, mean = 0, variance = 1.
[Figure: bar chart of standardised scores (-2 to 2) for Systems 1-5 on Topic 1 and Topic 2]
Comparisons across different topic sets and test collections are possible!
Standardised scores have the (-∞, ∞) range and are not very convenient.
Transform them back into the [0,1] range!
std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]
[Figure: std-CDF nDCG (y axis, 0 to 1) plotted against raw nDCG (x axis, 0 to 1) for TREC04; each curve is a topic, with 110 runs represented as dots]
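As a rough illustration of std-CDF, the sketch below maps a standardised score to [0,1] with the standard normal CDF. The function name `std_cdf` is hypothetical; `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist

def std_cdf(z):
    """Map a standardised score z to the [0,1] range via the standard normal CDF."""
    return NormalDist().cdf(z)

# The mapping is steep near z = 0 and flat in the tails.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, round(std_cdf(z), 3))
```

Because the curve is steepest around z = 0, small differences between near-average systems are stretched while differences among extreme systems are compressed.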
std-CDF emphasises moderately high and moderately low performers. Is this a good thing?
[Figure: std-CDF nDCG plotted against raw nDCG for TREC04, with the "moderately high" and "moderately low" regions of the curves highlighted]
std-AB: How about a simple linear transformation? [Sakai16ICTIR]
[Figure: std-CDF nDCG, std-AB nDCG (A=0.10) and std-AB nDCG (A=0.15) plotted against raw nDCG for TREC04]
std-AB with clipping, with the range [0,1]
$$\text{std-AB}_{ij} = A \cdot \mathit{std}_{ij} + B$$
Let B=0.5 ("average" system).
Let A=0.15 so that at least 89% of scores fall within [0.05, 0.95] (Chebyshev's inequality).
For EXTREMELY good/bad systems, scores above 1 are clipped to 1 and scores below 0 are clipped to 0.
This formula with (A,B) is used in educational research: A=100, B=500 for SAT, GRE [Lodico+10]; A=10, B=50 for Japanese hensachi "standard scores".
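A minimal sketch of std-AB with clipping, together with the Chebyshev arithmetic behind the 89% figure. The function name `std_ab` and the variable names are assumptions made for illustration.

```python
def std_ab(z, a=0.15, b=0.5):
    """Linear transformation A*z + B of a standardised score z, clipped to [0,1]."""
    return min(1.0, max(0.0, a * z + b))

# [0.05, 0.95] is B +/- 3A when A = 0.15 and B = 0.5, i.e. k = 3 standard deviations.
k = 0.45 / 0.15
# Chebyshev: at least 1 - 1/k^2 of the scores lie within k standard deviations.
chebyshev_lower_bound = 1 - 1 / k ** 2  # about 0.889

print(std_ab(0.0), std_ab(10.0), std_ab(-10.0))
```

Clipping only affects standardised scores beyond ±1/(2A), which is about 3.33 standard deviations when A = 0.15.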
In practice, clipping does not happen often.
[Figure: two panels of per-topic scores (y axis, 0 to 1) against Topic ID (x axis, 1 to 49): TREC04 raw nDCG and TREC04 std-AB nDCG]
[Sakai16ICTIR] bottom line
• Advantages of score standardisation:
- removes topic hardness, enables comparison across test collections
- normalisation becomes unnecessary
• Advantages of std-AB over std-CDF:
Low within-system variances and therefore
- substantially lower swap rates (higher consistency across different data)
- enables us to consider realistic topic set sizes in topic set size design
Swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation.
std-AB is a good alternative to std-CDF.
Topic set size design (1) [Sakai16IRJ]
• Provides answers to the following question:
"I'm building a new test collection. How many topics should I create?"
• A prerequisite: a small topic-by-run score matrix based on pilot data, for estimating within-system variances.
• Three approaches (with easy-to-use Excel tools), based on [Nagata03]:
(1) paired t-test power
(2) one-way ANOVA power
(3) confidence interval width upper bound.
Topic set size design (2) [Sakai16IRJ]
Method: input required
• Paired t-test: α (Type I error probability), β (Type II error probability), minDt (minimum detectable difference: whenever the difference between two systems is this much or larger, we want to guarantee 100(1-β)% power), $\hat{\sigma}_t^2$ (variance estimate for the score delta).
• One-way ANOVA: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the difference between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), $\hat{\sigma}^2$ (estimate of the within-system variance under the homoscedasticity assumption).
• Confidence intervals: α (Type I error probability), δ (CI width upper bound: you want the CI for the difference between any system pair to be this much or smaller), $\hat{\sigma}_t^2$ (variance estimate for the score delta).
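To give a feel for how these inputs turn into a topic set size, here is a rough sketch that swaps in a textbook normal approximation. It is not the procedure implemented in the Excel tools of [Sakai16IRJ], and exact t-based calculations generally give somewhat larger values. The function names and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z(p):
    """Upper 100p% point of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - p)

def t_test_topic_set_size(alpha, beta, min_dt, var_t):
    """Normal-approximation n for a two-sided paired test with power 1 - beta."""
    return math.ceil((z(alpha / 2) + z(beta)) ** 2 * var_t / min_dt ** 2)

def ci_topic_set_size(alpha, delta, var_t):
    """Normal-approximation n such that the CI width 2*z*sqrt(var_t/n) is at most delta."""
    return math.ceil(4 * z(alpha / 2) ** 2 * var_t / delta ** 2)

# Hypothetical inputs: variance of the score delta = 0.10.
n_t = t_test_topic_set_size(alpha=0.05, beta=0.20, min_dt=0.10, var_t=0.10)
n_ci = ci_topic_set_size(alpha=0.05, delta=0.20, var_t=0.10)
print(n_t, n_ci)
```

In both formulas the required n scales linearly with the variance estimate, which is why a transformation that lowers the variance also lowers the required number of topics.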
Topic set size design (3) [Sakai16IRJ]
Test collection designs should evolve based on past data.
[Diagram: a topic-by-run score matrix with pilot data (n0 topics; about 25 topics with runs from a few teams is probably sufficient [Sakai+16EVIA]) gives a within-system variance estimate, from which n1 is estimated. TREC 201X is then run with n1 topics and m runs. Its score matrix gives a more accurate within-system variance estimate, from which n2 is estimated for TREC 201(X+1).]
Topic set size design (4) [Sakai16IRJ]
In practice, you can deduce t-test-based and CI-based results from ANOVA-based results:
- ANOVA-based results for m=2 can be used instead of t-test-based results
- ANOVA-based results for m=10 can be used instead of CI-based results
Caveat: the ANOVA-based tool can only handle (α, β)=(0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).
Topic set size design with one-way ANOVA (1)
Method: one-way ANOVA. Input required: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the difference between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), $\hat{\sigma}^2$ (estimate of the within-system variance under the homoscedasticity assumption).
Example situation: You plan to compare m systems with one-way ANOVA with α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the difference between the best and the worst systems >= minD.
You know from pilot data that the within-system variance for nDCG is around $\hat{\sigma}^2$.
What is the required number of topics n?
[Figure: m systems ordered from best to worst; the difference D between the best and the worst systems satisfies minD <= D]
Topic set size design with one-way ANOVA (2)
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
will do this for you! Use the appropriate sheet for a given (α, β) and fill out the orange cells.
[Figure: screenshot of the spreadsheet; in the example shown, n=20 is what you want!]
Estimating the variance (1)
We need $\hat{\sigma}^2$ for topic set size design based on one-way ANOVA and $\hat{\sigma}_t^2$ for that based on the paired t-test or CI.
From a pilot topic-by-run score matrix, obtain the residual variance as a by-product of one-way ANOVA (use two-way ANOVA without replication for tighter estimates).
Then, if possible, pool multiple estimates to enhance accuracy (pooled estimate). Multiple data sets were not available in this study.
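The variance estimation step can be sketched as below. This is an illustrative reading rather than the talk's own formulas, which appear only as images on the slide: the residual variance is the standard one-way ANOVA mean square within systems, and the pooling weights (number of topics minus one) are an assumption. All function and variable names are hypothetical.

```python
def within_system_variance(x):
    """Residual variance of one-way ANOVA; x[i][j] is the score of system i on topic j."""
    m, n = len(x), len(x[0])
    ss_within = 0.0
    for row in x:
        row_mean = sum(row) / n  # mean score of this system over topics
        ss_within += sum((v - row_mean) ** 2 for v in row)
    return ss_within / (m * (n - 1))

def pooled_variance(estimates):
    """Pool (n_topics, variance) pairs; weighting by n_topics - 1 is an assumption."""
    numerator = sum((n - 1) * v for n, v in estimates)
    denominator = sum(n - 1 for n, _ in estimates)
    return numerator / denominator

# Toy pilot matrix: 2 systems x 2 topics.
pilot = [[0.1, 0.3], [0.2, 0.6]]
print(within_system_variance(pilot))
print(pooled_variance([(11, 0.02), (21, 0.05)]))
```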
Variances obtained from NTCIR-12 tasks
[Table: variance estimates for each NTCIR-12 task and measure, with columns mC and nC]
Variances are substantially smaller after applying std-AB.
Unnormalised measures can be handled without any problems.
Why the variances are smaller after applying std-AB
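The initial estimate of n with the one-way ANOVA topic set size design is given by [Nagata03]:
$$n \approx \frac{2\lambda\hat{\sigma}^2}{\mathit{minD}^2}$$
where λ is the noncentrality parameter of a noncentral chi-square distribution; for (α, β)=(0.05, 0.20), $\lambda \approx 4.860 + 3.584\sqrt{m-1}$.
So n will be small if $\hat{\sigma}^2$ is small.
With std-AB, $\hat{\sigma}^2$ is indeed small because A is small (e.g. 0.15) and it can be shown that $\hat{\sigma}^2$ is bounded in terms of $A^2$.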
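The dependence of n on the variance can be sketched numerically. The code below assumes the initial estimate has the form n = 2λσ²/minD² and that, for (α, β) = (0.05, 0.20), λ is approximated by 4.860 + 3.584√(m-1). Both are reconstructions for illustration, since the slide's equations are images, and should be checked against [Nagata03] and [Sakai16IRJ]. The function name and the example variances are hypothetical.

```python
import math

def anova_initial_topic_set_size(m, min_d, var, lam=None):
    """Initial estimate of n for one-way ANOVA topic set size design (assumed form)."""
    if lam is None:
        # Assumed approximation of the noncentrality parameter for (alpha, beta) = (0.05, 0.20).
        lam = 4.860 + 3.584 * math.sqrt(m - 1)
    return math.ceil(2 * lam * var / min_d ** 2)

# Hypothetical variances: 0.09 for a raw measure, 0.0225 (= 0.15^2) after std-AB.
n_raw = anova_initial_topic_set_size(m=2, min_d=0.10, var=0.09)
n_std_ab = anova_initial_topic_set_size(m=2, min_d=0.10, var=0.0225)
print(n_raw, n_std_ab)
```

With everything else fixed, quartering the variance roughly quarters the required number of topics, which is the effect the slide attributes to std-AB.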
System rankings before and after applying std-AB
[Table: comparison of system rankings before and after applying std-AB for each task, with columns mC and nC]
System rankings before and after applying std-AB are statistically equivalent.
std-AB enables cross-collection comparisons without affecting within-collection comparisons!
MedNLPDoc (1) [Aramaki+16]
https://sites.google.com/site/mednlpdoc/
• INPUT: a medical record
• OUTPUT: ICD (international classification of diseases) codes of possible disease names
• MEASURES: precision and recall of ICDs
[Figure: topic-by-run score matrices: precision, 14 runs and 78 topics; recall, 14 runs and 76 topics]
MedNLPDoc (2) [Aramaki+16]
https://sites.google.com/site/mednlpdoc/
76 topics
Raw recall:
- Lots of 0's
- Some 1's
std-AB recall:
- No 0's
- Fewer 1's
[Figure: histograms of raw recall and std-AB recall; x axis: score range]
MobileClick-2 iUnit ranking (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: iUnits ranked by relevance
• MEASURES:
nDCG [Jarvelin+02]
$$\mathrm{nDCG} = \frac{\sum_{r=1}^{l} g(r)/\log(r+1)}{\sum_{r=1}^{l} g^*(r)/\log(r+1)}$$
Q-measure [Sakai05AIRS04]
$$Q = \frac{1}{R}\sum_{r} I(r)\,BR(r) \quad \text{where} \quad BR(r) = \frac{\sum_{k=1}^{r} I(k) + \beta\sum_{k=1}^{r} g(k)}{r + \beta\sum_{k=1}^{r} g^*(k)}$$
Here $g^*(r)$ is the gain at r in an ideal list, and $I(r)$ is 1 if the item at r is relevant, 0 otherwise.
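The two measures can be sketched as follows. This is a minimal illustration under stated assumptions: an item counts as relevant when its gain is positive, the logarithm base is 2 (the base cancels in nDCG), β defaults to 1, and the ideal list is padded with zero gains when it is shorter than the ranked list. The function names are hypothetical.

```python
import math

def ndcg(gains, ideal_gains, l):
    """nDCG at cutoff l; gains[r-1] is g(r) and ideal_gains[r-1] is g*(r)."""
    dcg = sum(g / math.log2(r + 1) for r, g in enumerate(gains[:l], start=1))
    idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal_gains[:l], start=1))
    return dcg / idcg

def q_measure(gains, ideal_gains, num_rel, beta=1.0):
    """Q-measure: average of the blended ratio BR(r) over ranks holding relevant items."""
    total, cum_rel, cum_gain, cum_ideal = 0.0, 0, 0.0, 0.0
    for r, g in enumerate(gains, start=1):
        cum_gain += g
        cum_ideal += ideal_gains[r - 1] if r <= len(ideal_gains) else 0.0
        if g > 0:  # I(r) = 1: assume relevant iff gain is positive
            cum_rel += 1
            total += (cum_rel + beta * cum_gain) / (r + beta * cum_ideal)
    return total / num_rel

print(ndcg([0, 3], [3], l=2))
print(q_measure([0, 3], [3], num_rel=1))
```

A perfect ranking scores 1 under both measures; pushing the only relevant item from rank 1 to rank 2 lowers both.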
MobileClick-2 iUnit ranking (2) [Kato+16]
http://mobileclick.org/
Raw nDCG:
- hard topics, easy topics
std-AB nDCG:
- topics look more comparable to one another
[Figure: two panels, raw nDCG and std-AB nDCG, each with a y axis from 0 to 700]
MobileClick-2 iUnit summarisation (1) [Kato+16]
http://mobileclick.org/
• INPUT: iUnits (relevant nuggets for a mobile search summary)
• OUTPUT: two-layered textual summary
• MEASURES:
M-measure, a variant of the intent-aware U-measure [Sakai+13SIGIR]
M-measure is an unnormalised measure: it does not have the [0,1] range.
(Intent-aware measures are difficult to normalise.)
[Figure from [Kato+16]]
MobileClick-2 iUnit summarisation (2) [Kato+16]
http://mobileclick.org/
Raw M-measure:
- unnormalised, unbounded, extremely large variances
- topics definitely not comparable (note the different scale of the y axis)
std-AB M-measure:
- no problem!
[Figure: two panels, raw M-measure and std-AB M-measure; axis labels include 40-45 and 0.9-1.0; annotation: "Clearly violates i.i.d."]
STC (short text conversation) (1) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
• INPUT: a Weibo post (Chinese tweet)
• OUTPUT: a ranked list of Weibo posts from a repository that serve as valid responses to the input
• MEASURES:
nG@1 (normalised gain at 1, a.k.a. "nDCG@1")
nERR@10 [Chapelle+11]
P+ [Sakai06AIRS], a variant of Q-measure
STC (short text conversation) (2) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
Raw P+:
- Lots of 1's and 0's
- Gap in the [0.625, 1] range (see previous slide)
std-AB P+:
- Looks like a continuous measure!
- Fewer 1's
- No 0's
[Figure: raw P+ and std-AB P+ scores (0 to 1) for topics 1-100, each with a histogram (y axis from 0 to 1500)]
STC (short text conversation) (3) [Shang+16]
http://ntcir12.noahlab.com.hk/stc.htm
Raw nG@1:
- 0 or 1/3 or 1!
std-AB nG@1:
- Looks like a continuous measure!
- Fewer 1's
- No 0's
[Figure: histograms of raw nG@1 and std-AB nG@1]
QALab-2 (1) [Shibuki+16]
http://research.nii.ac.jp/qalab/
• INPUT: a multiple-choice Japanese National Center Test (university entrance exam) question on world history
• OUTPUT: choice deemed correct by system
• MEASURES:
Boolean: 1 (correct) or 0 (incorrect)
QALab-2 (2) [Shibuki+16]
http://research.nii.ac.jp/qalab/
36 topics
Raw Boolean:
- 0 or 1!
std-AB Boolean:
- Two distinct ranges of values: [0.2999, 0.4460] and [0.6091, 0.9047]
Normal assumption still clearly violated: our topic set size design results should be interpreted as those for normally-distributed measures that happen to have variances similar to Raw/std-AB Boolean.
QALab-2 organisers sorted the topics by #systems_correctly_answered before providing the matrices to the present author.
[Figure: histograms of raw Boolean and std-AB Boolean]
A few recommendations for MedNLPDoc (1)
With raw recall:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
A few recommendations for MedNLPDoc (2)
With std-AB recall:
create 80 topics to guarantee 80% power for
- minD=0.05 for m=2 systems
- minD=0.10 for m=50 systems
MedNLPDoc had
76-78 topics at NTCIR-12.
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
A few recommendations for MobileClick-2 (1)
MobileClick-2 had 100 topics at NTCIR-12.
Topic set size needs to be set by considering both
subtasks, but raw M-measure cannot be handled
due to extremely large variance. If we only
consider iUnit ranking raw nDCG@3:
create 90 topics to guarantee 80% power for
- minD=0.10 for m=10 English systems
- minD=0.10 for m=2 Japanese systems
A few recommendations for MobileClick-2 (2)
MobileClick-2 had 100 topics at NTCIR-12.
With std-AB nDCG@3 and std-AB M-measure:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=20 English and m=30 Japanese
iUnit ranking systems
- minD=0.05 for m=10 English and m=10 Japanese
iUnit summarisation systems
A few recommendations for STC (1)
With (a normally distributed measure whose variance is similar to that of) raw nG@1:
create 120 topics to guarantee 80% power for
- minD=0.20 for m=20 systems
STC had
100 topics at NTCIR-12.
A few recommendations for STC (2)
STC had
100 topics at NTCIR-12.
With std-AB nG@1:
create 100 topics to guarantee 80% power for
- minD=0.10 for m=30 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
A few recommendations for QALab-2 (1)
QALab-2 had
36-41 topics at NTCIR-12:
not sufficient from the
viewpoint of power
With (a normally distributed measure whose variance is similar to that of) raw Boolean:
create 90 topics to guarantee 80% power for
- minD=0.20 for m=2 systems
A few recommendations for QALab-2 (2)
QALab-2 had
36-41 topics at NTCIR-12.
With (a normally distributed measure whose variance is similar to that of) std-AB Boolean:
create 40 topics to guarantee 80% power for
- minD=0.10 for m=2 systems
- minD=0.20 for m=50 systems
Topic set size choices look much
more practical when std-AB is
used (due to low variance)
Conclusions
• std-AB suppresses score variances and thereby enables test collection builders to consider realistic choices of topic set sizes.
• Topic set size design with std-AB can handle even unnormalised measures such as M-measure (U-measure, TBG, alpha-nDCG, ERR-IA etc.).
• Even discrete measures such as nG@1 (0 or 1/3 or 1) look more continuous after applying std-AB, which makes the topic set size design results (based on normality and i.i.d. assumptions) perhaps a little more believable.
• Test collection designs should evolve based on experiences (i.e. variances pooled from past data).
How long will the standardisation factors for
each topic remain valid?
[Figure: the systems-by-topics raw score matrix is standardised per topic (subtract mean; divide by standard deviation), as in the score standardisation slide]
The standardising factors for each topic are computed from a particular set of systems. These systems will eventually become outdated, right?
We Want Web@NTCIR-13 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017) has a frozen topic set and an NTCIR-13 fresh topic set. New runs from NTCIR-13 systems are pooled for the frozen + fresh topics.]
We Want Web@NTCIR-13 (2)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017). Official NTCIR-13 results are discussed with the fresh topics. For the frozen topic set, qrels + std. factors based on NTCIR-13 systems are NOT released. For the NTCIR-13 fresh topic set, qrels + std. factors based on NTCIR-13 systems are released.]
We Want Web@NTCIR-14 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019). Both rounds use the frozen topic set, and each round has its own fresh topic set. New runs from NTCIR-14 systems are pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics.]
We Want Web@NTCIR-14 (2)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019). Official NTCIR-14 results are discussed with the fresh topics. For the frozen topic set, qrels + std. factors based on NTCIR-13+14 systems are NOT released. For the NTCIR-14 fresh topic set, qrels + std. factors based on NTCIR-(13+)14 systems are released.]
Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (1)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020). All three rounds use the frozen topic set, and each round has its own fresh topic set. New runs from NTCIR-15 systems are pooled for the frozen + fresh topics; revived runs are pooled for the fresh topics.]
We Want Web@NTCIR-15 (2)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020). Official NTCIR-15 results are discussed with the fresh topics. For the NTCIR-15 fresh topic set, qrels + std. factors based on NTCIR-(13+14+)15 systems are released.]
Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.
We Want Web@NTCIR-15 (3)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020). Official NTCIR-15 results are discussed with the fresh topics. For the frozen topic set, three versions of qrels + std. factors are released, based on NTCIR-13, NTCIR-13+14 and NTCIR-13+14+15 systems. For the NTCIR-15 fresh topic set, qrels + std. factors based on NTCIR-(13+14+)15 systems are released.]
How do the standardisation factors for each frozen topic differ across the 3 rounds?
We Want Web@NTCIR-15 (4)
http://www.thuir.cn/ntcirwww/
[Diagram: NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020). Official NTCIR-15 results are discussed with the fresh topics. For the frozen topic set, three versions of qrels + std. factors are released, based on NTCIR-13, NTCIR-13+14 and NTCIR-13+14+15 systems; each version yields an NTCIR-15 systems ranking. For the NTCIR-15 fresh topic set, qrels + std. factors based on NTCIR-(13+14+)15 systems are released.]
How do the NTCIR-15 system rankings differ across the 3 rounds, with and w/o standardisation?
See you all in Tokyo, in August/December 2017!
Selected references (1)
[Aramaki+16] Aramaki et al.: Overview of the NTCIR-12 MedNLPDoc task, NTCIR-12
Proceedings, 2016.
[Carterette+08] Carterette et al.: Evaluation over Thousands of Queries, SIGIR 2008.
[Chapelle+11] Chapelle et al.: Intent-based Diversification of Web Search Results: Metrics
and Algorithms, Information Retrieval 14(6), 2011.
[Jarvelin+02] Järvelin and Kekäläinen: Cumulated Gain-based Evaluation of IR Techniques,
ACM TOIS 20(4), 2002.
[Gilbert+79] Gilbert and Sparck Jones: Statistical Bases of Relevance Assessment for the
`IDEAL' Information Retrieval Test Collection, Computer Laboratory, University of
Cambridge, 1979.
[Kato+16] Kato et al.: Overview of the NTCIR-12 MobileClick task, NTCIR-12 Proceedings,
2016.
[Nagata03] Nagata: How to Design the Sample Size (in Japanese), Asakura Shoten, 2003.
Selected references (2)
[Sakai05AIRS04] Sakai: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS
2004 (LNCS 3411), 2005.
[Sakai06AIRS] Sakai: Bootstrap-based Comparisons of IR Metrics for Finding One Relevant
Document, AIRS 2006 (LNCS 4182).
[Sakai+13SIGIR] Sakai and Dou: Summaries, Ranked Retrieval and Sessions: A Unified
Framework for Information Access Evaluation, SIGIR 2013.
[Sakai16ICTIR] Sakai: A simple and effective approach to score standardisation, ICTIR 2016.
[Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice (tutorial),
ICTIR 2016.
[Sakai16IRJ] Sakai: Topic set size design, Information Retrieval, 19(3), 2016. OPEN ACCESS:
http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai+16EVIA] Sakai and Shang: On Estimating Variances for Topic Set Size Design, EVIA 2016.
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/evia/02-EVIA2016-SakaiT.pdf
Selected references (3)
[Shang+16] Shang et al.: Overview of the NTCIR-12 short text conversation task, NTCIR-12
Proceedings, 2016.
[Shibuki+16] Shibuki et al.: Overview of the NTCIR-12 QA Lab-2 task, NTCIR-12 Proceedings,
2016.
[SparckJones+75] Sparck Jones and Van Rijsbergen: Report on the Need for and Provision
of an `Ideal' Information Retrieval Test Collection, Computer Laboratory, University of
Cambridge, 1975.
[Voorhees+05] Voorhees and Harman: TREC: Experiment and Evaluation in Information
Retrieval, The MIT Press, 2005.
[Voorhees09] Voorhees: Topic Set Size Redux, SIGIR 2009.
[Webber+08SIGIR] Webber, Moffat, Zobel: Score standardisation for inter-collection
comparison of retrieval systems, SIGIR 2008.
[Webber+08CIKM] Webber, Moffat, Zobel: Statistical power in retrieval experimentation,
CIKM 2008.

More Related Content

What's hot

Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksKevin Lee
Ā 
The jackknife and bootstrap
The jackknife and bootstrapThe jackknife and bootstrap
The jackknife and bootstrapPaul Gardner
Ā 
Imprecision in learning: an overview
Imprecision in learning: an overviewImprecision in learning: an overview
Imprecision in learning: an overviewSebastien Destercke
Ā 
Admission in India
Admission in IndiaAdmission in India
Admission in IndiaEdhole.com
Ā 
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„HyeonSeok Choi
Ā 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detectionģ²  ź¹€
Ā 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for RecommendationOlivier Jeunen
Ā 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Olivier Jeunen
Ā 

What's hot (10)

Machine Learning : why we should know and how it works
Machine Learning : why we should know and how it worksMachine Learning : why we should know and how it works
Machine Learning : why we should know and how it works
Ā 
The jackknife and bootstrap
The jackknife and bootstrapThe jackknife and bootstrap
The jackknife and bootstrap
Ā 
Imprecision in learning: an overview
Imprecision in learning: an overviewImprecision in learning: an overview
Imprecision in learning: an overview
Ā 
Admission in India
Admission in IndiaAdmission in India
Admission in India
Ā 
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„
ė°ģ“ķ„° ź³¼ķ•™ ģž…ė¬ø 5ģž„
Ā 
[ē³»åˆ—ę“»å‹•] Machine Learning ę©Ÿå™Øå­øēæ’čŖ²ē؋
[ē³»åˆ—ę“»å‹•] Machine Learning ę©Ÿå™Øå­øēæ’čŖ²ē؋[ē³»åˆ—ę“»å‹•] Machine Learning ę©Ÿå™Øå­øēæ’čŖ²ē؋
[ē³»åˆ—ę“»å‹•] Machine Learning ę©Ÿå™Øå­øēæ’čŖ²ē؋
Ā 
Anomaly detection
Anomaly detectionAnomaly detection
Anomaly detection
Ā 
Counterfactual Learning for Recommendation
Counterfactual Learning for RecommendationCounterfactual Learning for Recommendation
Counterfactual Learning for Recommendation
Ā 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Ā 
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
MUMS: Transition & SPUQ Workshop - Some Strategies to Quantify Uncertainty fo...
Ā 

Viewers also liked

Needle control_Chittagong Asian Apparels Ltd
Needle control_Chittagong Asian Apparels LtdNeedle control_Chittagong Asian Apparels Ltd
Needle control_Chittagong Asian Apparels LtdBarua Sujan
Ā 
Threads needles
Threads needlesThreads needles
Threads needlessri dhar
Ā 
Needle broken procedure sample
Needle broken procedure sampleNeedle broken procedure sample
Needle broken procedure sampleKien Ly
Ā 
Pull Test Presentation
Pull Test PresentationPull Test Presentation
Pull Test PresentationIheanyi Ekechukwu
Ā 
Buttons.Ppt Powerpoint
Buttons.Ppt PowerpointButtons.Ppt Powerpoint
Buttons.Ppt Powerpointswampfoxoz
Ā 
Garment management system
Garment management systemGarment management system
Garment management systemBipul Roy Bpl
Ā 
Different types of button are used in garments
Different types of button are used in garmentsDifferent types of button are used in garments
Different types of button are used in garmentsHindustan University
Ā 
Pe 6421 chapter 3 iso 9000 quality system oct 13 2014
Pe 6421 chapter 3  iso 9000 quality system oct 13  2014Pe 6421 chapter 3  iso 9000 quality system oct 13  2014
Pe 6421 chapter 3 iso 9000 quality system oct 13 2014Charlton Inao
Ā 
Sewing thread and its types
Sewing thread and its typesSewing thread and its types
Sewing thread and its typesRupali Arya
Ā 
Presentation for fit
Presentation for fitPresentation for fit
Presentation for fitOptiTex
Ā 
APPAREL QUALITY STANDARD AND IMPLEMENTATION
APPAREL QUALITY STANDARD AND IMPLEMENTATIONAPPAREL QUALITY STANDARD AND IMPLEMENTATION
APPAREL QUALITY STANDARD AND IMPLEMENTATIONGOPALAKRISHNAN DURAISAMY
Ā 
trims and accesories quality processes
trims and accesories quality processestrims and accesories quality processes
trims and accesories quality processesSanjeet Sudarshan
Ā 
Parametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use whichParametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use whichGƶnenƧ DalgıƧ
Ā 

Viewers also liked (20)

Children Product Safety v1
Children Product Safety  v1Children Product Safety  v1
Children Product Safety v1
Ā 
Needle control_Chittagong Asian Apparels Ltd
Needle control_Chittagong Asian Apparels LtdNeedle control_Chittagong Asian Apparels Ltd
Needle control_Chittagong Asian Apparels Ltd
Ā 
qms
qmsqms
qms
Ā 
Threads needles
Threads needlesThreads needles
Threads needles
Ā 
Needle broken procedure sample
Needle broken procedure sampleNeedle broken procedure sample
Needle broken procedure sample
Ā 
50255181 count
50255181 count50255181 count
50255181 count
Ā 
Pull Test Presentation
Pull Test PresentationPull Test Presentation
Pull Test Presentation
Ā 
Buttons.Ppt Powerpoint
Buttons.Ppt PowerpointButtons.Ppt Powerpoint
Buttons.Ppt Powerpoint
Ā 
Garment management system
Garment management systemGarment management system
Garment management system
Ā 
Sewing thread
Sewing threadSewing thread
Sewing thread
Ā 
Different types of button are used in garments
Different types of button are used in garmentsDifferent types of button are used in garments
Different types of button are used in garments
Ā 
Pe 6421 chapter 3 iso 9000 quality system oct 13 2014
Pe 6421 chapter 3  iso 9000 quality system oct 13  2014Pe 6421 chapter 3  iso 9000 quality system oct 13  2014
Pe 6421 chapter 3 iso 9000 quality system oct 13 2014
Ā 
Sewing thread and its types
Sewing thread and its typesSewing thread and its types
Sewing thread and its types
Ā 
QA QC
QA QCQA QC
QA QC
Ā 
Presentation for fit
Presentation for fitPresentation for fit
Presentation for fit
Ā 
APPAREL QUALITY STANDARD AND IMPLEMENTATION
APPAREL QUALITY STANDARD AND IMPLEMENTATIONAPPAREL QUALITY STANDARD AND IMPLEMENTATION
APPAREL QUALITY STANDARD AND IMPLEMENTATION
Ā 
trims and accesories quality processes
trims and accesories quality processestrims and accesories quality processes
trims and accesories quality processes
Ā 
2 quality assurance
2 quality assurance2 quality assurance
2 quality assurance
Ā 
Parametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use whichParametric vs Nonparametric Tests: When to use which
Parametric vs Nonparametric Tests: When to use which
Ā 
Garment costing
Garment costingGarment costing
Garment costing
Ā 

Similar to Effect of Score Standardisation on Topic Set Size Design

Topic Set Size Design with Variance Estimates from Two-Way ANOVA
Topic Set Size Design with Variance Estimates from Two-Way ANOVATopic Set Size Design with Variance Estimates from Two-Way ANOVA
Topic Set Size Design with Variance Estimates from Two-Way ANOVATetsuya Sakai
Ā 
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTopic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTetsuya Sakai
Ā 
A Validation of Object-Oriented Design Metrics as Quality Indicators
A Validation of Object-Oriented Design Metrics as Quality IndicatorsA Validation of Object-Oriented Design Metrics as Quality Indicators
A Validation of Object-Oriented Design Metrics as Quality Indicatorsvie_dels
Ā 
Software Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSung Kim
Ā 

Effect of Score Standardisation on Topic Set Size Design

  • 1. The Effect of Score Standardisation on Topic Set Size Design @tetsuyasakai Waseda University, Japan http://www.f.waseda.jp/tetsuya/sakai.html November 30, 2016@AIRS 2016, Beijing.
  • 2. TALK OUTLINE 1. Score standardisation 2. Topic set size design 3. NTCIR-12 tasks 4. Results 5. Conclusions 6. Future work: NTCIR WWW
  • 3. Hard topics, easy topics. [Bar chart: scores (0 to 1) of Systems 1-5 on Topic 1 and Topic 2; the two topic means are 0.12 and 0.70.]
  • 4. Low-variance topics, high-variance topics. [Bar chart: scores (0 to 1) of Systems 1-5 on Topic 1 and Topic 2; the two topic standard deviations are 0.08 and 0.29.]
  • 5. Score standardisation [Webber+08]. The raw topic-by-system score matrix (system i, topic j) is converted into a standardised matrix: for each topic, subtract the mean and divide by the standard deviation (the standardising factors). The standardised score for the i-th system and j-th topic answers the question: how good is i compared to “average”, in standard deviation units?
  • 6. Now for every topic, mean = 0, variance = 1. Comparisons across different topic sets and test collections are possible! [Bar chart: standardised scores (-2 to 2) of Systems 1-5 on Topic 1 and Topic 2.]
  • 7. Standardised scores have the (-∞, ∞) range and are not very convenient. Transform them back into the [0,1] range! [Bar chart: standardised scores (-2 to 2) of Systems 1-5 on Topic 1 and Topic 2.]
  • 8. std-CDF: use the cumulative distribution function of the standard normal distribution [Webber+08]. [Plot for TREC04: raw nDCG (x-axis, 0 to 1) against std-CDF nDCG (y-axis, 0 to 1); each curve is a topic, with 110 runs represented as dots.]
  • 9. std-CDF emphasises moderately high and moderately low performers. Is this a good thing? [Plot for TREC04: raw nDCG (x-axis) against std-CDF nDCG (y-axis), with the “moderately high” and “moderately low” regions marked.]
  • 10. std-AB: How about a simple linear transformation? [Sakai16ICTIR] [Plot for TREC04: raw nDCG (x-axis, 0 to 1) against std-CDF nDCG, std-AB nDCG (A=0.10) and std-AB nDCG (A=0.15).]
  • 11. std-AB with clipping, with the range [0,1]. The standardised score is multiplied by A and shifted by B, then clipped to [0,1]; clipping only affects EXTREMELY good/bad systems. Let B=0.5 (“average” system). Let A=0.15 so that at least 89% of scores fall within [0.05, 0.95] (Chebyshev’s inequality). This formula with (A,B) is used in educational research: A=100, B=500 for SAT, GRE [Lodico+10]; A=10, B=50 for Japanese hensachi “standard scores”.
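
The transformations on slides 5 to 11 can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function names, the use of the sample standard deviation, and the handling of a zero standard deviation are all assumptions made here, and the score matrix is hypothetical.

```python
import math
from statistics import mean, stdev


def standardise(raw):
    """Return per-topic z-scores for raw[i][j] = score of system i on topic j."""
    n_topics = len(raw[0])
    z = [[0.0] * n_topics for _ in raw]
    for j in range(n_topics):
        column = [row[j] for row in raw]
        # Standardising factors for topic j (sample standard deviation is an assumption).
        mu, sd = mean(column), stdev(column)
        for i, x in enumerate(column):
            # Assumption: a topic on which all systems tie gets z = 0.
            z[i][j] = (x - mu) / sd if sd > 0 else 0.0
    return z


def std_cdf(z):
    """Map a z-score to [0,1] with the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def std_ab(z, a=0.15, b=0.5):
    """Map a z-score to [0,1] linearly (A*z + B), clipping extreme values."""
    return min(1.0, max(0.0, a * z + b))


# Hypothetical 3-system x 2-topic matrix: topic 0 is hard, topic 1 is easy.
raw = [[0.10, 0.90],
       [0.12, 0.60],
       [0.14, 0.30]]
z = standardise(raw)
ab = [[std_ab(v) for v in row] for row in z]
cdf = [[std_cdf(v) for v in row] for row in z]
```

With A=0.15 and B=0.5, clipping only occurs for z-scores beyond roughly ±3.33, which is why it is expected to be rare.
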
  • 12. In practice, clipping does not happen often. [Two per-topic plots for TREC04 with topic IDs 1-49 on the x-axis: raw nDCG and std-AB nDCG, both on a 0 to 1 scale.]
  • 13. [Sakai16ICTIR] bottom line. Advantages of score standardisation: it removes topic hardness and enables comparison across test collections; normalisation becomes unnecessary. Advantages of std-AB over std-CDF: low within-system variances, and therefore substantially lower swap rates (higher consistency across different data) and realistic topic set sizes in topic set size design. Swap rates for std-CDF can be higher than those for raw scores, probably due to its nonlinear transformation. std-AB is a good alternative to std-CDF.
  • 15. Topic set size design (1) [Sakai16IRJ]. Provides answers to the following question: “I’m building a new test collection. How many topics should I create?” A prerequisite: a small topic-by-run score matrix based on pilot data, for estimating within-system variances. Three approaches (with easy-to-use Excel tools), based on [Nagata03]: (1) paired t-test power; (2) one-way ANOVA power; (3) confidence interval width upper bound.
  • 16. Topic set size design (2) [Sakai16IRJ]. Input required by each method:
      - Paired t-test: α (Type I error probability), β (Type II error probability), minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power), and a variance estimate for the score delta.
      - One-way ANOVA: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and an estimate of the within-system variance under the homoscedasticity assumption.
      - Confidence intervals: α (Type I error probability), δ (CI width upper bound: you want the CI for the diff between any system pair to be this much or smaller), and a variance estimate for the score delta.
  • 17. Topic set size design (3) [Sakai16IRJ]. Test collection designs should evolve based on past data. Start from a topic-by-run score matrix with pilot data (n0 topics, m runs); about 25 topics with runs from a few teams are probably sufficient [Sakai+16EVIA]. Estimate n1 based on the within-system variance estimate and build TREC 201X with n1 topics. Then estimate n2 based on the within-system variance estimate from that data (a more accurate estimate) and build TREC 201(X+1) with n2 topics.
  • 18. Topic set size design (4) [Sakai16IRJ]. ANOVA-based results for m=10 can be used instead of CI-based results. ANOVA-based results for m=2 can be used instead of t-test-based results. In practice, you can deduce t-test-based and CI-based results from ANOVA-based results. Caveat: the ANOVA-based tool can only handle (α, β)=(0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).
  • 19. Topic set size design with one-way ANOVA (1). Input required: α (Type I error probability), β (Type II error probability), m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and an estimate of the within-system variance under the homoscedasticity assumption. Example situation: You plan to compare m systems with one-way ANOVA with α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the diff between the best and the worst systems >= minD. You know the approximate within-system variance for nDCG from pilot data. What is the required number of topics n? [Diagram: m systems between best and worst, with minD <= D.]
  • 20. Topic set size design with one-way ANOVA (2). http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx will do this for you! Use the appropriate sheet for a given (α, β) and fill out the orange cells. In the slide’s example, n=20 is what you want!
  • 21. Estimating the variance (1). We need a within-system variance estimate for topic set size design based on one-way ANOVA, and a variance estimate for the score delta for that based on the paired t-test or CI. From a pilot topic-by-run score matrix, obtain the variance estimate as a by-product of one-way ANOVA (use two-way ANOVA without replication for tighter estimates). Then, if possible, pool multiple estimates to enhance accuracy. (Multiple data were not available in this study.)
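
A rough sketch of the estimation step on slide 21 follows. It assumes that the by-product of one-way ANOVA meant here is the residual mean square with system as the factor, and that pooling weights each estimate by its number of topics minus one; the slide's own formulas did not survive in the transcript, so both choices, the function names, and the data are assumptions.

```python
from statistics import mean


def within_system_variance(matrix):
    """Residual mean square of one-way ANOVA with system as the factor.

    matrix[j][i] is the score of run i on topic j (topic-by-run layout).
    """
    n_topics, n_runs = len(matrix), len(matrix[0])
    residual_ss = 0.0
    for i in range(n_runs):
        scores = [matrix[j][i] for j in range(n_topics)]
        run_mean = mean(scores)
        # Squared deviations of each topic score from the run's own mean.
        residual_ss += sum((x - run_mean) ** 2 for x in scores)
    return residual_ss / (n_runs * (n_topics - 1))


def pooled_variance(estimates):
    """Pool (n_topics, variance) pairs, weighting by n_topics - 1 (assumed weights)."""
    numerator = sum((n - 1) * v for n, v in estimates)
    denominator = sum(n - 1 for n, _ in estimates)
    return numerator / denominator


# Hypothetical pilot data: 3 topics x 2 runs.
pilot = [[0.2, 0.4],
         [0.4, 0.8],
         [0.6, 0.6]]
var_pilot = within_system_variance(pilot)
var_pooled = pooled_variance([(3, var_pilot), (5, 0.02)])
```
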
  • 23. Variances obtained from NTCIR-12 tasks. [Table of variance estimates per data set, with columns mC and nC.] Variances are substantially smaller after applying std-AB. Unnormalised measures can be handled without any problems.
  • 24. Why the variances are smaller after applying std-AB. The initial estimate of n with the one-way ANOVA topic set size design is given by a formula from [Nagata03] [equation], in which λ is the noncentrality parameter of a noncentral chi-square distribution, approximated for (α, β)=(0.05, 0.20) by [equation]. So n will be small if the within-system variance estimate is small. With std-AB, that estimate is indeed small because A is small (e.g. 0.15) and it can be shown that [equation].
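
The argument on slide 24 can be illustrated numerically. The sketch below relies on two things: the exact fact that a linear map A*z + B scales a variance by A², and an assumed form of the initial estimate, n ≈ 2 · variance · λ / minD², which is a commonly cited form and not taken from the transcript. The value of λ is left to the caller; the number used below is purely hypothetical, as are the z-scores.

```python
from statistics import variance


def initial_n(var_estimate, min_d, lam):
    """Assumed initial estimate n = 2 * variance * lambda / minD^2 (not rounded)."""
    return 2.0 * var_estimate * lam / (min_d ** 2)


# Hypothetical standardised scores of one system over five topics.
z_scores = [-1.2, -0.4, 0.1, 0.5, 1.0]
A, B = 0.15, 0.5
ab_scores = [A * z + B for z in z_scores]  # none of these values needs clipping

var_z = variance(z_scores)
var_ab = variance(ab_scores)  # equals A**2 * var_z up to rounding error

# With everything else fixed, n shrinks by the same factor A**2.
lam = 10.0  # hypothetical noncentrality parameter
ratio = initial_n(var_ab, 0.10, lam) / initial_n(var_z, 0.10, lam)
```

Because n is proportional to the variance estimate in this form, a small A translates directly into a small required topic set size for a given minD.
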
  • 25. System rankings before and after applying std-AB. [Table of ranking comparisons per data set, with columns mC and nC.] System rankings before and after applying std-AB are statistically equivalent. std-AB enables cross-collection comparisons without affecting within-collection comparisons!
  • 26. MedNLPDoc (1) [Aramaki+16] https://sites.google.com/site/mednlpdoc/ INPUT: a medical record. OUTPUT: ICD (international classification of diseases) codes of possible disease names. MEASURES: precision and recall of ICDs. [Score matrices: precision, 14 runs and 78 topics; recall, 14 runs and 76 topics.]
  • 27. MedNLPDoc (2) [Aramaki+16] https://sites.google.com/site/mednlpdoc/ 76 topics. Raw recall: lots of 0’s, some 1’s. std-AB recall: no 0’s, fewer 1’s. [Two histograms of counts by score range, one for raw recall and one for std-AB recall.]
  • 28. MobileClick-2 iUnit ranking (1) [Kato+16] http://mobileclick.org/ INPUT: iUnits (relevant nuggets for a mobile search summary). OUTPUT: iUnits ranked by relevance. MEASURES: nDCG [Jarvelin+02] $= \sum_{r=1}^{l} g(r)/\log(r+1) \,/\, \sum_{r=1}^{l} g^*(r)/\log(r+1)$ and Q-measure [Sakai05AIRS04] $= (1/R) \sum_{r=1}^{l} I(r)\,BR(r)$, where $BR(r) = \left(\sum_{k=1}^{r} I(k) + \beta \sum_{k=1}^{r} g(k)\right) / \left(r + \beta \sum_{k=1}^{r} g^*(k)\right)$. Here $g^*(r)$ is the gain at r in an ideal list, and $I(r)$ is 1 if relevant, 0 otherwise.
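
The two measures on slide 28 can be computed as sketched below. The function names are hypothetical, an item is treated as relevant when its gain is positive, R is taken to be the total number of relevant items, and β defaults to 1; the slide does not define R or fix β, so these are assumptions.

```python
import math


def ndcg(gains, ideal_gains, cutoff):
    """nDCG at a cutoff l, using gain / log(r + 1) discounting as on the slide."""
    def dcg(g):
        depth = min(cutoff, len(g))
        return sum(g[r - 1] / math.log(r + 1) for r in range(1, depth + 1))
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


def q_measure(gains, ideal_gains, num_relevant, cutoff, beta=1.0):
    """Q-measure: average of the blended ratio BR(r) over relevant ranks."""
    cum_rel = cum_gain = cum_ideal = 0.0
    total = 0.0
    for r in range(1, cutoff + 1):
        g = gains[r - 1] if r <= len(gains) else 0.0
        g_star = ideal_gains[r - 1] if r <= len(ideal_gains) else 0.0
        relevant = 1 if g > 0 else 0  # I(r): assumption that positive gain means relevant
        cum_rel += relevant
        cum_gain += g
        cum_ideal += g_star
        if relevant:
            total += (cum_rel + beta * cum_gain) / (r + beta * cum_ideal)
    return total / num_relevant


# Hypothetical ranked list: the only relevant item (gain 3) is at rank 2.
system_gains = [0.0, 3.0]
ideal = [3.0, 0.0]
score_ndcg = ndcg(system_gains, ideal, cutoff=2)
score_q = q_measure(system_gains, ideal, num_relevant=1, cutoff=2)
```

The base of the logarithm cancels in the nDCG ratio, so the natural logarithm is used here for convenience.
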
  • 29. MobileClick-2 iUnit ranking (2) [Kato+16] http://mobileclick.org/ Raw nDCG: hard topics, easy topics. std-AB nDCG: topics look more comparable to one another. [Two plots, one for raw nDCG and one for std-AB nDCG.]
  • 30. MobileClick-2 iUnit summarisation (1) [Kato+16] http://mobileclick.org/ INPUT: iUnits (relevant nuggets for a mobile search summary). OUTPUT: two-layered textual summary. MEASURES: M-measure, a variant of the intent-aware U-measure [Sakai+13SIGIR]. M-measure is an unnormalised measure: it does not have the [0,1] range. (Intent-aware measures are difficult to normalise.) [Kato+16]
  • 31. MobileClick-2 iUnit summarisation (2) [Kato+16] http://mobileclick.org/ Raw M-measure: unnormalised, unbounded, extremely large variances; topics definitely not comparable (note the different scale of the y axis). std-AB M-measure: no problem! [Two plots; the raw M-measure scale reaches 40-45, the std-AB scale reaches 0.9-1.0.] Clearly violates i.i.d.
  • 32. STC (short text conversation) (1) [Shang+16] http://ntcir12.noahlab.com.hk/stc.htm INPUT: a Weibo post (Chinese tweet). OUTPUT: a ranked list of Weibo posts from a repository that serve as valid responses to the input. MEASURES: nG@1 (normalised gain at 1, a.k.a. “nDCG@1”); nERR@10 [Chapelle11]; P+ [Sakai06AIRS], a variant of Q-measure.
  • 33. STC (short text conversation) (2) [Shang+16] http://ntcir12.noahlab.com.hk/stc.htm Raw P+: lots of 1’s and 0’s; gap in the [0.625, 1] range (see previous slide). std-AB P+: looks like a continuous measure! Fewer 1’s, no 0’s. [Per-topic plots over topics 1-100 on a 0 to 1 scale, and histograms of counts, for raw P+ and std-AB P+.]
  • 34. STC (short text conversation) (3) [Shang+16] http://ntcir12.noahlab.com.hk/stc.htm Raw nG@1: 0 or 1/3 or 1! std-AB nG@1: looks like a continuous measure! Fewer 1’s, no 0’s. [Two histograms of counts, one for raw nG@1 and one for std-AB nG@1.]
  • 35. QALab-2 (1) [Shibuki+16] http://research.nii.ac.jp/qalab/ INPUT: a multiple-choice Japanese National Center Test (university entrance exam) question on world history. OUTPUT: choice deemed correct by system. MEASURES: Boolean: 1 (correct) or 0 (incorrect).
  • 36. QALab-2 (2) [Shibuki+16] http://research.nii.ac.jp/qalab/ 36 topics. Raw Boolean: 0 or 1! std-AB Boolean: two distinct ranges of values, [0.2999, 0.4460] and [0.6091, 0.9047]. Normal assumption still clearly violated: our topic set size design results should be interpreted as those for normally-distributed measures that happen to have variances similar to Raw/std-AB Boolean. QALab-2 organisers sorted the topics by #systems_correctly_answered before providing the matrices to the present author. [Two histograms of counts, one for raw Boolean and one for std-AB Boolean.]
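
Why std-AB Boolean scores fall into two separate ranges can be seen from a small sketch: within one topic there are only two raw values, so there are only two standardised values, one below 0.5 (incorrect) and one above 0.5 (correct). The helper name, the sample standard deviation, the tie handling, and the data are assumptions for illustration; the code does not attempt to reproduce the ranges reported on slide 36.

```python
from statistics import mean, stdev


def std_ab_topic(scores, a=0.15, b=0.5):
    """Apply std-AB to the scores that all systems obtained on a single topic."""
    mu, sd = mean(scores), stdev(scores)
    if sd == 0:
        # Assumption: if every system ties, every system is "average".
        return [b] * len(scores)
    return [min(1.0, max(0.0, a * (x - mu) / sd + b)) for x in scores]


# Hypothetical Boolean scores of five systems on an easy and a hard question.
easy_topic = [1, 1, 1, 1, 0]
hard_topic = [0, 0, 0, 0, 1]

easy_ab = std_ab_topic(easy_topic)
hard_ab = std_ab_topic(hard_topic)

# Each topic yields exactly two distinct std-AB values.
easy_values = sorted(set(round(v, 9) for v in easy_ab))
hard_values = sorted(set(round(v, 9) for v in hard_ab))
```

A correct answer on the hard question earns a higher std-AB score than a correct answer on the easy one, which is how the two ranges spread out across topics.
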
  • 38. A few recommendations for MedNLPDoc (1). With raw recall: create 100 topics to guarantee 80% power for minD=0.10 for m=2 systems, or minD=0.20 for m=50 systems. MedNLPDoc had 76-78 topics at NTCIR-12.
  • 39. A few recommendations for MedNLPDoc (2). With std-AB recall: create 80 topics to guarantee 80% power for minD=0.05 for m=2 systems, or minD=0.10 for m=50 systems. MedNLPDoc had 76-78 topics at NTCIR-12. Topic set size choices look much more practical when std-AB is used (due to low variance).
  • 40. A few recommendations for MobileClick-2 (1). MobileClick-2 had 100 topics at NTCIR-12. Topic set size needs to be set by considering both subtasks, but raw M-measure cannot be handled due to extremely large variance. If we only consider iUnit ranking with raw nDCG@3: create 90 topics to guarantee 80% power for minD=0.10 for m=10 English systems, and minD=0.10 for m=2 Japanese systems.
  • 41. A few recommendations for MobileClick-2 (2). MobileClick-2 had 100 topics at NTCIR-12. With std-AB nDCG@3 and std-AB M-measure: create 100 topics to guarantee 80% power for minD=0.10 for m=20 English and m=30 Japanese iUnit ranking systems, and minD=0.05 for m=10 English and m=10 Japanese iUnit summarisation systems.
  • 42. A few recommendations for STC (1). With (a normally distributed measure whose variance is similar to that of) raw nG@1: create 120 topics to guarantee 80% power for minD=0.20 for m=20 systems. STC had 100 topics at NTCIR-12.
  • 43. A few recommendations for STC (2). STC had 100 topics at NTCIR-12. With std-AB nG@1: create 100 topics to guarantee 80% power for minD=0.10 for m=30 systems. Topic set size choices look much more practical when std-AB is used (due to low variance).
  • 44. A few recommendations for QALab-2 (1). QALab-2 had 36-41 topics at NTCIR-12: not sufficient from the viewpoint of power. With (a normally distributed measure whose variance is similar to that of) raw Boolean: create 90 topics to guarantee 80% power for minD=0.20 for m=2 systems.
  • 45. A few recommendations for QALab-2 (2). QALab-2 had 36-41 topics at NTCIR-12. With (a normally distributed measure whose variance is similar to that of) std-AB Boolean: create 40 topics to guarantee 80% power for minD=0.10 for m=2 systems, or minD=0.20 for m=50 systems. Topic set size choices look much more practical when std-AB is used (due to low variance).
  • 47. Conclusions • std-AB suppresses score variances and thereby enables test collection builders to consider realistic choices of topic set sizes. • Topic set size design with std-AB can handle even unnormalised measures such as M-measure (U-measure, TBG, alpha-nDCG, ERR-IA etc.). • Even discrete measures such as nG@1 (0 or 1/3 or 1) look more continuous after applying std-AB, which makes the topic set size design results (based on normality and i.i.d. assumptions) perhaps a little more believable. • Test collection designs should evolve based on experience (i.e. variances pooled from past data).
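To illustrate why a smaller variance leads to a more practical topic set size, here is a rough Python sketch. It is not the topic set size design procedure used in this talk: it swaps in the textbook normal-approximation sample size formula for a paired comparison of m=2 systems. The function name `approx_topic_set_size`, the assumption that the variance of the score difference is twice the per-system variance, and the variance values in the usage lines are all hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_topic_set_size(sigma2, min_d, alpha=0.05, power=0.80):
    """Approximate number of topics for comparing m=2 systems.

    sigma2 : assumed per-system score variance (hypothetical input)
    min_d  : minimum detectable difference between the two system means
    Uses the normal approximation n = (z_{1-alpha/2} + z_{power})^2 * var_diff / min_d^2,
    with var_diff = 2 * sigma2 (an assumption of this sketch).
    """
    if sigma2 <= 0 or min_d <= 0:
        raise ValueError("sigma2 and min_d must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    var_diff = 2.0 * sigma2
    return ceil((z_alpha + z_power) ** 2 * var_diff / min_d ** 2)


# Hypothetical variances: a high-variance raw measure vs. a low-variance one.
n_high_variance = approx_topic_set_size(sigma2=0.04, min_d=0.05)
n_low_variance = approx_topic_set_size(sigma2=0.01, min_d=0.05)
print(n_high_variance, n_low_variance)
```

Because the required size grows linearly with the variance in this approximation, quartering the variance quarters the number of topics needed for the same minD and power.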
  • 48. TALK OUTLINE 1. Score standardisation 2. Topic set size design 3. NTCIR-12 tasks 4. Results 5. Conclusions 6. Future work: NTCIR WWW
  • 49. How long will the standardisation factors for each topic remain valid? [Diagram: a raw score matrix and a standardised (std) score matrix over topics and systems; each cell is the score for the i-th system and j-th topic.] Standardising factors: subtract the mean; divide by the standard deviation. How good is i compared to “average” in standard deviation units? These systems will eventually become outdated, right?
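The standardisation step on this slide can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names are hypothetical, the use of the sample standard deviation and of B=0.5 are assumptions, and mapping a zero-variance topic to B is a choice made only for this sketch. A=0.15 is one of the values shown earlier in the talk.

```python
from statistics import mean, stdev


def standardising_factors(raw_scores):
    """Per-topic (mean, standard deviation) over the systems that were run.

    raw_scores maps a topic id to the list of raw scores, one per system.
    Each topic needs at least two scores for the sample standard deviation.
    """
    return {topic: (mean(scores), stdev(scores))
            for topic, scores in raw_scores.items()}


def std_ab(raw, topic_mean, topic_sd, a=0.15, b=0.5):
    """Linear transformation A*z + B of the standardised score z, clipped to [0, 1]."""
    if topic_sd == 0:
        return b  # assumption: every system scored the same, so all are "average"
    z = (raw - topic_mean) / topic_sd
    return min(1.0, max(0.0, a * z + b))


def standardise(raw_scores, factors, a=0.15, b=0.5):
    """Apply std-AB to every score using previously computed factors."""
    return {topic: [std_ab(s, *factors[topic], a, b) for s in scores]
            for topic, scores in raw_scores.items()}


# Hypothetical raw scores for three systems on an easy and a hard topic.
raw = {"easy": [0.60, 0.70, 0.80], "hard": [0.05, 0.10, 0.30]}
factors = standardising_factors(raw)
print(standardise(raw, factors))
```

Keeping the factors separate from the transformation mirrors the question on the slide: the factors are computed once from a given set of systems, and any later run is scored against them.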
  • 50. We Want Web@NTCIR-13 (1) http://www.thuir.cn/ntcirwww/ [Diagram: NTCIR-13 (Dec 2017) with the frozen topic set, the NTCIR-13 fresh topic set, and the NTCIR-13 systems.] New runs pooled for frozen + fresh topics.
  • 51. We Want Web@NTCIR-13 (2) http://www.thuir.cn/ntcirwww/ [Diagram: NTCIR-13 (Dec 2017) with the frozen topic set, the NTCIR-13 fresh topic set, and the NTCIR-13 systems.] Official NTCIR-13 results discussed with the fresh topics. Qrels + std. factors based on NTCIR-13 systems NOT released. Qrels + std. factors based on NTCIR-13 systems released.
  • 52. We Want Web@NTCIR-14 (1) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019); the frozen topic set appears in each round, alongside the NTCIR-13 and NTCIR-14 fresh topic sets and the NTCIR-13 and NTCIR-14 systems.] New runs pooled for frozen + fresh topics. Revived runs pooled for fresh topics.
  • 53. We Want Web@NTCIR-14 (2) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017) and NTCIR-14 (Jun 2019); the frozen topic set appears in each round, alongside the NTCIR-13 and NTCIR-14 fresh topic sets and the NTCIR-13 and NTCIR-14 systems.] Official NTCIR-14 results discussed with the fresh topics. Qrels + std. factors based on NTCIR-13+14 systems NOT released. Qrels + std. factors based on NTCIR-(13+)14 systems released. Using the NTCIR-14 fresh topics, compare new NTCIR-14 runs with revived runs and quantify progress.
  • 54. We Want Web@NTCIR-15 (1) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020); the frozen topic set appears in each round, alongside the NTCIR-13, NTCIR-14 and NTCIR-15 fresh topic sets and the NTCIR-13, NTCIR-14 and NTCIR-15 systems.] New runs pooled for frozen + fresh topics. Revived runs pooled for fresh topics.
  • 55. We Want Web@NTCIR-15 (2) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020); the frozen topic set appears in each round, alongside the NTCIR-13, NTCIR-14 and NTCIR-15 fresh topic sets and the NTCIR-13, NTCIR-14 and NTCIR-15 systems.] Official NTCIR-15 results discussed with the fresh topics. Qrels + std. factors based on NTCIR-(13+14+)15 systems released. Using the NTCIR-15 fresh topics, compare new NTCIR-15 runs with revived runs and quantify progress.
  • 56. We Want Web@NTCIR-15 (3) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020); the frozen topic set appears in each round, alongside the NTCIR-13, NTCIR-14 and NTCIR-15 fresh topic sets and the NTCIR-13, NTCIR-14 and NTCIR-15 systems.] Official NTCIR-15 results discussed with the fresh topics. Qrels + std. factors based on NTCIR-13 systems released. Qrels + std. factors based on NTCIR-13+14 systems released. Qrels + std. factors based on NTCIR-13+14+15 systems released. Qrels + std. factors based on NTCIR-(13+14+)15 systems released. How do the standardisation factors for each frozen topic differ across the 3 rounds?
  • 57. We Want Web@NTCIR-15 (4) http://www.thuir.cn/ntcirwww/ [Diagram: rounds NTCIR-13 (Dec 2017), NTCIR-14 (Jun 2019) and NTCIR-15 (Dec 2020); the frozen topic set appears in each round, alongside the NTCIR-13, NTCIR-14 and NTCIR-15 fresh topic sets and the NTCIR-13, NTCIR-14 and NTCIR-15 systems; an NTCIR-15 systems ranking is shown for each of the three rounds.] Official NTCIR-15 results discussed with the fresh topics. Qrels + std. factors based on NTCIR-13 systems released. Qrels + std. factors based on NTCIR-13+14 systems released. Qrels + std. factors based on NTCIR-13+14+15 systems released. Qrels + std. factors based on NTCIR-(13+14+)15 systems released. How do the NTCIR-15 system rankings differ across the 3 rounds, with and without standardisation?
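One way to quantify how much two system rankings differ is Kendall's tau. The slides do not say which statistic will be used, so the Python sketch below is only an assumed option; the function name, the system names and the scores are hypothetical. It computes tau-a from two dictionaries of mean scores for the same systems, for example the mean scores obtained with the standardisation factors from two different rounds.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between the system rankings implied by two score tables.

    scores_a and scores_b map the same system ids to mean scores.
    Tied pairs count as neither concordant nor discordant.
    """
    systems = sorted(scores_a)
    if set(systems) != set(scores_b):
        raise ValueError("both score tables must cover the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # The sign of the product says whether both tables order s and t the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical mean scores for three systems under two sets of standardising factors.
round_13_factors = {"sysA": 0.48, "sysB": 0.55, "sysC": 0.51}
round_15_factors = {"sysA": 0.47, "sysB": 0.52, "sysC": 0.54}
print(kendall_tau(round_13_factors, round_15_factors))
```

A value of 1 means the two rankings agree on every pair of systems, and -1 means one ranking is the exact reverse of the other.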
  • 58. See you all in Tokyo, in August/December 2017!
  • 59. Selected references (1) [Aramaki+16] Aramaki et al.: Overview of the NTCIR-12 MedNLPDoc task, NTCIR-12 Proceedings, 2016. [Carterette+08] Carterette et al.: Evaluation over Thousands of Queries, SIGIR 2008. [Chapelle+11] Chapelle et al.: Intent-based Diversification of Web Search Results: Metrics and Algorithms, Information Retrieval 14(6), 2011. [Jarvelin+02] Järvelin and Kekäläinen: Cumulated Gain-based Evaluation of IR Techniques, ACM TOIS 20(4), 2002. [Gilbert+79] Gilbert and Sparck Jones: Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1979. [Kato+16] Kato et al.: Overview of the NTCIR-12 MobileClick task, NTCIR-12 Proceedings, 2016. [Nagata03] Nagata: How to Design the Sample Size (in Japanese), Asakura Shoten, 2003.
  • 60. Selected references (2) [Sakai05AIRS04] Sakai: Ranking the NTCIR Systems based on Multigrade Relevance, AIRS 2004 (LNCS 3411), 2005. [Sakai06AIRS] Sakai: Bootstrap-based Comparisons of IR Metrics for Finding One Relevant Document, AIRS 2006 (LNCS 4182). [Sakai+13SIGIR] Sakai and Dou: Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation, SIGIR 2013. [Sakai16ICTIR] Sakai: A simple and effective approach to score standardisation, ICTIR 2016. [Sakai16ICTIRtutorial] Sakai: Topic set size design and power analysis in practice (tutorial), ICTIR 2016. [Sakai16IRJ] Sakai: Topic set size design, Information Retrieval, 19(3), 2016. OPEN ACCESS: http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf [Sakai+16EVIA] Sakai and Shang: On Estimating Variances for Topic Set Size Design, EVIA 2016. http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings12/pdf/evia/02-EVIA2016-SakaiT.pdf
  • 61. Selected references (3) [Shang+16] Shang et al.: Overview of the NTCIR-12 short text conversation task, NTCIR-12 Proceedings, 2016. [Shibuki+16] Shibuki et al.: Overview of the NTCIR-12 QA Lab-2 task, NTCIR-12 Proceedings, 2016. [SparckJones+75] Sparck Jones and Van Rijsbergen: Report on the Need for and Provision of an ‘Ideal’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1975. [Voorhees+05] Voorhees and Harman: TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005. [Voorhees09] Voorhees: Topic Set Size Redux, SIGIR 2009. [Webber+08SIGIR] Webber, Moffat, Zobel: Score standardisation for inter-collection comparison of retrieval systems, SIGIR 2008. [Webber+08CIKM] Webber, Moffat, Zobel: Statistical power in retrieval experimentation, CIKM 2008.