Sakai, T.: The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students. Proceedings of EVIA 2017, pp. 31-38, 2017.
http://ceur-ws.org/Vol-2008/paper_5.pdf
9. Comparison with [Voorhees00]
• TREC news documents vs ClueWeb12
• (topic originator + 2 additional judges) vs (2 lancers + 1 student)
• binary vs graded relevance
• overlap, recall, precision vs linear weighted kappa with 95%CIs + statistical significance (a kappa sketch follows this slide)
“The actual value of the effectiveness measure was affected by the different conditions, but in each case the relative performance of the retrieved runs was almost always the same. These results validate the use of the TREC test collections for comparative retrieval experiments.”
Different qrels, very similar rankings
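The pairwise agreement statistic used above can be interval-estimated in several ways; below is a minimal Python sketch of linear weighted kappa with a percentile-bootstrap 95%CI, using scikit-learn's cohen_kappa_score. The bootstrap is an assumption for illustration (the paper's exact CI method may differ), and the label arrays are hypothetical.

```python
# Minimal sketch (not the paper's code): linear weighted kappa between two
# assessors' graded relevance labels, with a percentile-bootstrap 95%CI
# over the pooled documents. The paper's exact CI method may differ.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def weighted_kappa_ci(labels_a, labels_b, n_boot=2000, seed=0):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    kappa = cohen_kappa_score(labels_a, labels_b, weights="linear")
    rng = np.random.default_rng(seed)
    n = len(labels_a)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample documents with replacement
        boots.append(cohen_kappa_score(labels_a[idx], labels_b[idx],
                                       weights="linear"))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return kappa, (lo, hi)

# Hypothetical usage with graded labels (e.g. 0-3) from two assessors:
# kappa, (lo, hi) = weighted_kappa_ci(lancer1_labels, lancer2_labels)
# agreement is "above chance" at the 5% level if lo > 0
```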
10. Comparison with [Maddalena+17]
• Five assessments vs three (lancer, lancer, student)
• Krippendorff’s α (which disregards which assessments come from which assessors [Sakai19CLEF@20]) vs linear weighted kappa for every assessor pair (since we’re interested in lancer-lancer vs lancer-student agreements; see the contrast sketched below)
• absolute measure values and system rankings vs absolute measure values, system rankings, and statistical significance
• nDCG vs nDCG, Q-measure, nERR (official WWW measures)
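To make the α-vs-kappa contrast concrete: Krippendorff's α pools all assessments into one raters-by-documents reliability matrix, so assessor identity is lost, whereas per-pair kappa keeps it. A sketch assuming the third-party krippendorff package and reusing the weighted_kappa_ci helper above; the label arrays are hypothetical.

```python
# Sketch: one pooled Krippendorff's alpha vs a linear weighted kappa per
# assessor pair. Assumes the third-party `krippendorff` package; the label
# arrays are hypothetical placeholders for the three assessors' judgements.
from itertools import combinations
import numpy as np
import krippendorff

assessors = {
    "lancer1": lancer1_labels,   # one graded label per pooled document
    "lancer2": lancer2_labels,
    "student": student_labels,
}

# Alpha sees only a raters-by-documents matrix: it disregards which
# assessments come from which assessors.
matrix = np.array(list(assessors.values()), dtype=float)
alpha = krippendorff.alpha(reliability_data=matrix,
                           level_of_measurement="ordinal")

# Per-pair kappa keeps assessor identity, so lancer-lancer vs
# lancer-student agreement can be compared directly.
for (name_a, a), (name_b, b) in combinations(assessors.items(), 2):
    kappa, (lo, hi) = weighted_kappa_ci(a, b)
    print(f"{name_a} vs {name_b}: kappa={kappa:.3f}, 95%CI=[{lo:.3f}, {hi:.3f}]")
```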
25. High- and low-agreement topics
[Figure: mean (min/max) per-topic linear weighted kappa]
• 27 “high-agreement” topics: for every assessor pair, the 95%CI for kappa was above zero
• 23 “low-agreement” topics: for at least one assessor pair, the 95%CI for kappa included zero
Maybe test collection builders can consider removing some low-agreement (i.e., unreliable) topics beforehand?
Let’s compare the evaluation outcomes of the full/high-agreement/low-agreement sets (using “all3”); a sketch of the CI-based split follows below.
Similar topic set sizes
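One way to operationalize the split above: a topic is "high-agreement" iff the kappa 95%CI lies above zero for every assessor pair on that topic. A minimal sketch reusing weighted_kappa_ci; per_topic_labels is a hypothetical structure, not the paper's code.

```python
# Sketch of the high-/low-agreement split: a topic counts as high-agreement
# iff, for EVERY assessor pair, the 95%CI for linear weighted kappa lies
# above zero. `per_topic_labels` is a hypothetical mapping:
#   topic_id -> {assessor_name -> labels over that topic's pooled documents}
from itertools import combinations

def split_topics(per_topic_labels):
    high, low = [], []
    for topic, labels in per_topic_labels.items():
        lower_bounds = [weighted_kappa_ci(labels[a], labels[b])[1][0]
                        for a, b in combinations(labels, 2)]
        (high if all(lb > 0 for lb in lower_bounds) else low).append(topic)
    return high, low

# high, low = split_topics(per_topic_labels)
# Evaluation outcomes (e.g. nDCG-based system rankings) can then be
# compared on the full, high-agreement, and low-agreement topic sets.
```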