Sakai, T.: The Effect of Inter-Assessor Disagreement on IR System Evaluation: A Case Study with Lancers and Students. Proceedings of EVIA 2017, pp. 31-38, 2017.
http://ceur-ws.org/Vol-2008/paper_5.pdf
9. Comparison with [Voorhees00]
• TREC news documents vs ClueWeb12
• (topic originator + 2 additional judges) vs (2 lancers + 1 student)
• binary vs graded relevance
• overlap, recall, precision vs linear weighted kappa with 95%CIs + statistical significance (a kappa sketch follows this slide)
“The actual value of the effectiveness measure was affected by the different conditions, but in each case the relative performance of the retrieved runs was almost always the same. These results validate the use of the TREC test collections for comparative retrieval experiments.”
Different qrels, very similar rankings
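The pairwise agreement statistic used above can be interval-estimated in several ways; below is a minimal Python sketch of linear weighted kappa with a percentile-bootstrap 95%CI, using scikit-learn's cohen_kappa_score. The bootstrap is an assumption for illustration (the paper's exact CI method may differ), and the label arrays are hypothetical.

```python
# Minimal sketch (not the paper's code): linear weighted kappa between two
# assessors' graded relevance labels, with a percentile-bootstrap 95%CI
# over the pooled documents. The paper's exact CI method may differ.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def weighted_kappa_ci(labels_a, labels_b, n_boot=2000, seed=0):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    kappa = cohen_kappa_score(labels_a, labels_b, weights="linear")
    rng = np.random.default_rng(seed)
    n = len(labels_a)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample documents with replacement
        boots.append(cohen_kappa_score(labels_a[idx], labels_b[idx],
                                       weights="linear"))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return kappa, (lo, hi)

# Hypothetical usage with graded labels (e.g. 0-3) from two assessors:
# kappa, (lo, hi) = weighted_kappa_ci(lancer1_labels, lancer2_labels)
# agreement is "above chance" at the 5% level if lo > 0
```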
10. Comparison with [Maddalena+17]
• Five assessments vs three (lancer, lancer, student)
• Krippendorff’s α (which disregards which assessments come from which assessors [Sakai19CLEF@20]) vs linear weighted kappa for every assessor pair (since we’re interested in lancer-lancer vs lancer-student agreements; see the contrast sketched below)
• absolute measure values and system rankings vs absolute measure values, system rankings, and statistical significance
• nDCG vs nDCG, Q-measure, nERR (official WWW measures)
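To make the α-vs-kappa contrast concrete: Krippendorff's α pools all assessments into one raters-by-documents reliability matrix, so assessor identity is lost, whereas per-pair kappa keeps it. A sketch assuming the third-party krippendorff package and reusing the weighted_kappa_ci helper above; the label arrays are hypothetical.

```python
# Sketch: one pooled Krippendorff's alpha vs a linear weighted kappa per
# assessor pair. Assumes the third-party `krippendorff` package; the label
# arrays are hypothetical placeholders for the three assessors' judgements.
from itertools import combinations
import numpy as np
import krippendorff

assessors = {
    "lancer1": lancer1_labels,   # one graded label per pooled document
    "lancer2": lancer2_labels,
    "student": student_labels,
}

# Alpha sees only a raters-by-documents matrix: it disregards which
# assessments come from which assessors.
matrix = np.array(list(assessors.values()), dtype=float)
alpha = krippendorff.alpha(reliability_data=matrix,
                           level_of_measurement="ordinal")

# Per-pair kappa keeps assessor identity, so lancer-lancer vs
# lancer-student agreement can be compared directly.
for (name_a, a), (name_b, b) in combinations(assessors.items(), 2):
    kappa, (lo, hi) = weighted_kappa_ci(a, b)
    print(f"{name_a} vs {name_b}: kappa={kappa:.3f}, 95%CI=[{lo:.3f}, {hi:.3f}]")
```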
25. High- and low-agreement topics
[Figure: mean (min/max) per-topic linear weighted kappa]
• 27 “high-agreement” topics: for every assessor pair, the 95%CI for kappa was above zero
• 23 “low-agreement” topics: for at least one assessor pair, the 95%CI for kappa included zero
Maybe test collection builders can consider removing some low-agreement (i.e., unreliable) topics beforehand?
Let’s compare the evaluation outcomes of the full/high-agreement/low-agreement sets (using “all3”); a sketch of the CI-based split follows below.
Similar topic set sizes
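One way to operationalize the split above: a topic is "high-agreement" iff the kappa 95%CI lies above zero for every assessor pair on that topic. A minimal sketch reusing weighted_kappa_ci; per_topic_labels is a hypothetical structure, not the paper's code.

```python
# Sketch of the high-/low-agreement split: a topic counts as high-agreement
# iff, for EVERY assessor pair, the 95%CI for linear weighted kappa lies
# above zero. `per_topic_labels` is a hypothetical mapping:
#   topic_id -> {assessor_name -> labels over that topic's pooled documents}
from itertools import combinations

def split_topics(per_topic_labels):
    high, low = [], []
    for topic, labels in per_topic_labels.items():
        lower_bounds = [weighted_kappa_ci(labels[a], labels[b])[1][0]
                        for a, b in combinations(labels, 2)]
        (high if all(lb > 0 for lb in lower_bounds) else low).append(topic)
    return high, low

# high, low = split_topics(per_topic_labels)
# Evaluation outcomes (e.g. nDCG-based system rankings) can then be
# compared on the full, high-agreement, and low-agreement topic sets.
```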