This document summarizes a presentation about the effect of tie-breaking bias on information retrieval evaluation. It discusses how the tie-breaking applied when evaluating runs submitted to information retrieval tasks can introduce bias, presents alternative reordering strategies to address this issue, and reports experiments on data from past TREC tasks measuring the impact of tie-breaking on reciprocal rank, average precision, and mean average precision. It concludes that the tie-breaking bias can cause significant differences in measure values, though it does not change system rankings, and that further work is needed to study its effects and to develop reordering-free evaluation measures.
CLEF 2010 - Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation
1. CLEF’10: Conference on Multilingual and Multimodal
Information Access Evaluation
September 20-23, Padua, Italy
Tie-Breaking Bias:
Effect of an Uncontrolled Parameter
on Information Retrieval Evaluation
Guillaume Cabanac, Gilles Hubert,
Mohand Boughanem, Claude Chrisment
2. Effect of the Tie-Breaking Bias G. Cabanac et al.
Outline
1. Motivation: A tale about two TREC participants
2. Context: IRS effectiveness evaluation; Issue: Tie-breaking bias effects
3. Contribution: Reordering strategies
4. Experiments: Impact of the tie-breaking bias
5. Conclusion and Future Work
4. 1. Motivation: Tie-breaking bias illustration
A tale about two TREC participants (1/2)
Topic 031 “satellite launch contracts”: 5 relevant documents
[figure: Chris's run C and Ellen's run E, each three documents scored (0.8, 0.8, 0.5); the two runs differ in one single document. Chris is unlucky, Ellen is lucky.]
Why such a huge difference?
5. 1. Motivation: Tie-breaking bias illustration
A tale about two TREC participants (2/2)
[figure: Chris's and Ellen's evaluated runs, with one single difference]
After 15 days of hard work, the only difference between their runs is the name of one document.
7. 2. Context & issue: Tie-breaking bias
Measuring the effectiveness of IRSs
User-centered vs. system-focused evaluation [Spärck Jones & Willett, 1997]
Evaluation campaigns:
1958 Cranfield (UK)
1992 TREC, Text REtrieval Conference (USA)
1999 NTCIR, NII Test Collection for IR Systems (Japan)
2001 CLEF, Cross-Language Evaluation Forum (Europe)
…
“Cranfield” methodology: a task, a test collection (corpus, topics, qrels), and effectiveness measures (MAP, P@X, …) computed using trec_eval [Voorhees, 2007]
8. 2. Context & issue: Tie-breaking bias
Runs are reordered prior to their evaluation
Qrels = (qid, iter, docno, rel); Run = (qid, iter, docno, rank, sim, run_id)
[figure: a result list before and after reordering; the two documents tied at 0.8 are swapped]
Reordering by trec_eval: qid asc, sim desc, docno desc
Effectiveness measure = f(intrinsic_quality, luck), for measures such as MAP, P@X, MRR…
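As a minimal sketch (assuming a run held as a list of dicts with `qid`, `docno`, and `sim` keys; the document numbers below are made up for illustration), trec_eval's conventional reordering can be reproduced with stable sorts applied from the least significant key up:

```python
def conventional_reorder(run):
    """Reorder a run as trec_eval does: qid asc, sim desc, docno desc.

    Python's sort is stable, so sorting by the least significant key
    first and the most significant key last yields the compound order.
    """
    run = sorted(run, key=lambda e: e["docno"], reverse=True)  # docno desc
    run = sorted(run, key=lambda e: e["sim"], reverse=True)    # sim desc
    run = sorted(run, key=lambda e: e["qid"])                  # qid asc
    return run

# Two documents tied at 0.8: the lexicographically later docno wins,
# whatever rank the participant submitted (docnos are hypothetical).
run = [
    {"qid": 31, "docno": "AP880412-0009", "sim": 0.8},
    {"qid": 31, "docno": "WSJ870220-0097", "sim": 0.8},
    {"qid": 31, "docno": "FR880510-0001", "sim": 0.5},
]
print([e["docno"] for e in conventional_reorder(run)])
# ['WSJ870220-0097', 'AP880412-0009', 'FR880510-0001']
```

This is why a Wall Street Journal docno can systematically precede an Associated Press docno among tied documents, regardless of which one the system actually ranked first.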
9. Effect of the Tie-Breaking Bias G. Cabanac et al.
Outline
1. Motivation A tale about two TREC participants
2. Context IRS effectiveness evaluation
Issue Tie-breaking bias effects
3. Contribution Reordering strategies
4. Experiments Impact of the tie-breaking bias
5. Conclusion and Future Works
9
10. 3. Contribution: Reordering strategies
Consequences of run reordering
Measures of effectiveness for an IRS s:
RR(s,t): 1/rank of the 1st relevant document, for topic t
P(s,t,d): precision at document d, for topic t
AP(s,t): average precision for topic t
MAP(s): mean average precision
All are sensitive to document rank, hence to the tie-breaking bias.
[figure: Chris's and Ellen's reordered runs]
Is the Wall Street Journal collection more relevant than Associated Press?
Problem 1: comparing 2 systems, AP(s1, t) vs. AP(s2, t)
Problem 2: comparing 2 topics, AP(s, t1) vs. AP(s, t2)
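The two rank-sensitive measures above can be sketched as follows, assuming a reordered result list of docnos and the set of relevant docnos for the topic (the document names are illustrative):

```python
def reciprocal_rank(ranked, relevant):
    """RR(s,t): 1 / rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """AP(s,t): sum of precision@rank over ranks holding a relevant
    document, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Swapping two tied documents halves RR when the relevant one moves down:
print(reciprocal_rank(["d1", "d2", "d3"], {"d1"}))  # 1.0 (Ellen, lucky)
print(reciprocal_rank(["d2", "d1", "d3"], {"d1"}))  # 0.5 (Chris, unlucky)
```

The usage lines make the opening tale concrete: a single swap inside a tied group, decided by docno alone, changes RR from 1.0 to 0.5.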
11. 3. Contribution: Reordering strategies
Alternative unbiased reordering strategies for ex aequo (tied) documents
Conventional reordering (TREC): ties sorted Z to A (qid asc, sim desc, docno desc)
Realistic reordering: relevant docs last (qid asc, sim desc, rel asc, docno desc)
Optimistic reordering: relevant docs first (qid asc, sim desc, rel desc, docno desc)
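The three strategies differ only by an optional relevance key inserted between `sim` and `docno`. A sketch, assuming qrels are held as a dict mapping `(qid, docno)` to a 0/1 relevance value (unjudged documents default to 0):

```python
def reorder(run, qrels, strategy="conventional"):
    """Reorder a run before evaluation.

    conventional: qid asc, sim desc, docno desc            (trec_eval)
    realistic:    qid asc, sim desc, rel asc,  docno desc  (relevant last)
    optimistic:   qid asc, sim desc, rel desc, docno desc  (relevant first)
    Stable sorts are applied from the least significant key up.
    """
    rel = lambda e: qrels.get((e["qid"], e["docno"]), 0)
    run = sorted(run, key=lambda e: e["docno"], reverse=True)  # docno desc
    if strategy == "realistic":
        run = sorted(run, key=rel)                             # rel asc
    elif strategy == "optimistic":
        run = sorted(run, key=rel, reverse=True)               # rel desc
    run = sorted(run, key=lambda e: e["sim"], reverse=True)    # sim desc
    run = sorted(run, key=lambda e: e["qid"])                  # qid asc
    return run
```

Evaluating the same run under the realistic and optimistic orderings brackets the measure value the system can obtain once ties are broken by relevance, which is how the measure bounds discussed later can be obtained.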
13. 4. Experiments: Impact of the tie-breaking bias
Effect of the tie-breaking bias
Study of 4 TREC tasks: adhoc, routing, filtering, and web, spanning 22 editions from 1993 to 2009
1360 runs, 3 GB of data from trec.nist.gov
Assessing the effect of tie-breaking:
Proportion of document ties: how frequent is the bias?
Effect on measure values: top 3 observed differences, and observed difference in %
Significance of the observed difference: Student’s t-test (paired, one-tailed)
14. 4. Experiments: Impact of the tie-breaking bias
Ties demographics
89.6% of the runs comprise ties, and ties are present all along the runs.
15. 4. Experiments: Impact of the tie-breaking bias
Proportion of tied documents in submitted runs
On average, 25.2% of a result list consists of tied documents, and a tied group contains 10.6 documents.
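Statistics like these can be computed per run by grouping equal scores; a minimal sketch, assuming a single-topic result list of `(docno, sim)` pairs:

```python
from itertools import groupby

def tie_stats(run):
    """Return (proportion of tied documents, mean tied-group size)
    for a result list of (docno, sim) pairs."""
    by_score = sorted(run, key=lambda e: e[1], reverse=True)
    # groupby needs equal scores to be adjacent, hence the sort above.
    groups = [list(g) for _, g in groupby(by_score, key=lambda e: e[1])]
    tied = [g for g in groups if len(g) > 1]
    n_tied = sum(len(g) for g in tied)
    proportion = n_tied / len(run) if run else 0.0
    mean_group = n_tied / len(tied) if tied else 0.0
    return proportion, mean_group

prop, size = tie_stats([("a", 0.9), ("b", 0.8), ("c", 0.8),
                        ("d", 0.8), ("e", 0.5)])
print(prop, size)  # 0.6 3.0
```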
16. 4. Experiments: Impact of the tie-breaking bias
Effect on Reciprocal Rank (RR)
17. 4. Experiments: Impact of the tie-breaking bias
Effect on Average Precision (AP)
18. 4. Experiments: Impact of the tie-breaking bias
Effect on Mean Average Precision (MAP)
The difference of system ranks computed on MAP is not significant (Kendall’s τ).
19. 4. Experiments: Impact of the tie-breaking bias
What we learnt: beware of tie-breaking for AP
Weak effect on MAP, larger effect on AP
Measure bounds: AP_realistic ≤ AP_conventional ≤ AP_optimistic (e.g., run padre1, adhoc’94)
Failure analysis for the ranking process: the error bar reflects the element of chance, i.e., the potential for improvement
20. 4. Experiments: Impact of the tie-breaking bias
Related works in IR evaluation
Topics reliability? [Buckley & Voorhees, 2000] 25 topics; [Voorhees & Buckley, 2002] error rate; [Voorhees, 2009] n collections
Qrels reliability? [Voorhees, 1998] quality; [Al-Maskari et al., 2008] TREC vs. TREC; [Voorhees, 2007]
Measures reliability? [Buckley & Voorhees, 2000] MAP; [Sakai, 2008] ‘system bias’; [Moffat & Zobel, 2008] new measures; [Raghavan et al., 1989] Precall; [McSherry & Najork, 2008] tied scores; [Cabanac et al., 2010] tie-breaking bias
Pooling reliability? [Zobel, 1998] approximation; [Sanderson & Joho, 2004] manual pooling; [Buckley et al., 2007] size adaptation
22. Impact of the “tie-breaking bias” on IR evaluations
Conclusions and future work
Context: IR evaluation at TREC and other campaigns based on trec_eval
Contributions:
Measure = f(intrinsic_quality, luck): the tie-breaking bias
Measure bounds: realistic ≤ conventional ≤ optimistic
Study of the tie-breaking bias effect (conventional vs. realistic) for RR, AP, and MAP: strong correlation, yet significant differences; no difference on system rankings (based on MAP)
Future work:
Study of other and more recent evaluation campaigns
Reordering-free measures
Finer-grained analyses: finding vs. ranking
23. Thank you