Reliable evaluation of Information Retrieval systems requires large amounts of relevance judgments. Collecting these annotations is complex and tedious for many Music Information Retrieval tasks, which makes such evaluations prohibitively expensive. A low-cost alternative is the application of Minimal Test Collection algorithms, which offer reliable results while significantly reducing the annotation effort. The idea is to incrementally select which documents to judge so that we can estimate the effectiveness differences between systems with a certain degree of confidence. In this paper we present a first approach towards its application to the evaluation of the Audio Music Similarity and Retrieval task, run by the annual MIREX evaluation campaign. An analysis with the MIREX 2011 data shows that the judging effort can be reduced to about 35% while obtaining results with 95% confidence.
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval
1. Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval
@julian_urbano (University Carlos III of Madrid) · @m_schedl (Johannes Kepler University)
AdMIRe 2012 · Lyon, France · April 17th
Picture by ERdi43 (Wikipedia)
2. Problem
evaluation of IR systems is costly
Annotations: time-consuming, expensive, boring
(Bad) Consequence: small and biased test collections, unlikely to change from year to year
Solution: apply low-cost evaluation methodologies
3. nearly 2 decades of Meta-Evaluation in Text IR
some good practices inherited from here
[timeline, 1960-2011: Cranfield 2 (1962-1966) · MEDLARS (1966-1967) · SMART (1961-1995) · TREC (1992-today) · NTCIR (1999-today) · CLEF (2000-today)]
ISMIR (2000-today) · MIREX (2005-today)
a lot of things have happened here!
4. Minimal Test Collections (MTC) [Carterette et al.]
estimate the ranking of systems
with very few judgments (high incompleteness)
Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year
to make thousands of judgments
Year Teams Systems Queries Results Judgments Overlap
2006 5 6 60 1,800 1,629 10%
2007 8 12 100 6,000 4,832 19%
2009 9 15 100 7,500 6,732 10%
2010 5 8 100 4,000 2,737 32%
2011 10 18 100 9,000 6,322 30%
6. Basic Idea
treat similarity scores as random variables: they can be estimated with uncertainty
gain of an arbitrary document: $G_i \sim$ multinomial
$E[G_i] = \sum_{l \in \mathcal{L}} P(G_i = l) \cdot l$
$\mathcal{L}_{BROAD} = \{0, 1, 2\}$   $\mathcal{L}_{FINE} = \{0, 1, \dots, 100\}$
whenever document $i$ is judged with label $l$: $E[G_i] = l$ and $Var(G_i) = 0$
*all variance formulas in the paper
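A minimal sketch of the expected-gain computation, assuming a plain Python dictionary as the multinomial distribution (variable names are illustrative, not from the paper's code):

```python
# Similarity scales used in MIREX AMS.
BROAD = [0, 1, 2]
FINE = list(range(101))

def expected_gain(dist):
    """E[Gi] = sum over labels l of P(Gi = l) * l."""
    return sum(p * l for l, p in dist.items())

# Uniform prior over the BROAD scale (the assumption used later in the talk).
uniform_broad = {l: 1.0 / len(BROAD) for l in BROAD}
print(expected_gain(uniform_broad))  # 1.0

# Once document i is judged with label 2, E[Gi] = 2 and Var(Gi) = 0.
print(expected_gain({2: 1.0}))  # 2.0
```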
7. AG@k is also treated as a random variable
$E[AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot I(A_i \le k)$
iterate all documents (in practice, only the top k retrieved); $A_i$ is the ranking at which document $i$ was retrieved
Ultimate Goal
compute a good estimate with the least effort
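A minimal sketch of the estimator, assuming a hypothetical `ranking` list of document ids in retrieval order and an `e_gain` map from document id to its expected gain:

```python
def expected_ag_at_k(ranking, e_gain, k=5):
    # Only the top-k documents have I(Ai <= k) = 1; a judged document
    # contributes its label, an unjudged one its prior expectation.
    return sum(e_gain[doc] for doc in ranking[:k]) / k

ranking = ["d3", "d7", "d1", "d9", "d2"]
e_gain = {"d3": 2.0, "d7": 1.0, "d1": 1.0, "d9": 0.0, "d2": 1.0}
print(expected_ag_at_k(ranking, e_gain))  # 1.0
```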
8. Comparing Two Systems
what really matters is the sign of the difference
$E[\Delta AG@k] = \frac{1}{k} \sum_{i \in \mathcal{D}} E[G_i] \cdot (I(A_i \le k) - I(B_i \le k))$
Evaluating Several Queries
$E[\Delta AG@k] = \frac{1}{|\mathcal{Q}|} \sum_{q \in \mathcal{Q}} E[\Delta AG@k \mid q]$  (iterate all queries)
The Rationale
if $\alpha < P(\Delta AG@k \le 0) < 1 - \alpha$ then judge another document, else stop judging
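A minimal sketch of the per-query averaging and the stopping rule; how $P(\Delta AG@k \le 0)$ is obtained comes later (slide 11), so here it is just a parameter:

```python
def expected_delta_ag(per_query_a, per_query_b):
    """Mean over queries of E[AG@k | q] for system A minus system B."""
    diffs = [a - b for a, b in zip(per_query_a, per_query_b)]
    return sum(diffs) / len(diffs)

def keep_judging(p_delta_leq_zero, alpha=0.05):
    # Judge another document while alpha < P(dAG@k <= 0) < 1 - alpha.
    return alpha < p_delta_leq_zero < 1 - alpha

print(expected_delta_ag([1.2, 0.8, 1.0], [0.9, 0.7, 1.1]))  # ~0.1
print(keep_judging(0.30))  # True: still uncertain about the sign
print(keep_judging(0.01))  # False: confident enough to stop
```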
9. Distribution of AG@k
what are the possible assignments of similarity?
$P(AG@k = z) := \sum_{\gamma_k \in \Gamma_k} P(AG@k = z \mid \gamma_k) \cdot P(\gamma_k)$
iterate all possible permutations of k similarity assignments; ultimately depends on the distribution of $G_i$
Plain English
the ratio of similarity assignments s.t. AG@k = z
For Complex Measures or Large Similarity Scales
run a Monte Carlo simulation
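A minimal sketch of the Monte Carlo route, assuming each top-k document comes with a {label: probability} distribution (a judged document would be a point mass on its label):

```python
import random
from collections import Counter

def ag_distribution(dists, k=5, trials=100_000):
    """Approximate P(AG@k = z) by sampling similarity assignments."""
    counts = Counter()
    for _ in range(trials):
        gains = [random.choices(list(d), weights=list(d.values()))[0]
                 for d in dists]
        counts[sum(gains) / k] += 1
    return {z: c / trials for z, c in sorted(counts.items())}

uniform_broad = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
print(ag_distribution([uniform_broad] * 5))  # peaks around AG@5 = 1
```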
10. Actually, AG@k is a Special Case
let G be the similarity of the top k for all queries
1. take a sample of k documents for query 1, mean = $\bar{X}_1$ (its AG@k)
2. take a sample of k documents for query 2, mean = $\bar{X}_2$
...
Q. take a sample of k documents for query Q, mean = $\bar{X}_Q$
mean of sample means = $\bar{X}$, the mean AG@k over all queries
Central Limit Theorem
as Q→∞, $\bar{X}$ approximates a normal distribution, regardless of the distribution of G
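A minimal simulation of that argument, assuming uniform BROAD gains purely for illustration: each per-query AG@5 is far from normal, but the mean over 100 queries is bell-shaped:

```python
import random
import statistics

def mean_ag_over_queries(num_queries=100, k=5):
    # AG@k per query = mean gain of k documents; then average over queries.
    ags = [sum(random.choice([0, 1, 2]) for _ in range(k)) / k
           for _ in range(num_queries)]
    return sum(ags) / num_queries

means = [mean_ag_over_queries() for _ in range(10_000)]
print(statistics.mean(means))   # ~1.0
print(statistics.stdev(means))  # ~0.037: a tight bell around the mean
```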
11. AG@k is Normally Distributed
use the standard normal cumulative distribution function $\Phi$
$P(\Delta AG@k \le 0) = \Phi\left(\frac{-E[\Delta AG@k]}{\sqrt{Var[\Delta AG@k]}}\right)$
[figure: density of AG@5 under the BROAD scale (0-2) and the FINE scale (0-100)]
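A minimal sketch of the confidence computation with the standard library's normal distribution; the estimates of $E$ and $Var$ come from the formulas above (all variance formulas are in the paper):

```python
from statistics import NormalDist

def p_delta_leq_zero(e_delta, var_delta):
    """P(dAG@k <= 0) under the normal approximation."""
    return NormalDist().cdf(-e_delta / var_delta ** 0.5)

print(p_delta_leq_zero(0.10, 0.002))  # ~0.013: A is very likely better
```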
12. Confidence as a Function of # Judgments
[figure: confidence in the ranking of systems (50-100%) as a function of the percent of judgments made]
at some point we can stop judging to be really confident, or keep judging and waste our time
what documents should we judge?
those that maximize the confidence
13. The Trick
documents retrieved by both systems are useless
there is no need to judge them
whatever Gi is, it is added and then subtracted
Comparing Several Systems
compute a weight wi for each query-document
judge the document with largest effect
wi in the Original MTC
wi aggregates the pairwise weights across system pairs
reduces to the # of system pairs affected by query-doc i
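A minimal sketch of that counting, assuming a hypothetical `top_ks` dict from system name to its set of top-k document ids for one query (a pair is affected only when exactly one of the two systems retrieves the document):

```python
from itertools import combinations

def mtc_weight(doc, top_ks):
    """# of system pairs whose dAG@k depends on judging `doc`."""
    return sum(1 for a, b in combinations(top_ks.values(), 2)
               if (doc in a) != (doc in b))

top_ks = {"sysA": {"d1", "d2"}, "sysB": {"d2", "d3"}, "sysC": {"d1", "d3"}}
print(mtc_weight("d1", top_ks))  # 2: affects pairs (A,B) and (B,C)
print(mtc_weight("d2", top_ks))  # 2: affects pairs (A,C) and (B,C)
```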
14. wi Dependent on Confidence
if we are highly confident about a pair of systems, we do not need to judge another of their documents, even if it has the largest weight
$w_i = \sum_{(A,B) \in \mathcal{S} \setminus \mathcal{R}} (1 - C_{A,B}) \cdot (I(A_i \le k) - I(B_i \le k))^2$
iterate system pairs with low confidence; weight inversely proportional to confidence
better results than traditional weights
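A minimal sketch of this confidence-weighted variant (the formula above is reconstructed from the slide, so treat the exact form as an assumption); pairs that are already decided contribute almost nothing, so judging concentrates on the uncertain comparisons:

```python
from itertools import combinations

def weighted_mtc(doc, top_ks, confidence):
    """Sum of (1 - C_AB) over the system pairs affected by `doc`."""
    w = 0.0
    for a, b in combinations(sorted(top_ks), 2):
        affected = (doc in top_ks[a]) != (doc in top_ks[b])
        w += (1 - confidence[(a, b)]) * affected
    return w

top_ks = {"sysA": {"d1", "d2"}, "sysB": {"d2", "d3"}}
confidence = {("sysA", "sysB"): 0.99}  # an almost-decided pair
print(weighted_mtc("d1", top_ks, confidence))  # ~0.01: barely worth judging
```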
18. Why MIREX 2011
largest edition so far
18 systems (153 pairwise comparisons)
100 queries and 6,322 judgments
Distribution of Gi
let us work with a uniform distribution for now
19-21. Confidence as Judgments are Made
[figure, shown incrementally across three slides]
correct bins: estimated sign is correct, or not significant anyway
25. Accuracy as Judgments are Made
rankings with Kendall τ = 0.9 are traditionally considered equivalent (τ = 0.9 corresponds to 95% pairwise accuracy)
26. high confidence and high accuracy with considerably less effort
27. Statistical Significance
MTC allows us to accurately estimate the ranking
but only for the current set of queries
can we generalize to a wider set of queries?
Not Trivial
we have the variance of the estimates
but not the sample variance
28-30. Work with Upper and Lower Bounds of ΔAG@k
Upper bound: best case for A · Lower bound: best case for B
$\Delta AG@k^+ = \frac{1}{k} \sum_{i \in \pi} G_i \cdot (I(A_i \le k) - I(B_i \le k)) + \frac{1}{k} \sum_{i \in \bar{\pi}} l^+ \cdot I(A_i \le k) - \frac{1}{k} \sum_{i \in \bar{\pi}} l^- \cdot I(B_i \le k \wedge A_i > k)$
first term: known judgments ($\pi$ is the set of judged documents)
second term: unknown judgments ($\bar{\pi}$) retrieved by A, assumed the best similarity score $l^+$
third term: unknown judgments retrieved by B but not by A, assumed the worst similarity score $l^-$
*same for the lower bound
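A minimal sketch of the upper bound under the BROAD scale ($l^+ = 2$, $l^- = 0$); `top_a` and `top_b` are hypothetical top-k sets for the two systems, and swapping their roles gives the lower bound:

```python
def delta_ag_upper(top_a, top_b, judged, l_best=2, l_worst=0, k=5):
    """Best case for A: unjudged docs in A's top-k get l+, unjudged
    docs only in B's top-k get l-."""
    total = 0.0
    for doc in top_a | top_b:
        in_a, in_b = doc in top_a, doc in top_b
        if doc in judged:
            total += judged[doc] * (in_a - in_b)  # known judgment
        elif in_a:
            total += l_best
        elif in_b:
            total -= l_worst
    return total / k

top_a, top_b = {"d1", "d2", "d3"}, {"d3", "d4"}
print(delta_ag_upper(top_a, top_b, judged={"d3": 1}))  # 0.8: d3 cancels out
```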
31. 3 Rules
1. Assume best case for A (upper bound): if even then A <<< B, conclude A <<< B
2. Assume best case for B (lower bound): if even then B <<< A, conclude B <<< A
3. If in the best case for A we do not have A >>> B
and in the best case for B we do not have B >>> A
then conclude they are not significantly different
Problem
upper and lower bounds are very unrealistic
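The three rules amount to checking the sign of the [lower, upper] interval; a minimal sketch, reading "A <<< B" as "A significantly worse than B":

```python
def conclude(lower, upper):
    """Decide a pair from the bounds on dAG@k = AG@k(A) - AG@k(B)."""
    if upper < 0:   # even in A's best case, A loses: rule 1
        return "A <<< B"
    if lower > 0:   # even in B's best case, B loses: rule 2
        return "A >>> B"
    return "not significantly different"  # rule 3

print(conclude(-0.3, -0.1))  # A <<< B
print(conclude(-0.1, 0.2))   # not significantly different
```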
32. Incorporate a Heuristic
4. If the estimated difference is larger than t, naively conclude significance
Choose t Based on Power Analysis
t = effect size detectable by a t-test with:
• sample variance σ² = 0.0615 and sample size n = 100 (from previous MIREX editions)
• Type I Error rate α = 0.05 and Type II Error rate β = 0.15 (typical values)
t ≈ 0.067
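A minimal sketch that reproduces the figure with a normal approximation to the one-sided power analysis (the paper's exact t-test computation may differ slightly):

```python
from statistics import NormalDist

def detectable_effect(var, n, alpha=0.05, beta=0.15):
    """Smallest mean difference detectable with the given error rates."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha) + z(1 - beta)) * (var / n) ** 0.5

print(detectable_effect(0.0615, 100))  # ~0.0665, i.e. t ~= 0.067
```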
33. Accuracy of the Significance Estimates
pretty good: around 95% confidence
rule 4 (heuristic) ends up overestimating significance
34. Accuracy of the Significance Estimates
rules 1 to 3 begin to apply and correct the overestimations
35. Accuracy of the Significance Estimates
closer to expected, never under 90%
40. Learn the true Distribution of Similarity Judgments
it's clearly not uniform
would give more accurate estimates with less effort
use previous AMS data or fit a model as we judge
Significance Testing with Incomplete Judgments
best-case scenarios are very unrealistic
Study Low-Cost Methodologies for other MIR Tasks
42. MTC Greatly Reduces the Effort for AMS (and SMS)
have MIREX volunteers incrementally create
brand new test collections for other tasks
Better Yet
study low-cost methodologies for the other tasks
Not Only for MIREX
private collections for in-house evaluations
no possibility of gathering large pools of annotators
low-cost becomes paramount
43. the MIR community needs a paradigm shift
from a priori to a posteriori evaluation methods
to reduce cost and gain reliability