SlideShare ist ein Scribd-Unternehmen logo
1 von 43
Downloaden Sie, um offline zu lesen
Towards Minimal Test Collections
                           for Evaluation of
                   Audio Music Similarity and Retrieval
                        @julian_urbano                      @m_schedl
                      University Carlos III of Madrid   Johannes Kepler University




                                                                            AdMIRe 2012
Picture by ERdi43 (Wikipedia)                                    Lyon, France · April 17th
Problem
   evaluation of IR systems is costly
             Annotations
           time consuming
              expensive
               boring
           (Bad) Consequence
    small and biased test collections
  unlikely to change from year to year
                Solution
apply low-cost evaluation methodologies
nearly 2 decades of
                                         Meta-Evaluation in Text IR
             some good practices
             inherited from here
                                                  NTCIR                  CLEF
Cranfield 2 MEDLARS SMART          TREC           (1999-today)
                                                                        (2000-today)
  (1962-1966) (1966-1967)          (1992-today)
                    (1961-1995)
1960                                                                                          2011

                                                         ISMIR                MIREX
                                                         (2000-today)
                                                                               (2005-today)




                                           a lot of things
                                        have happened here!
Minimal Test Collections (MTC) [Carterette at al.]
        estimate the ranking of systems
 with very few judgments (high incompleteness)

   Application in Audio Music Similarity (AMS)
dozens of volunteers required by MIREX every year
        to make thousands of judgments
  Year Teams Systems Queries Results Judgments Overlap
  2006   5       6     60     1,800    1,629    10%
  2007   8      12    100     6,000    4,832    19%
  2009   9      15    100     7,500    6,732    10%
  2010   5       8    100     4,000    2,737    32%
  2011 10       18    100     9,000    6,322    30%
evaluation
        with
incomplete judgments
Basic Idea
       treat similarity scores as random variables
       can be estimated            with uncertainty

       gain of an arbitrary document: Gi ⤳ multinomial

                     𝐸 𝐺𝑖 =           𝑃 𝐺𝑖 = 𝑙 · 𝑙
                               𝑙∈ℒ

       ℒ 𝐵𝑅𝑂𝐴𝐷 = 0, 1, 2              ℒ 𝐹𝐼𝑁𝐸 = {0, 1, … , 100}

                whenever document i is judged:
                   𝐸 𝐺𝑖 = 𝑙   𝑉𝑎𝑟 𝐺 𝑖 = 0
*all variance formulas in the paper
AG@k is also treated as a random variable

                 1
        𝐸 𝐴𝐺@𝑘 =               𝐸 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘
                 𝑘
                         𝑖∈𝒟

 iterate all documents              ranking at which
    (in practice, only               it was retrieved
  the top k retrieved)


              Ultimate Goal
compute a good estimate with the least effort
Comparing Two Systems
what really matters is the sign of the difference
           1
 𝐸 𝛥𝐴𝐺@𝑘 =              𝐸 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘
           𝑘
                  𝑖∈𝒟

          Evaluating Several Queries
                  1
        𝐸 𝛥𝐴𝐺@𝑘 =                  𝐸 𝛥𝐴𝐺@𝑘 𝑞
                  𝒬
                             𝑞∈𝒬
       iterate all queries

                   The Rationale
     if 𝛼 < 𝑃 Δ𝐴𝐺@𝑘 ≤ 0 < 1 − 𝛼 then
 judge another document else stop judging
Distribution of AG@k
what are the possible assignments of similarity?

  𝑃 𝐴𝐺@𝑘 = 𝓏 ≔                      𝑃 𝐴𝐺@𝑘 = 𝓏 𝛾 𝑘 · 𝑃 𝛾 𝑘
                         𝛾 𝑘 ∈𝛤 𝑘
                                          ultimately
  iterate all possible
                                       depends on the
  permutations of k
                                      distribution of Gi
similarity assignments

                   Plain English
 the ratio of similarity assignments s.t. AG@k=z
For Complex Measures or Large Similarity Scales
         run Monte Carlo simulation
Actually, AG@k is a Special Case
 let G be the similarity of the top k for all queries
     query         AG@k for a single query

1. take a sample of k documents. Mean = X1
2. take a sample of k documents. Mean = X2
                       ...
Q. take a sample of k documents. Mean = XQ
                Mean of sample means = X
                mean AG@k over all queries

            Central Limit Theorem
 as Q→∞, X approximates a normal distribution
      regardless of the distribution of G
AG@k is Normally Distributed
use the normal cumulative density function Φ
                                                                              −𝐸 ∆𝐴𝐺@𝑘
                                     𝑃 ∆𝐴𝐺@𝑘 ≤ 0 = Φ
                                                                                      𝑉𝑎𝑟 ∆𝐴𝐺@𝑘
                                           BROAD scale                                         FINE scale




                                                                              0.030
           0.0 0.2 0.4 0.6 0.8 1.0




                                                                              0.020
                                                                    Density
 Density




                                                                              0.010
                                                                              0.000




                                     0.0   0.5    1.0   1.5   2.0                     0   20    40   60     80   100

                                                 AG@5                                            AG@5
Confidence as a Function of # Judgments

                                    100
                                    95
                                    90
                                                                                                         or waste
 Confidence in ranking of systems



                                                                                                         our time
                                                                          or keep judging
                                    85




                                                      we can           to be really confident
                                    80




                                                   stop judging
                                    75
                                    70
                                    65
                                    60
                                    55
                                    50




                                          0   10   20   30    40      50     60     70   80   90   100
                                                             Percent of judgments


                                     what documents should we judge?
                                    those that maximize the confidence
The Trick
documents retrieved by both systems are useless
       there is no need to judge them
 whatever Gi is, it is added and then subtracted

          Comparing Several Systems
  compute a weight wi for each query-document
     judge the document with largest effect

               wi in the Original MTC
      wi = largest weight across system pairs
reduces to # of system pairs affected by query-doc i
wi Dependent on Confidence
 if we are highly confident about a pair of systems
we do not need to judge another of their documents

           even if it has the largest weight

                                                             2
   𝑤𝑖 =              1 − 𝐶 𝐴,𝐵 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵 𝑖 ≤ 𝑘
          𝐴,𝐵 ∈𝒮−ℛ           weight inversely proportional
                                     to confidence
             iterate system pairs
             with low confidence

      better results than traditional weights
MTC for
   AMS
with AG@k
MTC for ΔAG@k
             average confidence on the ranking

        1
while          𝐴,𝐵 ∈𝒮
                        𝐶 𝐴,𝐵 ≤ 1 − 𝛼 do
        𝒮
                                            select the
         𝑖 ∗ ← 𝑎𝑟𝑔𝑚𝑎𝑥 𝑖 𝑤 𝑖              best document

                 from all unjudged query-documents
        judge query-document 𝑖 ∗ (obtain true 𝑔𝑎𝑖𝑛 𝑖 ∗ )
          𝐸 𝐺 𝑖 ∗ ← 𝑔𝑎𝑖𝑛 𝑖 ∗
          𝑉𝑎𝑟 𝐺 𝑖 ∗ ← 0
                        update
                 (increase confidence)
end while
MTC in
MIREX AMS 2011
Why MIREX 2011
            largest edition so far
   18 systems (153 pairwise comparisons)
      100 queries and 6,322 judgments


              Distribution of Gi
let us work with a uniform distribution for now
Confidence as Judgments are Made




correct bins: estimated sign is correct or
              not significant anyway
Confidence as Judgments are Made




correct bins: estimated sign is correct or
              not significant anyway
Confidence as Judgments are Made




correct bins: estimated sign is correct or
              not significant anyway
high confidence
with considerably
   less effort
Accuracy as Judgments are Made
       estimated bins always
       better than expected
Accuracy as Judgments are Made




 estimated signs
highly correlated
with confidence
Accuracy as Judgments are Made




rankings with tau = 0.9 traditionally considered
      equivalent (same as 95% accuracy)
high confidence
       and
 high accuracy
with considerably
   less effort
Statistical Significance
MTC allows us to accurately estimate the ranking
        but for the current set of queries
 can we generalize to a general set of queries?


                   Not Trivial
     we have the variance of the estimates
         but not the sample variance
Work with Upper and Lower Bounds of ΔAG@k
               Upper bound: best case for A
               Lower bound: best case for B

           1
   ∆𝐴𝐺@𝑘 =          𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘   +
            𝑘
               𝑖∈𝜋          known judgments
           1
         +         𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 −
           𝑘
              𝑖∈𝜋
           1
         −         𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘
           𝑘
                  𝑖∈𝜋



*same for the lower bound
Work with Upper and Lower Bounds of ΔAG@k
               Upper bound: best case for A
               Lower bound: best case for B

               1
   ∆𝐴𝐺@𝑘 =              𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘     +
                𝑘
                   𝑖∈𝜋               retrieved by A
               1
             +         𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 −
     best      𝑘
  similarity      𝑖∈𝜋          unknown judgments
    score      1
             −         𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘
               𝑘
                  𝑖∈𝜋



*same for the lower bound
Work with Upper and Lower Bounds of ΔAG@k
               Upper bound: best case for A
               Lower bound: best case for B

            1
   ∆𝐴𝐺@𝑘 =           𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘 +
             𝑘
                𝑖∈𝜋
            1
          +         𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 −
            𝑘                            retrieved by B
               𝑖∈𝜋
                                          but not by A
            1
          −         𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘
    worst   𝑘
  similarity      𝑖∈𝜋        unknown judgments
    score

*same for the lower bound
3 Rules
1. Assume best case for A (upper bound)
   if A <<< B then conclude A <<< B

2. Assume best case for B (lower bound)
   if B <<< A then conclude B <<< A

3. If in the best case for A we do not have A >>> B
   and in the best case for B we do not have B >>> A
   then conclude they are not significantly different
                     Problem
    upper and lower bounds are very unrealistic
Incorporate a Heuristic
4. If the estimated difference is larger than t
   naively conclude significance

         Choose t Based on Power Analysis
t = effect-size detectable by a t-test with
    • sample variance σ2=0.0615 from previous
    • sample size n=100               MIREX editions

    • Type I Error rate α=0.05         typical values
    • Type II Error rate β=0.15

                        t ≈ 0.067
Accuracy of the Significance Estimates




                              pretty good
                              around 95% confidence




 rule 4 (heuristic) ends up
overestimating significance
Accuracy of the Significance Estimates

                              rules 1 to 3 begin to apply
                              and correct overestimations




 rule 4 (heuristic) ends up
overestimating significance
Accuracy of the Significance Estimates




                      closer to
                      expected




          never under 90%
significance
can be estimated
 fairly well too
what we did
Introduce MTC to the MIR folks

      Work out the Math
      for MTC with AG@k

See How Well it would have Done
         in AMS 2011
         quite well actually!
what now
Learn the true Distribution of Similarity Judgments
              it‘s clearly not uniform
would give more accurate estimates with less effort
 use previous AMS data or fit a model as we judge


 Significance Testing with Incomplete Judgments
      best-case scenarios are very unrealistic


Study Low-Cost Methodologies for other MIR Tasks
what for
MTC Greatly Reduces the Effort for AMS (and SMS)
  have MIREX volunteers incrementally create
   brand new test collections for other tasks

                   Better Yet
 study low-cost methodologies for the other tasks
                 Not Only for MIREX
    private collections for in-house evaluations
no possibility of gathering large pools of annotators
           lost-cost becomes paramount
the MIR community
   needs a paradigm shift
from a priori to a posteriori
    evaluation methods
       to reduce cost
     and gain reliability

Weitere ähnliche Inhalte

Andere mochten auch

Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Julián Urbano
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityJulián Urbano
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityJulián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...Julián Urbano
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationJulián Urbano
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityJulián Urbano
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Julián Urbano
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalJulián Urbano
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Julián Urbano
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...Julián Urbano
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationRichard Diamond
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondRichard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondRichard Diamond
 

Andere mochten auch (13)

Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
Bringing Undergraduate Students Closer to a Real-World Information Retrieval ...
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
A Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR EvaluationA Plan for Sustainable MIR Evaluation
A Plan for Sustainable MIR Evaluation
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
Evaluation in (Music) Information Retrieval through the Audio Music Similarit...
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...How Significant is Statistically Significant? The case of Audio Music Similar...
How Significant is Statistically Significant? The case of Audio Music Similar...
 
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 PresentationThreshold Concepts in Quantitative Finance - DEE 2011 Presentation
Threshold Concepts in Quantitative Finance - DEE 2011 Presentation
 
CAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard DiamondCAPM: Introduction & Teaching Issues - Richard Diamond
CAPM: Introduction & Teaching Issues - Richard Diamond
 
Median and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard DiamondMedian and Its Significance - Dr Richard Diamond
Median and Its Significance - Dr Richard Diamond
 

Ähnlich wie Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)Adrian Aley
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix DatasetBen Mabey
 
Condition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating EquipmentCondition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating EquipmentJordan McBain
 
Clustering Theory
Clustering TheoryClustering Theory
Clustering TheorySSA KPI
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierNeha Kulkarni
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...PyData
 
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...IFPRI-EPTD
 
Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)SSA KPI
 
Presentation at SMI 2023
Presentation at SMI 2023Presentation at SMI 2023
Presentation at SMI 2023Joaquim Jorge
 
Efficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingEfficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingHsing-chuan Hsieh
 
Statistical quality__control_2
Statistical  quality__control_2Statistical  quality__control_2
Statistical quality__control_2Tech_MX
 

Ähnlich wie Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval (20)

Resolution
ResolutionResolution
Resolution
 
Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)Intro to Quant Trading Strategies (Lecture 10 of 10)
Intro to Quant Trading Strategies (Lecture 10 of 10)
 
1.2
1.21.2
1.2
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
Condition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating EquipmentCondition Monitoring Of Unsteadily Operating Equipment
Condition Monitoring Of Unsteadily Operating Equipment
 
Adam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the OddballsAdam Ashenfelter - Finding the Oddballs
Adam Ashenfelter - Finding the Oddballs
 
Measurement_and_Units.pptx
Measurement_and_Units.pptxMeasurement_and_Units.pptx
Measurement_and_Units.pptx
 
Clustering Theory
Clustering TheoryClustering Theory
Clustering Theory
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
K-Nearest Neighbor Classifier
K-Nearest Neighbor ClassifierK-Nearest Neighbor Classifier
K-Nearest Neighbor Classifier
 
TunUp final presentation
TunUp final presentationTunUp final presentation
TunUp final presentation
 
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
Gradient Boosted Regression Trees in Scikit Learn by Gilles Louppe & Peter Pr...
 
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
Biosight: Quantitative Methods for Policy Analysis: Stochastic Dynamic Progra...
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)Methods from Mathematical Data Mining (Supported by Optimization)
Methods from Mathematical Data Mining (Supported by Optimization)
 
Presentation at SMI 2023
Presentation at SMI 2023Presentation at SMI 2023
Presentation at SMI 2023
 
Input analysis
Input analysisInput analysis
Input analysis
 
Efficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketchingEfficient anomaly detection via matrix sketching
Efficient anomaly detection via matrix sketching
 
Linear sort
Linear sortLinear sort
Linear sort
 
Statistical quality__control_2
Statistical  quality__control_2Statistical  quality__control_2
Statistical quality__control_2
 

Mehr von Julián Urbano

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Julián Urbano
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowJulián Urbano
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationJulián Urbano
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured DocumentsJulián Urbano
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...Julián Urbano
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...Julián Urbano
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...Julián Urbano
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackJulián Urbano
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...Julián Urbano
 

Mehr von Julián Urbano (10)

Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...Statistical Significance Testing in Information Retrieval: An Empirical Analy...
Statistical Significance Testing in Information Retrieval: An Empirical Analy...
 
Your PhD and You
Your PhD and YouYour PhD and You
Your PhD and You
 
Statistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and HowStatistical Analysis of Results in Music Information Retrieval: Why and How
Statistical Analysis of Results in Music Information Retrieval: Why and How
 
The Treatment of Ties in AP Correlation
The Treatment of Ties in AP CorrelationThe Treatment of Ties in AP Correlation
The Treatment of Ties in AP Correlation
 
Crawling the Web for Structured Documents
Crawling the Web for Structured DocumentsCrawling the Web for Structured Documents
Crawling the Web for Structured Documents
 
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
How Do Gain and Discount Functions Affect the Correlation between DCG and Use...
 
A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...A Comparison of the Optimality of Statistical Significance Tests for Informat...
A Comparison of the Optimality of Statistical Significance Tests for Informat...
 
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
MIREX 2010 Symbolic Melodic Similarity: Local Alignment with Geometric Repres...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing TrackThe University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track
 
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
What is the Effect of Audio Quality on the Robustness of MFCCs and Chroma Fea...
 

Kürzlich hochgeladen

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Kürzlich hochgeladen (20)

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval

  • 1. Towards Minimal Test Collections for Evaluation of Audio Music Similarity and Retrieval @julian_urbano @m_schedl University Carlos III of Madrid Johannes Kepler University AdMIRe 2012 Picture by ERdi43 (Wikipedia) Lyon, France · April 17th
  • 2. Problem evaluation of IR systems is costly Annotations time consuming expensive boring (Bad) Consequence small and biased test collections unlikely to change from year to year Solution apply low-cost evaluation methodologies
  • 3. nearly 2 decades of Meta-Evaluation in Text IR some good practices inherited from here NTCIR CLEF Cranfield 2 MEDLARS SMART TREC (1999-today) (2000-today) (1962-1966) (1966-1967) (1992-today) (1961-1995) 1960 2011 ISMIR MIREX (2000-today) (2005-today) a lot of things have happened here!
  • 4. Minimal Test Collections (MTC) [Carterette at al.] estimate the ranking of systems with very few judgments (high incompleteness) Application in Audio Music Similarity (AMS) dozens of volunteers required by MIREX every year to make thousands of judgments Year Teams Systems Queries Results Judgments Overlap 2006 5 6 60 1,800 1,629 10% 2007 8 12 100 6,000 4,832 19% 2009 9 15 100 7,500 6,732 10% 2010 5 8 100 4,000 2,737 32% 2011 10 18 100 9,000 6,322 30%
  • 5. evaluation with incomplete judgments
  • 6. Basic Idea treat similarity scores as random variables can be estimated with uncertainty gain of an arbitrary document: Gi ⤳ multinomial 𝐸 𝐺𝑖 = 𝑃 𝐺𝑖 = 𝑙 · 𝑙 𝑙∈ℒ ℒ 𝐵𝑅𝑂𝐴𝐷 = 0, 1, 2 ℒ 𝐹𝐼𝑁𝐸 = {0, 1, … , 100} whenever document i is judged: 𝐸 𝐺𝑖 = 𝑙 𝑉𝑎𝑟 𝐺 𝑖 = 0 *all variance formulas in the paper
  • 7. AG@k is also treated as a random variable 1 𝐸 𝐴𝐺@𝑘 = 𝐸 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 𝑘 𝑖∈𝒟 iterate all documents ranking at which (in practice, only it was retrieved the top k retrieved) Ultimate Goal compute a good estimate with the least effort
  • 8. Comparing Two Systems what really matters is the sign of the difference 1 𝐸 𝛥𝐴𝐺@𝑘 = 𝐸 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘 𝑘 𝑖∈𝒟 Evaluating Several Queries 1 𝐸 𝛥𝐴𝐺@𝑘 = 𝐸 𝛥𝐴𝐺@𝑘 𝑞 𝒬 𝑞∈𝒬 iterate all queries The Rationale if 𝛼 < 𝑃 Δ𝐴𝐺@𝑘 ≤ 0 < 1 − 𝛼 then judge another document else stop judging
  • 9. Distribution of AG@k what are the possible assignments of similarity? 𝑃 𝐴𝐺@𝑘 = 𝓏 ≔ 𝑃 𝐴𝐺@𝑘 = 𝓏 𝛾 𝑘 · 𝑃 𝛾 𝑘 𝛾 𝑘 ∈𝛤 𝑘 ultimately iterate all possible depends on the permutations of k distribution of Gi similarity assignments Plain English the ratio of similarity assignments s.t. AG@k=z For Complex Measures or Large Similarity Scales run Monte Carlo simulation
  • 10. Actually, AG@k is a Special Case let G be the similarity of the top k for all queries query AG@k for a single query 1. take a sample of k documents. Mean = X1 2. take a sample of k documents. Mean = X2 ... Q. take a sample of k documents. Mean = XQ Mean of sample means = X mean AG@k over all queries Central Limit Theorem as Q→∞, X approximates a normal distribution regardless of the distribution of G
  • 11. AG@k is Normally Distributed use the normal cumulative density function Φ −𝐸 ∆𝐴𝐺@𝑘 𝑃 ∆𝐴𝐺@𝑘 ≤ 0 = Φ 𝑉𝑎𝑟 ∆𝐴𝐺@𝑘 BROAD scale FINE scale 0.030 0.0 0.2 0.4 0.6 0.8 1.0 0.020 Density Density 0.010 0.000 0.0 0.5 1.0 1.5 2.0 0 20 40 60 80 100 AG@5 AG@5
  • 12. Confidence as a Function of # Judgments 100 95 90 or waste Confidence in ranking of systems our time or keep judging 85 we can to be really confident 80 stop judging 75 70 65 60 55 50 0 10 20 30 40 50 60 70 80 90 100 Percent of judgments what documents should we judge? those that maximize the confidence
  • 13. The Trick documents retrieved by both systems are useless there is no need to judge them whatever Gi is, it is added and then subtracted Comparing Several Systems compute a weight wi for each query-document judge the document with largest effect wi in the Original MTC wi = largest weight across system pairs reduces to # of system pairs affected by query-doc i
  • 14. wi Dependent on Confidence if we are highly confident about a pair of systems we do not need to judge another of their documents even if it has the largest weight 2 𝑤𝑖 = 1 − 𝐶 𝐴,𝐵 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵 𝑖 ≤ 𝑘 𝐴,𝐵 ∈𝒮−ℛ weight inversely proportional to confidence iterate system pairs with low confidence better results than traditional weights
  • 15. MTC for AMS with AG@k
  • 16. MTC for ΔAG@k average confidence on the ranking 1 while 𝐴,𝐵 ∈𝒮 𝐶 𝐴,𝐵 ≤ 1 − 𝛼 do 𝒮 select the 𝑖 ∗ ← 𝑎𝑟𝑔𝑚𝑎𝑥 𝑖 𝑤 𝑖 best document from all unjudged query-documents judge query-document 𝑖 ∗ (obtain true 𝑔𝑎𝑖𝑛 𝑖 ∗ ) 𝐸 𝐺 𝑖 ∗ ← 𝑔𝑎𝑖𝑛 𝑖 ∗ 𝑉𝑎𝑟 𝐺 𝑖 ∗ ← 0 update (increase confidence) end while
  • 18. Why MIREX 2011 largest edition so far 18 systems (153 pairwise comparisons) 100 queries and 6,322 judgments Distribution of Gi let us work with a uniform distribution for now
  • 19. Confidence as Judgments are Made correct bins: estimated sign is correct or not significant anyway
  • 20. Confidence as Judgments are Made correct bins: estimated sign is correct or not significant anyway
  • 21. Confidence as Judgments are Made correct bins: estimated sign is correct or not significant anyway
  • 23. Accuracy as Judgments are Made estimated bins always better than expected
  • 24. Accuracy as Judgments are Made estimated signs highly correlated with confidence
  • 25. Accuracy as Judgments are Made rankings with tau = 0.9 traditionally considered equivalent (same as 95% accuracy)
  • 26. high confidence and high accuracy with considerably less effort
  • 27. Statistical Significance MTC allows us to accurately estimate the ranking but for the current set of queries can we generalize to a general set of queries? Not Trivial we have the variance of the estimates but not the sample variance
  • 28. Work with Upper and Lower Bounds of ΔAG@k Upper bound: best case for A Lower bound: best case for B 1 ∆𝐴𝐺@𝑘 = 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘 + 𝑘 𝑖∈𝜋 known judgments 1 + 𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝑘 𝑖∈𝜋 1 − 𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘 𝑘 𝑖∈𝜋 *same for the lower bound
  • 29. Work with Upper and Lower Bounds of ΔAG@k Upper bound: best case for A Lower bound: best case for B 1 ∆𝐴𝐺@𝑘 = 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘 + 𝑘 𝑖∈𝜋 retrieved by A 1 + 𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 − best 𝑘 similarity 𝑖∈𝜋 unknown judgments score 1 − 𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘 𝑘 𝑖∈𝜋 *same for the lower bound
  • 30. Work with Upper and Lower Bounds of ΔAG@k Upper bound: best case for A Lower bound: best case for B 1 ∆𝐴𝐺@𝑘 = 𝐺𝑖 · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝐼 𝐵𝑖 ≤ 𝑘 + 𝑘 𝑖∈𝜋 1 + 𝑙+ · 𝐼 𝐴 𝑖 ≤ 𝑘 − 𝑘 retrieved by B 𝑖∈𝜋 but not by A 1 − 𝑙− · 𝐼 𝐵 𝑖 ≤ 𝑘 ∧ 𝐴 𝑖 > 𝑘 worst 𝑘 similarity 𝑖∈𝜋 unknown judgments score *same for the lower bound
  • 31. 3 Rules 1. Assume best case for A (upper bound) if A <<< B then conclude A <<< B 2. Assume best case for B (lower bound) if B <<< A then conclude B <<< A 3. If in the best case for A we do not have A >>> B and in the best case for B we do not have B >>> A then conclude they are not significantly different Problem upper and lower bounds are very unrealistic
  • 32. Incorporate a Heuristic 4. If the estimated difference is larger than t naively conclude significance Choose t Based on Power Analysis t = effect-size detectable by a t-test with • sample variance σ2=0.0615 from previous • sample size n=100 MIREX editions • Type I Error rate α=0.05 typical values • Type II Error rate β=0.15 t ≈ 0.067
  • 33. Accuracy of the Significance Estimates pretty good around 95% confidence rule 4 (heuristic) ends up overestimating significance
  • 34. Accuracy of the Significance Estimates rules 1 to 3 begin to apply and correct overestimations rule 4 (heuristic) ends up overestimating significance
  • 35. Accuracy of the Significance Estimates closer to expected never under 90%
  • 36. significance can be estimated fairly well too
  • 38. Introduce MTC to the MIR folks Work out the Math for MTC with AG@k See How Well it would have Done in AMS 2011 quite well actually!
  • 40. Learn the true Distribution of Similarity Judgments it‘s clearly not uniform would give more accurate estimates with less effort use previous AMS data or fit a model as we judge Significance Testing with Incomplete Judgments best-case scenarios are very unrealistic Study Low-Cost Methodologies for other MIR Tasks
  • 42. MTC Greatly Reduces the Effort for AMS (and SMS) have MIREX volunteers incrementally create brand new test collections for other tasks Better Yet study low-cost methodologies for the other tasks Not Only for MIREX private collections for in-house evaluations no possibility of gathering large pools of annotators lost-cost becomes paramount
  • 43. the MIR community needs a paradigm shift from a priori to a posteriori evaluation methods to reduce cost and gain reliability