How Significant is Statistically Significant?
The Case of Audio Music Similarity and Retrieval

@julian_urbano, University Carlos III of Madrid
J. Stephen Downie, University of Illinois at Urbana-Champaign
Brian McFee, University of California at San Diego
Markus Schedl, Johannes Kepler University Linz

ISMIR 2012 · Porto, Portugal · October 9th
Picture by Humberto Santos
let’s review two papers
paper A: +0.14* (statistically significant)
paper B: +0.21
…which one should get published?
a.k.a. which research line should we follow?
paper A: +0.14*
paper B: +0.14*
…which one should get published?
a.k.a. which research line should we follow?
Goal of Comparing Systems…
Find out the effectiveness difference 𝒅
  (arbitrary query and arbitrary user)
Impossible! It requires running the systems for the universe of all queries
[Figure: Δeffectiveness axis from -1 to 1, with 𝒅 marked]
…what Evaluations can do
Estimate 𝒅 with the average d̄ over a sample of queries 𝓠
[Figure: Δeffectiveness axis from -1 to 1; the estimate d̄ lands at a different point for each sample of queries, sometimes below 0]

There is always random error

     …so we need a
  measure of confidence
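
The random error in the estimate can be sketched with a small simulation. Everything in it is hypothetical: the true difference `TRUE_D`, the per-query spread `SIGMA`, the sample size and the normal model for per-query differences are assumptions chosen for illustration, not values from the talk.

```python
import random
import statistics

TRUE_D = 0.05    # assumed true difference (unknown in a real evaluation)
SIGMA = 0.25     # assumed spread of per-query differences
N_QUERIES = 50   # assumed size of each query sample

def sample_mean_difference(rng, n=N_QUERIES):
    """Average per-query difference over one random sample of n queries."""
    # Clip each draw to [-1, 1], the range of an effectiveness difference.
    diffs = [max(-1.0, min(1.0, rng.gauss(TRUE_D, SIGMA))) for _ in range(n)]
    return statistics.mean(diffs)

rng = random.Random(2012)
# Repeat the "evaluation" with 1000 different query samples.
estimates = [sample_mean_difference(rng) for _ in range(1000)]

print("true difference:  ", TRUE_D)
print("smallest estimate:", round(min(estimates), 3))
print("largest estimate: ", round(max(estimates), 3))
print("estimates below 0:", sum(e < 0 for e in estimates))
```

Under these assumptions the estimates scatter around the true value, and some samples even point in the wrong direction, which is why a measure of confidence is needed.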
The Significance Drill
   Test these hypotheses
          H0: 𝒅 = 0
          H1: 𝒅 ≠ 0
     Result of the test…
     p-value = P( a difference at least as extreme as d̄ | H0 )
 …interpretation of the test
p-value is very small: reject H0
           otherwise: accept H0
The Significance Drill
  Test these hypotheses
         H0: 𝒅 = 0
         H1: 𝒅 ≠ 0

We accept/reject H0…
  (based on the p-value and α)


    …not the test!
Usual (wrong) conclusions
   A is substantially better than B

    A is much better than B

  The difference is important

  The difference is significant
What does it mean?
That there is a difference
    (unlikely due to chance/random error)


We don’t need fancy statistics…

   …we already know
   they are different!
H0: 𝒅 = 0
 is false by definition

 because systems A and B
are different to begin with
What is really important?
     The effect-size:
     magnitude of 𝑑
               This is what predicts user
               satisfaction, not p-values

𝒅 = +0.6 is a huge improvement
𝒅 = +0.0001 is irrelevant…
           …and yet, it can easily be
            statistically significant
Example: t-test
$t = \dfrac{\bar{d} \cdot \sqrt{|\mathcal{Q}|}}{s_d}$
The larger the statistic t, the smaller the p-value

How to achieve statistical significance?
a) Reduce variance
b) Further improve the system
c) Evaluate with more queries!
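
Option (c) can be illustrated with a minimal sketch that holds the average difference and its standard deviation fixed while the number of queries grows. The values `MEAN_DIFF` and `SD_DIFF` are made up, and the p-value uses a large-sample normal approximation rather than the exact t distribution, to keep the example within the Python standard library.

```python
import math
from statistics import NormalDist

def t_statistic(mean_diff, sd_diff, n_queries):
    """t = mean difference * sqrt(number of queries) / standard deviation."""
    return mean_diff * math.sqrt(n_queries) / sd_diff

def approx_two_sided_p(t):
    """Two-sided p-value under a normal approximation (assumption: large n)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

MEAN_DIFF = 0.01  # hypothetical, small average difference
SD_DIFF = 0.2     # hypothetical standard deviation of per-query differences

for n in (50, 500, 5000, 50000):
    t = t_statistic(MEAN_DIFF, SD_DIFF, n)
    print(f"|Q| = {n:>6}  t = {t:6.2f}  p ≈ {approx_two_sided_p(t):.4f}")
```

The effect size never changes in this loop; only the number of queries does, and that alone pushes the p-value down.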
Statistical Significance is
eventually meaningless…

  …all you have to do is
  use enough queries
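
How many queries count as "enough" follows from rearranging the t statistic. The sketch below assumes a positive average difference, that the average difference and standard deviation stay about the same as queries are added, and a critical value $t_\alpha \approx 1.96$ (two-sided, α = 0.05, large sample). The standard deviation $s_d = 0.2$ is a made-up value.

```latex
t = \frac{\bar{d}\,\sqrt{|\mathcal{Q}|}}{s_d} \ge t_\alpha
\quad\Longleftrightarrow\quad
|\mathcal{Q}| \ge \left( \frac{t_\alpha \, s_d}{\bar{d}} \right)^{2}

% Hypothetical example: \bar{d} = 0.0001, s_d = 0.2, t_\alpha \approx 1.96
|\mathcal{Q}| \ge \left( \frac{1.96 \times 0.2}{0.0001} \right)^{2}
             = 3920^{2} = 15{,}366{,}400
```

Under these assumptions even an irrelevant difference of +0.0001 becomes statistically significant once the collection holds roughly 15 million queries.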
Practical Significance: Effect-Size 𝑑
     Effectiveness / Satisfaction
 Statistical Significance: p-value
             Confidence

    An improvement may be
 statistically significant, but that
  doesn’t mean it’s important!
the real importance
of an improvement
Purpose of Evaluation
How good is my system?
  [Figure: effectiveness axis from 0 to 1]
Is system A better than system B?
  [Figure: Δeffectiveness axis from -1 to 1]


We measure system effectiveness
Assumption
System Effectiveness corresponds to User Satisfaction
  (user satisfaction is our ultimate goal!)
[Figure: user satisfaction plotted against system effectiveness]
Does it? How well?
How we measure
System Effectiveness
Similarity scale (we normalize to [0, 1])
    Broad: 0, 1 or 2
    Fine: 0, 1, 2, ..., 100
Effectiveness measure
    AG@5: ignore the ranking
    nDCG@5: discount by rank

What correlates better
with user satisfaction?
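
The two measures can be sketched as follows. The exact formulas are not given on the slide, so this is one common formulation and rests on assumptions: AG@5 is taken as the mean normalized gain of the top 5 documents, nDCG@5 uses a log2(rank + 1) discount, and the ideal list is assumed to have the maximum gain at every rank. The function names and the example judgments are hypothetical.

```python
import math

def normalize(judgments, max_score):
    """Map raw similarity judgments (Broad: max 2, Fine: max 100) to [0, 1]."""
    return [j / max_score for j in judgments]

def ag_at_k(gains, k=5):
    """Average Gain at k: mean gain of the top k documents, ranking ignored."""
    return sum(gains[:k]) / k

def ndcg_at_k(gains, k=5):
    """nDCG at k with an assumed log2(rank + 1) discount.

    Assumption: the ideal list has gain 1 at every rank.
    """
    dcg = sum(g / math.log2(rank + 1)
              for rank, g in enumerate(gains[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, k + 1))
    return dcg / ideal

# Same documents, opposite order (made-up Broad judgments).
best_first = normalize([2, 0, 0, 0, 0], max_score=2)
best_last = normalize([0, 0, 0, 0, 2], max_score=2)

print("AG@5:  ", ag_at_k(best_first), ag_at_k(best_last))
print("nDCG@5:", round(ndcg_at_k(best_first), 3), round(ndcg_at_k(best_last), 3))
```

AG@5 scores both lists the same because it ignores rank, while nDCG@5 rewards the list that puts the similar clip first.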
Experiment
[Figure panels: known effectiveness; user preference; non-preference]
What can we infer?
               Preference
           (difference noticed by user)
 Positive: user agrees with evaluation
Negative: user disagrees with evaluation

           Non-preference
         (difference not noticed by user)
 Good: both systems are satisfying
  Bad: both systems are unsatisfying
Data
  Clips and Similarity Judgments from
  MIREX 2011 Audio Music Similarity

    Random and Artificial examples
        Query: selected randomly
System outputs: random lists of 5 documents

 2200 examples for 73 unique queries
2869 unique lists with 3031 unique clips
    balanced and complete design
Subjects
Crowdsourcing
  Cheap, fast and… diverse pool of subjects
[Figure: 2200 examples, quality control, worker pool]
  Trap examples (known answers)
  $0.03 per example
Results
       6895 total answers
  881 workers from 62 countries


  3393 accepted answers (41%)
   100 workers (87% rejected!)

95% average quality when accepted
How good is my system?
   884 nonpreferences (40%)
What do we expect?
   Linear mapping
What do we have?
   Room for ~20% improvement with personalization
Is system A better than B?
   1316 preferences (60%)
What do we expect?
   Users always notice the difference… regardless of how large it is
What do we have?
   Differences of >.3 and >.4 are needed for >50% of users to agree
   Fine scale is closer to the ideal 100%
   Do users prefer the (supposedly) worse system?
Statistical Significance

     has nothing
    to do with this
Picture by Ronny Welter
Reporting Results
Confidence intervals / Variance

      0.584 ± .023 rather than just 0.584
 Indicator of evaluation error
   Better understanding of
  expected user satisfaction
Reporting Results
            Actual p-values

+0.037 ± .031 (p=0.02) rather than +0.037 ± .031 *
 Statistical Significance is relative
       α=0.05 and α=0.01
     are completely arbitrary
Depends on context, cost of Type I
 errors and implementation, etc.
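
A report in this style could be produced along the following lines. This is a minimal sketch, not the authors' procedure: the per-query differences are randomly generated stand-ins, the function names are hypothetical, and both the interval and the p-value use a normal approximation, so a t distribution would be more appropriate for small query samples.

```python
import math
import random
import statistics
from statistics import NormalDist

def report(diffs, confidence=0.95):
    """Return (mean difference, interval half-width, two-sided p-value).

    Assumption: normal approximation for both the interval and the p-value.
    """
    mean = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(mean) / std_err))
    return mean, z * std_err, p_value

def format_report(diffs):
    """Format as 'difference ± half-width (p=...)' instead of a bare asterisk."""
    mean, half_width, p_value = report(diffs)
    return f"{mean:+.3f} ± {half_width:.3f} (p={p_value:.3f})"

# Made-up per-query differences for 100 queries.
rng = random.Random(9)
example_diffs = [rng.gauss(0.04, 0.15) for _ in range(100)]
print(format_report(example_diffs))
```

Reporting all three numbers lets the reader judge effect size and confidence separately and apply whatever α suits the context.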
let’s review two papers

        (again)
paper A:
+0.14*
paper B:
+0.21
  …which one should get published?
  a.k.a. which research line should we follow?
paper A (500 queries):
+0.14 ± 0.03 (p=0.048)
paper B (50 queries):
+0.21 ± 0.02 (p=0.052)
  …which one should get published?
   a.k.a. which research line should we follow?
paper A:
+0.14 *
paper B:
+0.14 *
  …which one should get published?
  a.k.a. which research line should we follow?
paper A (cost=$500,000):
+0.14 ± 0.01 (p=0.004)
paper B (cost=$50):
+0.14 ± 0.03 (p=0.043)
  …which one should get published?
  a.k.a. which research line should we follow?
effect-sizes are
indicators of user satisfaction
      need to personalize results
    small differences are not noticed

        p-values are
  indicators of confidence
        beware of collection size

need to provide full reports
The difference between
    “Significant” and
    “Not Significant”
       is not itself
 statistically significant
                   ― A. Gelman & H. Stern

 
Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)Symbolic Melodic Similarity (through Shape Similarity)
Symbolic Melodic Similarity (through Shape Similarity)
 
Evaluation in Audio Music Similarity
Evaluation in Audio Music SimilarityEvaluation in Audio Music Similarity
Evaluation in Audio Music Similarity
 
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information RetrievalValidity and Reliability of Cranfield-like Evaluation in Information Retrieval
Validity and Reliability of Cranfield-like Evaluation in Information Retrieval
 
On the Measurement of Test Collection Reliability
On the Measurement of Test Collection ReliabilityOn the Measurement of Test Collection Reliability
On the Measurement of Test Collection Reliability
 
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
Towards Minimal Test Collections for Evaluation of Audio Music Similarity and...
 
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
The University Carlos III of Madrid at TREC 2011 Crowdsourcing Track: Noteboo...
 
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
Information Retrieval Meta-Evaluation: Challenges and Opportunities in the Mu...
 
Audio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and StabilityAudio Music Similarity and Retrieval: Evaluation Power and Stability
Audio Music Similarity and Retrieval: Evaluation Power and Stability
 

How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

  • 1. How Significant is Statistically Significant? The Case of Audio Music Similarity and Retrieval. Authors: @julian_urbano (University Carlos III of Madrid), J. Stephen Downie (University of Illinois at Urbana-Champaign), Brian McFee (University of California at San Diego), Markus Schedl (Johannes Kepler University Linz). ISMIR 2012, Porto, Portugal · October 9th. Picture by Humberto Santos.
  • 3. paper A: +0.14* (statistically significant) vs. paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
  • 5. paper A: +0.14* vs. paper B: +0.14* …which one should get published? a.k.a. which research line should we follow?
  • 7. Goal of Comparing Systems… Find out the effectiveness difference 𝒅 (for an arbitrary query and an arbitrary user). Impossible! It requires running the systems for the universe of all queries. [Figure: Δeffectiveness axis from -1 to 1, with 𝑑 marked]
  • 8. …what Evaluations can do: estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. [Figure: Δeffectiveness axis from -1 to 1, with 𝑑 marked]
  • 12. …what Evaluations can do: estimate 𝒅 with the average 𝑑 over a sample of queries 𝓠. There is always random error… …so we need a measure of confidence.
  • 14. The Significance Drill. Test these hypotheses: H0: 𝑑 = 0 vs. H1: 𝑑 ≠ 0. Result of the test… p-value = P( 𝒅 | H0 ), the probability of observing a difference at least as extreme as the measured one if H0 were true. …interpretation of the test: if the p-value is very small, reject H0; otherwise, accept H0.
  • 15. The Significance Drill. Test these hypotheses: H0: 𝑑 = 0 vs. H1: 𝑑 ≠ 0. We accept/reject H0 (based on the p-value and α)… …not the test!
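
The drill can be sketched in Python. This is only an illustration under stated assumptions, not the test used in the talk: it swaps in an exact sign-flip randomization test because that needs nothing beyond the standard library. The function names and the per-query differences are hypothetical.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Two-sided p-value of an exact sign-flip randomization test.

    Under H0 (no difference between systems) the sign of each per-query
    difference is arbitrary, so every sign assignment is equally likely.
    """
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    total = 0
    # Enumerates all 2^n sign assignments, so keep n small.
    for signs in product((1, -1), repeat=n):
        flipped_mean = sum(s * d for s, d in zip(signs, diffs)) / n
        # Small tolerance guards against floating-point ties.
        if abs(flipped_mean) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


def decide(p_value, alpha):
    """The test only yields a p-value; choosing alpha and deciding is on us."""
    return "reject H0" if p_value < alpha else "accept H0"


# Hypothetical per-query effectiveness differences (system A minus system B).
example_diffs = [0.1, 0.1, 0.1, 0.1, 0.1]
p = sign_flip_p_value(example_diffs)
print(p, decide(p, alpha=0.05), decide(p, alpha=0.10))
```

The same p-value leads to different decisions depending on α, which is the point of slide 15: the decision belongs to the researcher, not to the test.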
  • 16. Usual (wrong) conclusions: A is substantially better than B; A is much better than B; the difference is important; the difference is significant.
  • 18. What does it mean? That there is a difference (unlikely due to chance/random error). We don’t need fancy statistics… …we already know they are different!
  • 19. H0: 𝒅 = 0 is false by definition because systems A and B are different to begin with
  • 21. What is really important? The effect-size: the magnitude of 𝑑. This is what predicts user satisfaction, not p-values. 𝒅 = +0.6 is a huge improvement; 𝒅 = +0.0001 is irrelevant… …and yet, it can easily be statistically significant.
  • 25. Example: t-test. $t = \bar{d} \cdot \sqrt{|\mathcal{Q}|} \,/\, s_d$. The larger the statistic $t$, the smaller the p-value. How to achieve statistical significance? a) Reduce variance b) Further improve the system c) Evaluate with more queries!
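
A minimal sketch of option (c), assuming the usual paired t statistic (mean difference times the square root of the number of queries, divided by the standard deviation of the differences). The function names, the standard deviation of 0.1, and the use of a normal quantile as the critical value (a large-sample approximation to the t distribution) are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def t_statistic(mean_diff, sd_diff, n_queries):
    """Paired t statistic: mean_diff * sqrt(n_queries) / sd_diff."""
    return mean_diff * math.sqrt(n_queries) / sd_diff


def queries_needed(mean_diff, sd_diff, alpha=0.05):
    """Smallest query count at which t reaches the two-sided critical value.

    Uses the normal quantile as the critical value, which is only a
    reasonable stand-in for the t quantile when the sample is large.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * sd_diff / mean_diff) ** 2)


# With the effect size and variance held fixed, t grows with sqrt(n).
for n in (100, 400, 1600):
    print(n, t_statistic(0.01, 0.1, n))

# Even an irrelevant difference of 0.0001 becomes "significant" eventually.
print(queries_needed(0.0001, 0.1))
```

Quadrupling the number of queries doubles the statistic, so any nonzero difference crosses the threshold once the sample is large enough.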
  • 26. Statistical Significance is eventually meaningless… …all you have to do is use enough queries
  • 27. Practical Significance: effect-size 𝑑 (effectiveness / satisfaction). Statistical Significance: p-value (confidence). An improvement may be statistically significant, but that doesn’t mean it’s important!
  • 28. the real importance of an improvement
  • 29. Purpose of Evaluation. How good is my system? (effectiveness, from 0 to 1). Is system A better than system B? (Δeffectiveness, from -1 to 1). We measure system effectiveness.
  • 30. Assumption: System Effectiveness corresponds to User Satisfaction. [Plot: user satisfaction vs. system effectiveness]
  • 35. Assumption: System Effectiveness corresponds to User Satisfaction (this is our ultimate goal!). Does it? How well?
  • 36. How we measure System Effectiveness. Similarity scale (we normalize to [0, 1]): Broad: 0, 1 or 2; Fine: 0, 1, 2, ..., 100. Effectiveness measure: AG@5 (ignores the ranking); nDCG@5 (discounts by rank). What correlates better with user satisfaction?
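
The two measures can be sketched as follows. The slides do not give the exact formulas, so this is an assumption-laden illustration: AG@5 is taken as the average gain of the top 5 results divided by the maximum gain of the scale, and nDCG@5 uses a log2 rank discount with an ideal list made of 5 documents at maximum gain. The function names are hypothetical.

```python
import math


def ag_at_k(gains, max_gain, k=5):
    """Average gain of the top k results, normalized to [0, 1].

    The ranking inside the top k is ignored.
    """
    return sum(gains[:k]) / (k * max_gain)


def ndcg_at_k(gains, max_gain, k=5):
    """Discounted cumulative gain of the top k, normalized to [0, 1].

    Assumptions: log2(rank + 1) discount, and an ideal list that holds
    k documents with the maximum gain of the similarity scale.
    """
    dcg = sum(g / math.log2(rank + 1)
              for rank, g in enumerate(gains[:k], start=1))
    ideal = sum(max_gain / math.log2(rank + 1) for rank in range(1, k + 1))
    return dcg / ideal


# Broad-scale judgments (0, 1 or 2) for two lists with the same documents.
best_first = [2, 1, 0, 0, 0]
best_last = [0, 0, 0, 1, 2]
print(ag_at_k(best_first, 2), ag_at_k(best_last, 2))
print(ndcg_at_k(best_first, 2), ndcg_at_k(best_last, 2))
```

Both lists get the same AG@5, while nDCG@5 rewards the list that places the similar clips first.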
  • 39. Experiment known effectiveness
  • 41. Experiment non-preference
  • 42. What can we infer? Preference (difference noticed by user): Positive, the user agrees with the evaluation; Negative, the user disagrees with the evaluation. Non-preference (difference not noticed by user): Good, both systems are satisfying; Bad, both systems are unsatisfying.
  • 43. Data. Clips and similarity judgments from MIREX 2011 Audio Music Similarity. Random and artificial examples: query selected randomly; system outputs are random lists of 5 documents. 2200 examples for 73 unique queries; 2869 unique lists with 3031 unique clips. Balanced and complete design.
  • 44. Subjects. Crowdsourcing: cheap, fast and… a diverse pool of subjects. 2200 examples at $0.03 per example. Quality control: trap examples (known answers). Worker pool.
  • 45. Results. 6895 total answers; 881 workers from 62 countries; 3393 accepted answers (41%); 100 workers (87% rejected!); 95% average quality when accepted.
  • 46. How good is my system? 884 non-preferences (40%). What do we expect?
  • 47. How good is my system? 884 non-preferences (40%). Linear mapping.
  • 48. How good is my system? 884 non-preferences (40%). What do we have?
  • 51. How good is my system? 884 non-preferences (40%). Room for ~20% improvement with personalization.
  • 52. Is system A better than B? 1316 preferences (60%). What do we expect?
  • 53. Is system A better than B? 1316 preferences (60%). Users always notice the difference… …regardless of how large it is.
  • 54. Is system A better than B? 1316 preferences (60%). What do we have?
  • 57. Is system A better than B? 1316 preferences (60%). Differences above 0.3 and 0.4 are needed for more than 50% of users to agree.
  • 58. Is system A better than B? 1316 preferences (60%). The Fine scale is closer to the ideal 100%.
  • 59. Is system A better than B? 1316 preferences (60%). Do users prefer the (supposedly) worse system?
  • 61. Statistical Significance has nothing to do with this
  • 64. Reporting Results. Confidence intervals / variance: 0.584 ± .023. An indicator of evaluation error. Better understanding of expected user satisfaction.
  • 65. Reporting Results. Actual p-values: +0.037 ± .031 *
  • 66. Reporting Results. Actual p-values: +0.037 ± .031 (p=0.02). Statistical Significance is relative: α=0.05 and α=0.01 are completely arbitrary. It depends on context, cost of Type I errors and implementation, etc.
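
A full report of this kind could be produced along the following lines. This is a sketch under assumptions: the slides do not say how the interval was computed, so the code treats the ± value as the half-width of a 95% confidence interval and uses a normal approximation for both the interval and the p-value (a t quantile would give a slightly wider interval for small samples). The function names and the sample data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def summarize(diffs, confidence=0.95):
    """Return (mean difference, CI half-width, two-sided p-value).

    Normal approximation throughout; an assumption, not the talk's method.
    """
    n = len(diffs)
    d_bar = mean(diffs)
    std_error = stdev(diffs) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(d_bar) / std_error))
    return d_bar, half_width, p_value


def format_report(d_bar, half_width, p_value):
    """Effect size, interval and actual p-value instead of a bare asterisk."""
    return f"{d_bar:+.3f} ± {half_width:.3f} (p={p_value:.3f})"


# Hypothetical per-query differences between two systems.
example_diffs = [0.12, -0.03, 0.08, 0.05, -0.01, 0.10, 0.02, 0.04]
print(format_report(*summarize(example_diffs)))
```

Reporting all three numbers lets readers apply their own α and judge the effect size for themselves.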
  • 67. let’s review two papers (again)
  • 68. paper A: +0.14* paper B: +0.21 …which one should get published? a.k.a. which research line should we follow?
  • 69. paper A (500 queries): +0.14 ± 0.03 (p=0.048) paper B (50 queries): +0.21 ± 0.02 (p=0.052) …which one should get published? a.k.a. which research line should we follow?
  • 71. paper A: +0.14 * paper B: +0.14 * …which one should get published? a.k.a. which research line should we follow?
  • 72. paper A (cost=$500,000): +0.14 ± 0.01 (p=0.004) paper B (cost=$50): +0.14 ± 0.03 (p=0.043) …which one should get published? a.k.a. which research line should we follow?
  • 74. Effect-sizes are indicators of user satisfaction: need to personalize results; small differences are not noticed. p-values are indicators of confidence: beware of collection size; need to provide full reports.
  • 75. The difference between “Significant” and “Not Significant” is not itself statistically significant ― A. Gelman & H. Stern