How Significant is Statistically Significant? The case of Audio Music Similarity and Retrieval

@julian_urbano University Carlos III of Madrid
J. Stephen Downie University of Illinois at Urbana-Champaign
Brian McFee University of California at San Diego
Markus Schedl Johannes Kepler University Linz

ISMIR 2012
Porto, Portugal · October 9th
Picture by Humberto Santos

paper A: +0.14* (statistically significant)
paper B: +0.21
…which one should get published?
a.k.a. which research line should we follow?

paper A: +0.14*
paper B: +0.14*

Goal of Comparing Systems…
Find out the effectiveness difference 𝒅
(arbitrary query and arbitrary user)
Impossible!

requires running
the systems for the
universe of all queries

[Figure: Δeffectiveness axis from -1 to 1, with the difference 𝑑 marked between 0 and 1]

…what Evaluations can do
Estimate 𝒅 with the average 𝑑
over a sample of queries 𝓠

[Figure: two Δeffectiveness axes from -1 to 1; in one the estimate 𝑑 falls between 0 and 1, in the other it falls just below 0]


There is always random error
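
The effect of random error can be illustrated with a small simulation. This is only a sketch under made-up assumptions: per-query differences are drawn from a normal distribution with a hypothetical true difference of 0.05 and a hypothetical spread of 0.2, and `estimate_difference` is an illustrative name that does not come from the talk.

```python
import random
import statistics

TRUE_D = 0.05   # hypothetical true difference (assumption, not from the talk)
SPREAD = 0.2    # hypothetical per-query standard deviation (assumption)

def estimate_difference(num_queries, rng):
    """Average the per-query differences over one random sample of queries."""
    diffs = [rng.gauss(TRUE_D, SPREAD) for _ in range(num_queries)]
    return statistics.mean(diffs)

rng = random.Random(2012)  # fixed seed so the run is repeatable
estimates = [estimate_difference(50, rng) for _ in range(1000)]

# Each sample of 50 queries yields a different estimate of the same true difference.
wrong_sign_share = sum(e < 0 for e in estimates) / len(estimates)
print("smallest estimate:", round(min(estimates), 3))
print("largest estimate: ", round(max(estimates), 3))
print("share of estimates with the wrong sign:", wrong_sign_share)
```

Under these assumptions every sample estimates the same true difference, yet the estimates scatter around it and a few even point in the wrong direction.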

…so we need a
measure of confidence

The Significance Drill
Test these hypotheses
H0: 𝒅 = 0
H1: 𝒅 ≠ 0

Result of the test…
p-value = P( a difference at least as extreme as the observed 𝑑 | H0 )
…interpretation of the test
p-value is very small: reject H0
otherwise: accept H0

We accept/reject H0…
(based on the p-value and α)

…not the test!

Usual (wrong) conclusions
A is substantially better than B

A is much better than B

The difference is important

The difference is significant

What does it mean?
That there is a difference
(unlikely due to chance/random error)

We don’t need fancy statistics…

…we already know
they are different!

H0: 𝒅 = 0
is false by definition

because systems A and B
are different to begin with

What is really important?
The effect-size:
magnitude of 𝑑
This is what predicts user
satisfaction, not p-values

𝒅 = +0.6 is a huge improvement
𝒅 = +0.0001 is irrelevant…
…and yet, it can easily be
statistically significant

Example: t-test
$t = \frac{\bar{d}\,\sqrt{|\mathcal{Q}|}}{s_d}$
The larger the statistic $t$, the smaller the p-value
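
As a rough illustration of the statistic above, the sketch below computes t from a list of per-query differences. The numbers are made up, the function names are illustrative, and the p-value uses a normal approximation rather than Student's t distribution, so it is only indicative for a sample this small.

```python
import math
import statistics

def paired_t_statistic(diffs):
    """t = mean(diffs) * sqrt(n) / stdev(diffs), for per-query differences."""
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample standard deviation
    return d_bar * math.sqrt(n) / s_d

def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation to the t distribution.
    An exact test would use Student's t with n - 1 degrees of freedom."""
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))

# Hypothetical per-query differences between systems A and B (made-up numbers).
diffs = [0.10, -0.05, 0.20, 0.00, 0.15, 0.05, -0.10, 0.25, 0.10, 0.05]
t = paired_t_statistic(diffs)
p = approx_two_sided_p(t)
print("t =", round(t, 3), " approx. p =", round(p, 4))
```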

How to achieve statistical significance?

Example: t-test
$t = \frac{\bar{d}\,\sqrt{|\mathcal{Q}|}}{s_d}$

a) Reduce variance
b) Further improve the system
c) Evaluate with more queries!

Statistical Significance is
eventually meaningless…

…all you have to do is
use enough queries
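
A back-of-the-envelope sketch of this point, assuming the average difference and its standard deviation stay fixed as queries are added. The values 0.01 and 0.2 are hypothetical, the function names are illustrative, and the p-value again relies on a normal approximation.

```python
import math
import statistics

def approx_p_value(d_bar, s_d, num_queries):
    """Two-sided p-value for t = d_bar * sqrt(n) / s_d (normal approximation)."""
    t = d_bar * math.sqrt(num_queries) / s_d
    return 2 * (1 - statistics.NormalDist().cdf(abs(t)))

def queries_needed(d_bar, s_d, alpha=0.05):
    """Smallest n with approx. p < alpha, assuming d_bar and s_d stay fixed."""
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * s_d / abs(d_bar)) ** 2)

# Hypothetical tiny improvement with a hypothetical spread (made-up numbers).
d_bar, s_d = 0.01, 0.2
for n in (50, 500, 5000):
    print(n, "queries -> approx. p =", round(approx_p_value(d_bar, s_d, n), 4))
print("queries needed for p < 0.05:", queries_needed(d_bar, s_d))
```

The effect size never changes in this sketch; only the number of queries does.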

Practical Significance: Effect-Size 𝑑
Effectiveness / Satisfaction
Statistical Significance: p-value
Confidence

An improvement may be
statistically significant, but that
doesn’t mean it’s important!

the real importance
of an improvement

Purpose of Evaluation
How good is my system? (effectiveness, from 0 to 1)
Is system A better than system B? (Δeffectiveness, from -1 to 1)

We measure system effectiveness

Assumption
System Effectiveness corresponds to User Satisfaction
[Figure: user satisfaction plotted against system effectiveness]

User Satisfaction: this is our ultimate goal!

Does it? How well?

How we measure
Similarity scale (we normalize to [0, 1])
Broad: 0, 1 or 2
Fine: 0, 1, 2, ..., 100
Effectiveness measure
AG@5: ignore the ranking
nDCG@5: discount by rank
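
The two measures can be sketched as follows. The slides do not spell out the formulas, so this uses common definitions as an assumption: gains are divided by the top of the scale, nDCG uses a log2 rank discount, and the ideal list is taken to have the maximum gain at every rank. Function names are illustrative.

```python
import math

def ag_at_k(gains, max_gain, k=5):
    """Average Gain: mean normalized gain of the top k, ignoring their order."""
    return sum(g / max_gain for g in gains[:k]) / k

def ndcg_at_k(gains, max_gain, k=5):
    """nDCG with a log2 rank discount.
    Assumption: the ideal list has the maximum gain at every rank."""
    discounts = [1 / math.log2(rank + 1) for rank in range(1, k + 1)]
    dcg = sum(g * d for g, d in zip(gains[:k], discounts))
    ideal = sum(max_gain * d for d in discounts)
    return dcg / ideal

# Broad-scale judgments (0, 1 or 2) for two hypothetical lists of 5 clips.
best_first = [2, 0, 0, 0, 0]
best_last = [0, 0, 0, 0, 2]
for name, gains in (("best first", best_first), ("best last", best_last)):
    print(name, round(ag_at_k(gains, 2), 3), round(ndcg_at_k(gains, 2), 3))
```

Both lists contain the same clips, so AG@5 is identical, while nDCG@5 rewards placing the similar clip at the top.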

What correlates better
with user satisfaction?

Experiment
[Figure: example shown to users, labeled "known effectiveness", "user preference" and "non-preference"]

What can we infer?
Preference
(difference noticed by user)
Positive: user agrees with evaluation
Negative: user disagrees with evaluation

Non-preference
(difference not noticed by user)
Good: both systems are satisfying
Bad: both systems are unsatisfying

Data
Clips and Similarity Judgments from
MIREX 2011 Audio Music Similarity

Random and Artificial examples
Query: selected randomly
System outputs: random lists of 5 documents

2200 examples for 73 unique queries
2869 unique lists with 3031 unique clips
balanced and complete design

Subjects
Crowdsourcing
Cheap, fast and… diverse pool of subjects

2200 examples
$0.03 per example
Quality control: trap examples (known answers)
Worker pool

Results
6895 total answers
881 workers from 62 countries

3393 accepted answers (41%)
100 workers (87% rejected!)

95% average quality when accepted

How good is my system?
884 nonpreferences (40%)

What do we expect?


Linear
mapping


What do we have?

room for ~20%
improvement
with
personalization

Is system A better than B?
1316 preferences (60%)

What do we expect?


Users always notice
the difference…

…regardless of
how large it is


What do we have?


>.3 & >.4 differences for
>50% of users to agree


Fine scale is closer
to the ideal 100%


Do users prefer the
(supposedly)
worse system?

Statistical Significance

has nothing
to do with this

Reporting Results
Confidence intervals / Variance

0.584

Reporting Results
Confidence intervals / Variance

0.584 ± .023
Indicator of evaluation error
Better understanding of
expected user satisfaction
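
One way such an interval could be produced is sketched below. The slides do not state how the ± value was obtained, so treating it as a 95% normal-approximation interval around the mean is an assumption, as are the scores and the function name.

```python
import math
import statistics

def mean_with_interval(scores, confidence=0.95):
    """Return (mean, half-width) of a normal-approximation confidence interval."""
    n = len(scores)
    mean = statistics.mean(scores)
    std_error = statistics.stdev(scores) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * std_error

# Hypothetical per-query effectiveness scores in [0, 1] (made-up numbers).
scores = [0.62, 0.55, 0.71, 0.48, 0.60, 0.66, 0.52, 0.58, 0.64, 0.49]
mean, half_width = mean_with_interval(scores)
print(f"{mean:.3f} ± {half_width:.3f}")
```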

Reporting Results
Actual p-values

+0.037 ± .031 *

Reporting Results
Actual p-values

+0.037 ± .031 (p=0.02)
Statistical Significance is relative
α=0.05 and α=0.01
are completely arbitrary
Depends on context, cost of Type I
errors and implementation, etc.

let’s review two papers

(again)

paper A:
+0.14*
paper B:
+0.21

paper A (500 queries):
+0.14 ± 0.03 (p=0.048)
paper B (50 queries):
+0.21 ± 0.02 (p=0.052)

paper A:
+0.14 *
paper B:
+0.14 *

paper A (cost=$500,000):
+0.14 ± 0.01 (p=0.004)
paper B (cost=$50):
+0.14 ± 0.03 (p=0.043)

effect-sizes are
indicators of user satisfaction
need to personalize results
small differences are not noticed

p-values are
indicators of confidence
beware of collection size

need to provide full reports

The difference between
“Significant” and
“Not Significant”
is not itself
statistically significant
― A. Gelman & H. Stern
