On p-values

Maarten van Smeden
Annual Julius Symposium 2016

About
• statistician by training
• PhD (2016): diagnostic research in the absence of a gold standard (JC)
• post-doc: biostatistics / epidemiological methods (JC)

About this workshop
p-value?
ASA statement: why and what?
p-value alternatives?

Go to:
pvalue.presenterswall.nl

Point of departure
I am skeptical whenever I see a p-value

The pioneers
Ronald Aylmer Fisher (1890-1962)
Jerzy Neyman (1894-1981)
Egon Pearson (1895-1980)

p-value ≥ α → “no effect”
p-value < α → “effect!”
α = .05, unless…
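The dichotomous rule on this slide fits in a few lines of code. A minimal sketch, assuming the conventional α = .05; the function name `bright_line_verdict` is hypothetical. It shows how two almost identical p-values receive opposite verdicts.

```python
# Minimal sketch of the mechanical NHST decision rule described above.
# ALPHA = 0.05 is the conventional default, an assumption rather than a law.
ALPHA = 0.05


def bright_line_verdict(p_value: float, alpha: float = ALPHA) -> str:
    """Return the label a mechanical 'p < alpha' rule would attach."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    return "effect!" if p_value < alpha else "no effect"


if __name__ == "__main__":
    # Two nearly identical p-values receive opposite verdicts.
    for p in (0.049, 0.051):
        print(f"p = {p:.3f} -> {bright_line_verdict(p)}")
```

The rule throws away everything except which side of α the p-value falls on.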

… the p-value fails
“arguably significant” (P = 0.07)
“direction heading to significance” (P = 0.10)
“flirting with conventional levels of significance” (P > 0.1)
“marginally significant” (P ≥ 0.1)
A convenience sample from: https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/,
which lists 509 expressions for non-significant results at the α = .05 level (as of 24 October 2016)

+ 23 supplementary files (!!!)
Wasserstein & Lazar (2016). The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70:2, 129-133

A few quotes (1)
“The ASA has not previously taken positions on specific
matters of statistical practice.”
NB: the ASA was founded in 1839
“Nothing in the ASA statement is new.”
from the ASA Statement

A few quotes (2)
“… process was lengthier and more controversial than
anticipated.”
“… the statement articulates in non-technical terms a few select
principles that could improve the conduct or interpretation of
quantitative science, according to widespread consensus in the
statistical community.”

Go to
pvalue.presenterswall.nl

Why do we need a statement?
“It’s science’s dirtiest secret: The ‘scientific method’ of testing
hypotheses by statistical analysis stands on a flimsy
foundation.”
Quoting Siegfried (2010). Odds Are, It’s Wrong: Science Fails to Face the Shortcomings of Statistics. Science News, 177, 26.
From the ASA Statement: Wasserstein & Lazar (2016). The ASA's Statement on p-Values: Context, Process, and Purpose. The American Statistician, 70:2, 129-133

OK, but why now?
“… highly visible discussions over the last few years”
“The statistical community has been deeply concerned about
issues of reproducibility and replicability …”
from the ASA statement

In popular media
http://www.vox.com/2016/3/15/11225162/p-value-simple-definition-hacking
(~ 50 million unique visitors monthly)

Drastic measures…
NHST = null hypothesis significance testing

P-values increasingly central in reporting
From: Chavalarias et al. JAMA. 2016;315(11):1141-1148, doi:10.1001/jama.2016.1952
Using text mining of >1.6 million abstracts

In the large (‘big’) data era
“With a combination of large datasets, confounding, flexibility in
analytical choices …, and superimposed selective reporting
bias, using a P < 0.05 threshold to declare ‘success,’ …
means next to nothing.”
From ASA supplementary material, response by Ioannidis.

To summarise: why?
• p-values and the P < .05 rule are at the core of inference in
today’s science (social, biomedical, …)
• there is growing concern that these inferences are often wrong
• perhaps, if we understand p-values better, we’ll be wrong
less often

The statement: 6 principles
1. P-values can indicate how incompatible the data are with a specified
statistical model.
2. P-values do not measure the probability that the studied hypothesis is
true, or the probability that the data were produced by random chance
alone.
3. Scientific conclusions and business or policy decisions should not be
based only on whether a p-value passes a specific threshold.
4. Proper inference requires full reporting and transparency.
5. A p-value, or statistical significance, does not measure the size of an
effect or the importance of a result.
6. By itself, a p-value does not provide a good measure of evidence
regarding a model or hypothesis.

Statistical model?
• every method of statistical inference relies on a web of
assumptions which together can be viewed as a ‘statistical
model’
• the tested hypothesis is one of these assumptions; often it is a
‘zero effect’ hypothesis, called the ‘null hypothesis’

About assumptions
the calculation of p-values always relies on assumptions
besides the hypothesis tested. It is easy to ignore/forget those
assumptions while analysing.
Your assumptions are your windows on the world.
Scrub them off every once in a while, or the light
won't come in.
Alan Alda
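To see how much a p-value depends on assumptions besides the tested hypothesis, the sketch below analyses the same two groups twice: once with a z-test that assumes normal data with a known standard deviation of 1, and once with an exact permutation test that only assumes the group labels are exchangeable under the null. The data and function names are hypothetical, made up for illustration.

```python
import itertools
from statistics import NormalDist, fmean

# Hypothetical measurements for two small groups (illustrative only).
group_a = [5.1, 4.9, 6.2, 5.8, 6.0]
group_b = [4.2, 4.8, 5.0, 4.5, 4.9]


def z_test_p(a, b, sigma=1.0):
    """Two-sided p-value ASSUMING normal data with known SD sigma in both groups."""
    se = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def permutation_p(a, b):
    """Exact two-sided permutation p-value: assumes only exchangeable labels under H0."""
    pooled = a + b
    observed = abs(fmean(a) - fmean(b))
    extreme = total = 0
    # Enumerate every way of assigning len(a) of the pooled values to group A.
    for idx in itertools.combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        new_a = [pooled[i] for i in idx]
        new_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point ties with the observed split count as extreme.
        if abs(fmean(new_a) - fmean(new_b)) >= observed - 1e-9:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    print(f"z-test (sigma = 1 assumed): p = {z_test_p(group_a, group_b):.3f}")
    print(f"permutation test:           p = {permutation_p(group_a, group_b):.3f}")
```

With these made-up data the z-test gives p ≈ 0.15 and the permutation test p ≈ 0.02, because the assumed σ = 1 is much larger than the spread actually seen in the data. Neither is “the” p-value; each is conditional on its own model.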

2. P-values do not measure the probability that the studied hypothesis
is true, or the probability that the data were produced by random
chance alone.

From a probability point of view
p-value*: P(Data | Hypothesis)
is not: P(Hypothesis | Data)
*Somewhat simplified; more precise notation would be P(T(X) ≥ T(x) | Hypothesis), with T a test statistic, X the random data and x the observed data
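The footnote's definition, P(T(X) ≥ T(x) | Hypothesis), can be computed exactly in a simple case. A sketch assuming hypothetical data of 60 heads in 100 coin tosses, the hypothesis that the coin is fair, and T = number of heads (so this is a one-sided p-value); the function name is illustrative.

```python
from math import comb


def binomial_upper_tail_p(heads: int, n: int, p0: float = 0.5) -> float:
    """One-sided p-value P(T(X) >= T(x) | H), with T = number of heads, H: P(heads) = p0."""
    # Sum the binomial probabilities of every outcome at least as extreme as observed.
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(heads, n + 1))


if __name__ == "__main__":
    # Hypothetical data: 60 heads in 100 tosses of a coin that is fair under H.
    print(f"P(T(X) >= 60 | fair coin) = {binomial_upper_tail_p(60, 100):.4f}")
```

The result, roughly 0.028, is the probability of data at least this extreme if the coin is fair. It is not the probability that the coin is fair.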

Does it matter?
P(Death | Handgun) = 5% to 20%*
P(Handgun | Death) = 0.028%**
* from New York Times (http://www.nytimes.com article published: 2008/04/03/)
** from CBS StatLine (concerning deaths and registered gun crimes in 2015 in the Netherlands)

If only there were a way…
P(Data|Hypothesis)
P(Hypothesis|Data)

There is…
Reverend Thomas Bayes (1702-1761)
P(H|D) = P(D|H) P(H) / P(D)
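Bayes' theorem turns P(D|H) into P(H|D), but only once a prior P(H) and the probability of the data under the alternative, P(D|not H), are supplied. A sketch for a hypothesis and its complement; all numbers are hypothetical.

```python
def posterior_probability(prior_h: float, p_d_given_h: float, p_d_given_not_h: float) -> float:
    """Bayes' theorem for a hypothesis H and its complement:
    P(H|D) = P(D|H) P(H) / P(D), with P(D) = P(D|H) P(H) + P(D|not H) P(not H)."""
    p_d = p_d_given_h * prior_h + p_d_given_not_h * (1 - prior_h)
    return p_d_given_h * prior_h / p_d


if __name__ == "__main__":
    # Hypothetical numbers: the data are unlikely under H (5%), yet P(H|D)
    # depends heavily on the prior and on how likely D is under the alternative.
    for prior in (0.1, 0.5, 0.9):
        post = posterior_probability(prior, p_d_given_h=0.05, p_d_given_not_h=0.5)
        print(f"P(H) = {prior:.1f}: P(D|H) = 0.05, P(H|D) = {post:.3f}")
```

With P(D|H) fixed at 0.05, P(H|D) ranges from about 0.01 (prior 0.1) to about 0.47 (prior 0.9).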

On bright-line rules
“Practices that reduce data analysis or scientific
inference to mechanical ‘bright-line’ rules (such as ‘p <
0.05’) for justifying scientific claims or conclusions can
lead to erroneous beliefs and poor decision making. A
conclusion does not immediately become ‘true’ on
one side of the divide and ‘false’ on the other.”

If p ~ .05
D Colquhoun (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1: 140216.
“If you want to avoid making a fool of yourself very often, do not
regard anything greater than p < 0.001 as a demonstration that
you have discovered something”
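Colquhoun's argument can be reproduced with a small simulation. A sketch assuming that 10% of tested hypotheses are real effects and that each study has roughly 80% power (similar to the main worked example in his paper); each study is reduced to a single z-statistic. The function name and defaults are our assumptions.

```python
import random
from statistics import NormalDist


def simulate_fdr(n_tests=100_000, prior_real=0.1, shift=2.8, alpha=0.05, seed=1):
    """Fraction of 'significant' results that come from true null hypotheses.

    Each hypothetical study yields one z-statistic: N(0, 1) if the null is true,
    N(shift, 1) if the effect is real (shift = 2.8 gives roughly 80% power at alpha = 0.05).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    false_pos = true_pos = 0
    for _ in range(n_tests):
        real = rng.random() < prior_real
        z = rng.gauss(shift if real else 0.0, 1.0)
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p < alpha:
            if real:
                true_pos += 1
            else:
                false_pos += 1
    return false_pos / (false_pos + true_pos)


if __name__ == "__main__":
    for alpha in (0.05, 0.001):
        print(f"alpha = {alpha}: false discovery rate ~ {simulate_fdr(alpha=alpha):.3f}")
```

For α = .05 the simulated false discovery rate should land near the analytic value 0.9 × 0.05 / (0.9 × 0.05 + 0.1 × 0.8) = 0.36; at α = .001 it drops to a few percent.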

The issue of pre-specified hypotheses
From: http://compare-trials.org/ accessed on 20 November 2016

Ed Yong (2012). Replication studies: Bad copy. Nature. Data credit: D Fanelli.

Why this enormous positivity?
If you torture the data long enough,
it will confess to anything
Ronald Coase
besides journal editors’ requirement of p < .05

Multiple (potential) comparisons
aka 
- p-hacking 
- data fishing 
- data dredging 
- multiple testing 
- multiplicity 
- significance chasing 
- significance questing 
- selective inference 
- etc.
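The danger of many (potential) comparisons is easy to quantify in the simplest case. A sketch assuming k independent tests that all have a true null hypothesis, so each p-value is uniform on [0, 1]. The Bonferroni line is a standard correction added for contrast, not something this slide prescribes.

```python
import random


def prob_at_least_one_hit(k: int, alpha: float = 0.05) -> float:
    """P(at least one p < alpha among k independent tests when every null is true)."""
    return 1 - (1 - alpha) ** k


def simulate_fishing(k=20, alpha=0.05, n_studies=10_000, seed=2):
    """Monte Carlo check: under a true null, a continuous test's p-value is Uniform(0, 1)."""
    rng = random.Random(seed)
    hits = sum(any(rng.random() < alpha for _ in range(k)) for _ in range(n_studies))
    return hits / n_studies


if __name__ == "__main__":
    k = 20
    print(f"analytic:   {prob_at_least_one_hit(k):.3f}")
    print(f"simulated:  {simulate_fishing(k):.3f}")
    # Bonferroni correction: test each of the k outcomes at alpha / k instead.
    print(f"Bonferroni: {prob_at_least_one_hit(k, 0.05 / k):.3f}")
```

With 20 independent null tests, the chance of at least one p < .05 is about 0.64.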

Selective reporting
“Whenever a researcher chooses what to present based on
statistical results, valid interpretation of those results is
severely compromised if the reader is not informed of the choice
and its basis. Researchers should disclose the number of
hypotheses explored during the study, all data collection
decisions, all statistical analyses conducted, and all
p-values computed. Valid scientific conclusions based on
p-values and related statistics cannot be drawn without at least
knowing how many and which analyses were conducted, and
how those analyses (including p-values) were selected for
reporting.”

5. A p-value, or statistical significance, does not measure the size of
an effect or the importance of a result.

About effect size
• statistical significance does not imply practical importance
• to understand practical importance we need information on
the effect size
• Is the p-value a good measure of effect size?
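To illustrate why the p-value is not a measure of effect size: a sketch of a two-sample z-test (standard deviation assumed known and equal to 1) in which the observed standardized difference d and the group size n are chosen freely. Both scenarios are hypothetical.

```python
from statistics import NormalDist


def two_sample_z_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test when the observed standardized mean
    difference equals effect_size (SD assumed known and equal to 1 in both groups)."""
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical scenarios: a tiny effect in a huge sample vs a sizeable effect in a small one.
    print(f"d = 0.02, n = 100000 per group: p = {two_sample_z_p(0.02, 100_000):.2g}")
    print(f"d = 0.50, n = 20 per group:     p = {two_sample_z_p(0.50, 20):.2g}")
```

The tiny effect (d = 0.02) yields p well below .001, while the 25 times larger effect (d = 0.5) yields p ≈ 0.11, purely because of sample size.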

Dance of the p-values
https://www.youtube.com/watch?v=5OL1RqHrZQ8&t=10s
Credits to Professor Geoff Cumming
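A rough stand-in for the “dance” in the video: simulate exact replications of one study design with a fixed true effect and watch how much the p-value jumps around. Assumptions for this sketch: two normal groups of 32 with SD 1, a true standardized difference of 0.5, and a z-test with known SD. These are our choices, not necessarily the settings of Cumming's demonstration.

```python
import random
from statistics import NormalDist, fmean


def replicate_p_values(n_reps=25, n_per_group=32, true_d=0.5, seed=3):
    """p-values from exact replications of one study design with a fixed true effect.

    Each replication draws two normal samples (SD = 1) whose population means differ
    by true_d and runs a two-sided z-test, treating the SD as known.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    se = (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_reps):
        a = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(a) - fmean(b)) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


if __name__ == "__main__":
    ps = replicate_p_values()
    print(" ".join(f"{p:.3f}" for p in ps))
    print(f"min = {min(ps):.4f}, max = {max(ps):.3f}")
```

With these settings the power is only about 50%, so across replications some p-values fall well below .05 and others well above it, although nothing about the underlying truth has changed.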

3. Scientific conclusions and business or policy decisions should not
be based only on whether a p-value passes a specific threshold.

P-values in isolation
“Researchers should recognize that a p-value without context
or other evidence provides limited information. For example, a
p-value near 0.05 taken by itself offers only weak evidence
against the null hypothesis. Likewise, a relatively large p-value
does not imply evidence in favour of the null hypothesis; many
other hypotheses may be equally or more consistent with the
observed data. For these reasons, data analysis should not
end with the calculation of a p-value when other approaches
are appropriate and feasible.”
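A relatively large p-value combined with a confidence interval shows how many hypotheses remain compatible with the data. A sketch with a hypothetical small study (observed difference 0.4, 10 per group, SD assumed known and equal to 1); the function name is illustrative.

```python
from statistics import NormalDist


def z_p_and_ci(diff: float, se: float, level: float = 0.95):
    """Two-sided z-test p-value for H0: difference = 0, plus the matching confidence interval."""
    std_normal = NormalDist()
    p = 2 * (1 - std_normal.cdf(abs(diff / se)))
    crit = std_normal.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return p, (diff - crit * se, diff + crit * se)


if __name__ == "__main__":
    # Hypothetical small study: observed difference 0.4, 10 per group, SD assumed known = 1.
    se = (2 / 10) ** 0.5
    p, (lo, hi) = z_p_and_ci(0.4, se)
    print(f"p = {p:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Here p ≈ 0.37, yet the 95% interval runs from about −0.48 to 1.28: the data are compatible with no effect, but also with effects much larger than the one observed.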


Agreement reached?
“you can believe me that had it been any stronger, then all but
one of the statisticians would have resigned.”
“If only the rest could have agreed with me, we would have a
much stronger statement.”
from SlideShare, by Stephen Senn: P Values and the art of herding cats (accessed on Oct 30 2016)
Stephen Senn, involved in the ASA statement

From a practical point of view
if you work with p-values (derived from the 6 ASA principles):
1. think carefully about the underlying assumptions
2. avoid statements about the truth of the tested hypothesis
3. avoid strong statements about an effect based solely on p < .05,
or about the absence of an effect based solely on p > .05
4. report the number and sequence of analyses; avoid data torture
5. avoid statements about effect size based on the p-value
6. if feasible, use additional information from other inferential
tools

p-value? 
why? 
what? 
p-value alternatives?

Other approaches
• Methods that emphasise estimation rather than testing
  - confidence intervals
  - prediction intervals
  - credible intervals
• Bayesian methods
• Alternative measures of evidence
  - likelihood ratios
  - Bayes factors
• Other approaches
  - decision-theoretic modelling
  - false discovery rates
From ASA statement
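Two of the alternative measures of evidence listed above can be computed for a simple binomial example. A sketch assuming hypothetical data of 60 heads in 100 tosses, H0: θ = 0.5, and, for the Bayes factor, a uniform prior on θ under H1 (a convenient but debatable choice). Function names are illustrative.

```python
from math import comb


def binomial_likelihood(theta: float, k: int, n: int) -> float:
    """P(k successes in n trials | success probability theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def max_likelihood_ratio(k: int, n: int, theta0: float = 0.5) -> float:
    """Likelihood ratio of the best-supported theta (k / n) versus theta0."""
    return binomial_likelihood(k / n, k, n) / binomial_likelihood(theta0, k, n)


def bayes_factor_uniform(k: int, n: int, theta0: float = 0.5) -> float:
    """BF10 for H1: theta ~ Uniform(0, 1) versus H0: theta = theta0.

    Under a uniform prior the marginal likelihood of k successes is exactly 1 / (n + 1).
    """
    return (1 / (n + 1)) / binomial_likelihood(theta0, k, n)


if __name__ == "__main__":
    k, n = 60, 100  # hypothetical data: 60 heads in 100 tosses
    print(f"max likelihood ratio (theta = 0.6 vs 0.5): {max_likelihood_ratio(k, n):.2f}")
    print(f"Bayes factor BF10 (uniform prior):         {bayes_factor_uniform(k, n):.2f}")
```

For these data the two-sided p-value is about 0.06, the maximum likelihood ratio is about 7.5 and the Bayes factor is about 0.9, i.e. essentially no evidence either way under this prior. Different measures answer different questions.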

A too short introduction to Bayesian inference
Remember Bayes?
Reverend Thomas Bayes (1702-1761)

Using Bayes theorem
P(θ|D) = P(D|θ) P(θ) / P(D)
P(θ|D) ∝ P(D|θ) P(θ)
“posterior distribution” ∝ “likelihood” × “prior distribution”
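The proportionality P(θ|D) ∝ P(D|θ) P(θ) suggests a simple numerical recipe: multiply likelihood by prior on a grid of θ values, then divide by the sum, which stands in for P(D). A sketch assuming binomial data (hypothetically, 60 successes in 100 trials) and a flat prior; the names are illustrative.

```python
def grid_posterior(k: int, n: int, grid_size: int = 1001):
    """Grid approximation of P(theta | D) ∝ P(D | theta) P(theta) for binomial data,
    with a flat prior P(theta) = 1 on [0, 1] (an assumption of this sketch)."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size
    # The binomial coefficient does not depend on theta, so it cancels after normalising.
    likelihood = [t**k * (1 - t) ** (n - k) for t in thetas]
    unnormalised = [lik * pr for lik, pr in zip(likelihood, prior)]
    evidence = sum(unnormalised)  # proportional to P(D), up to the grid spacing
    posterior = [u / evidence for u in unnormalised]
    return thetas, posterior


if __name__ == "__main__":
    thetas, post = grid_posterior(60, 100)  # hypothetical data: 60 successes in 100 trials
    mean = sum(t * w for t, w in zip(thetas, post))
    prob_above_half = sum(w for t, w in zip(thetas, post) if t > 0.5)
    print(f"posterior mean of theta: {mean:.3f}")
    print(f"P(theta > 0.5 | D):      {prob_above_half:.3f}")
```

With a flat prior the exact posterior is a Beta(61, 41) distribution with mean 61/102 ≈ 0.598, which the grid reproduces closely.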

Rationale for Bayesian inference
the posterior distribution P(θ|D) is “more informative” than the
likelihood P(D|θ)
However:
“Proponents of the ‘Bayesian revolution’ should be wary of
chasing yet another chimera: an apparently universal inference
procedure. A better path would be to promote both an
understanding of various devices in the ‘statistical toolbox’ and
informed judgment to select among these.”
Gigerenzer and Marewski (2015), Surrogate Science: The Idol of a Universal Method for Scientific Inference. Journal of Management

p-value? 
why? 
what? 
p-value alternatives? 
some final remarks

The words of the pioneer
No scientific worker has a fixed level of
significance at which from year to year, and in
all circumstances, he rejects hypotheses; he rather
gives his mind to each particular case in the light of
his evidence and his ideas.
Ronald Fisher

Many initiatives to improve science…
see: http://www.scienceintransition.nl/english

and reduce waste
~ 85% of all health research is being avoidably “wasted”
see also: http://blogs.bmj.com/bmj/2016/01/14/paul-glasziou-and-iain-chalmers-is-85-of-health-research-really-wasted/,
and: Lancet’s 2014 series on increasing value, reducing waste (incl. videos etc.): http://www.thelancet.com/series/research

Conclusion
• statistical inference is inherently difficult; we should avoid
making a fool of ourselves too often
• p-values can be useful tools for inference; most often,
p-values should not be the ‘star of the inference show’
• bright-line rules such as p < .05 give a false sense of
scientific objectivity
• like to play around with data? Me too! Think twice before you
publish such explorations; if you do, be honest and
transparent in reporting

Some random thoughts
• inference is often thought of as a primarily mathematical or
computational problem; it should not be.
• we should ban the term “significant” from scientific output
for describing effects that are accompanied by p < .05.
• in applied statistics education, we should invest more time
in discussing various forms of inference (e.g., Bayesian
inference) and their merits and pitfalls

Points for discussion
• is there a need for changing the way we do inference?
• if so, how and what do we change?
• education?
• journals?
• should we downplay the role of p < .05 in scientific output?
