This is a lecture that I gave to a Principles of Epidemiology MPH class. It takes a critical look at the use of p-values to judge the strength of evidence, and offers more holistic, informative approaches to interpreting statistical findings such as measures of effect size and confidence intervals.
Hypothesis Testing, Effect Size, Confidence Intervals, & the p-Value Fallacy
1. What’s Significant?
Hypothesis Testing, Effect Size, Confidence Intervals, & the p-Value Fallacy
Patrick B. Barlow, The University of Tennessee
2. On the Agenda…
• Recap of causation
• The basics of hypothesis testing
– From research question to testable hypothesis
• Effect size
– What is it?
– What can impact effect size?
• Confidence Intervals
– What are they?
– How do you interpret?
– What are the implications for interpreting statistical findings?
• Statistical significance & p-values
– What counts as “statistically significant”?
– Weaknesses of the p-value
– The p-value fallacy
• Putting it all Together
3. Recap: Bradford Hill Criteria
• Strength of causal inference is affected by a number of different factors:
– Strength of association
– Consistency
– Specificity
– Temporal relationship
– Biological gradient
– Plausibility
– Coherence
– Experiment (reversibility)
– Analogy (consideration of alternate explanations)
4. From research question to testable hypothesis
Statistical significance & p-values
THE BASICS OF HYPOTHESIS TESTING
5. The Basics of Hypothesis Testing
In statistics, hypothesis testing forms the basis for the majority of inferential statistical tests.
• Three basic components:
– Null hypothesis (H0)
– Alternative/research hypothesis (H1)
– Error
• Type I
• Type II
• Hypothesis testing was originally conceived as a way to minimize error over infinite trials rather than to specify the absolute “truth” in a single scenario.
– Goodman likened hypothesis testing to “a system of justice that is not concerned with which individual defendant is found guilty or innocent…but tries instead to control the overall number of incorrect verdicts.”
6. The Basics of Hypothesis Testing
Null Hypothesis (H0)
• Almost always the statement that no difference or relationship exists between the variables of interest.
• Example: a study looking at deep vein thrombosis (DVT) & the risk of pulmonary embolism (PE)
– The null hypothesis would be…
– “Having DVT does not increase one’s risk for developing a PE.”
Alternative Hypothesis (H1)
• The statement that you will be trying to “prove” by conducting your inferential statistics.
• It is almost always the statement that a difference or relationship does exist between the variables of interest.
• What would be an alternative hypothesis for our example?
– “Having DVT increases the risk of developing a PE.”
7. The Basics of Hypothesis Testing
The two most common errors we encounter in statistical testing are Type I & Type II error. Both of these errors pose serious risks to the integrity of your conclusions if ignored.
• Type I error: falsely concluding that a statistically significant relationship exists when in fact it does not
– “Alpha,” “false positive,” “false alarm,” “red herring,” etc.
– The origin of “p < .05” as statistically significant.
• Type II error: failing to detect a statistically significant relationship when in fact one does exist
– “Beta,” “miss,” “false negative”
– Statistical power & Type II error
The probabilities of committing these two errors are interdependent, so the researcher/analyst must consider which error would be more costly to their study.
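To make Goodman’s “long run” framing concrete, here is a minimal simulation sketch (my own illustration, not from the lecture): both groups are drawn from the same population, so the null hypothesis is true and every “significant” result is, by construction, a Type I error.

```python
# A minimal sketch: simulating the Type I error rate when H0 is true.
# Both groups come from the SAME distribution, so every "significant"
# result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 10_000

false_positives = 0
for _ in range(n_trials):
    a = rng.normal(loc=0.0, scale=1.0, size=30)  # group 1
    b = rng.normal(loc=0.0, scale=1.0, size=30)  # group 2, same population
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# Over many trials the false-positive rate converges to alpha (~5%):
# hypothesis testing controls the long-run rate of incorrect verdicts,
# not the "truth" of any single study.
print(f"Type I error rate: {false_positives / n_trials:.3f}")
```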
8. Your Turn
Instructions
In groups of 2-3, work together to brainstorm at least two research questions/topics, & answer each of the following questions (for each research topic):
Questions
1. What is your research question?
2. What would you propose to use as a research design?
3. What would be the null hypothesis?
4. What are two possible alternative/research hypotheses that could be tested?
5. Considering the relationship between Type I & II error, which would be more costly/serious to commit if conducting your particular study?
Be prepared to discuss your answers!
9. What is it?
How do we interpret effect sizes?
How does effect size relate to issues of statistical power, sample size, and error?
EFFECT SIZE
10. What is it?
Generally speaking, the effect size represents the magnitude or strength of the relationship between two variables. For example:
• The proportion of variance in the DV explained by your IV.
• The difference in the mean on your DV among levels of your IV.
• The difference in the proportion of patients with an outcome in the exposed vs. the unexposed groups of your IV.
Two types:
1. Unstandardized effect sizes
2. Standardized effect sizes
11. How do we interpret unstandardized effect sizes?
Unstandardized effect sizes are interpreted in the same metric as your variables.
Example: In a fitness study looking at differences between the sexes, men (M=26.0, SD=3.0) reported significantly higher average BMI than women (M=23.0, SD=2.5), p = .02. What is the unstandardized effect size?
– Mean difference = 3.0 kg/m2
[Figure: “Average BMI Between Men & Women Following Physical Fitness Intervention” (bar chart of average BMI in kg/m2 for men vs. women, pre- and post-intervention)]
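As a quick check of the example above (using only the two means from the slide), the unstandardized effect size is simply the difference in group means, reported in the variables’ own units:

```python
# A minimal sketch of the BMI example: the unstandardized effect size is the
# raw difference in group means, in the variables' own units (kg/m2).
men_mean, women_mean = 26.0, 23.0  # average BMI values from the slide

mean_difference = men_mean - women_mean
print(f"Unstandardized effect size = {mean_difference:.1f} kg/m2")  # 3.0 kg/m2
```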
12. Your Turn
In pairs, calculate & interpret (in sentence format) the unstandardized effect size. Be ready to share your interpretations.
1. Patients admitted to “academic” hospital clinics (M=.50, SD=.40) had lower average 90-day readmissions than patients seen by non-academic clinics (M=1.5, SD=.75), p = .02.
2. A researcher looks at differences in the number of side effects patients had on three different drugs (A, B, and C). Comparison of Drug “A” to Drug “B” shows average side effects of 4 (SD=2.5) and 7 (SD=4.8), respectively, p = .04.
3. An article shows a difference in the average number of COPD-related readmissions before (M=1.5, SD=2.0) and after (M=.05, SD=.90) a patient education intervention, p = .08.
4. An article shows a difference in the average number of COPD-related readmissions before (M=1.5, SD=2.0), after (M=.05, SD=.90), and six months following a patient education intervention (M=0.80, SD=3.0), p = .12.
13. How do we interpret standardized effect sizes?
Two of the most common standardized effect sizes are Risk/Odds Ratios and Pearson r / R2.
14. Interpreting ORs and RRs
• Odds/Risk ratio ABOVE 1.0 = your exposure INCREASES the risk of the event occurring
– For ORs/RRs between 1.00 and 1.99, the risk is increased by (OR – 1) × 100%.
– For ORs/RRs of 2.00 or higher, the risk is increased OR times, though (OR – 1) × 100% still works.
• Example:
– Smoking is found to increase your odds of breast cancer by OR = 1.25. What is the increase in odds?
• You are 25% more likely to have breast cancer if you are a smoker.
– Smoking is found to increase your risk of developing lung cancer by RR = 4.8. What is the increase in risk?
• You are 4.8 times more likely to develop lung cancer if you are a smoker vs. a non-smoker.
15. Interpreting ORs and RRs
• Odds/Risk ratio BELOW 1.0 = your exposure DECREASES the risk of the event occurring
– The risk is decreased by (1 – OR) × 100%.
– Often called a PROTECTIVE effect.
• Example:
– Addition of the new guidelines for pacemaker/ICD interrogation produced an OR for device interrogation of OR = .30 versus the old guidelines. What is the reduction in odds?
• (1 – OR) = (1 – .30) = .70, i.e. a 70% reduction in odds.
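As a small illustration (my own helper function, not part of the lecture), the interpretation rules from the last two slides can be written out in code; the example values are the smoking and pacemaker ORs from above.

```python
# A hedged sketch: turning an odds ratio into the plain-language
# interpretation described on these slides.
def interpret_or(odds_ratio: float) -> str:
    if odds_ratio > 1.0:
        # Exposure increases the odds: report the percent increase.
        return f"{(odds_ratio - 1) * 100:.0f}% higher odds with exposure"
    if odds_ratio < 1.0:
        # Exposure decreases the odds: a "protective" effect.
        return f"{(1 - odds_ratio) * 100:.0f}% lower odds with exposure"
    return "no association (OR = 1.0)"

print(interpret_or(1.25))  # 25% higher odds (smoking/breast cancer example)
print(interpret_or(0.30))  # 70% lower odds (pacemaker guideline example)
```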
16. Your Turn
Instructions
Feel free to make up your own examples or just use, “Odds/Risk of having disease if you have the exposure of interest.” What does the OR/RR say about the strength of the relationship?
Practice
1. OR = 3.00
2. OR = .39
3. RR = 1.50
4. OR = 1.00
5. RR = .22
6. RR = 18.99
7. OR = .78
8. RR = 6.30
17. Interpreting r / R2
Pearson r
• Provides the strength of a linear relationship between exactly two continuous, quantitative variables.
• Can vary between -1 (perfect negative) and 1 (perfect positive).
• Most correlational studies only report r.
R2
• Literally calculated as the square of an r statistic.
• Also known as the coefficient of determination.
• Provides the proportion of shared variance between your IV and DV.
– What’s the range?
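A minimal sketch of both statistics (the data below are made up for illustration): scipy reports r and its p-value, and squaring r gives the proportion of shared variance.

```python
# A hedged sketch: Pearson r and R^2 for two continuous variables.
from scipy import stats

hours_exercised = [1, 2, 3, 4, 5, 6]                     # hypothetical IV
bmi_change = [-0.2, -0.5, -0.9, -1.4, -1.6, -2.1]        # hypothetical DV

r, p_value = stats.pearsonr(hours_exercised, bmi_change)
r_squared = r ** 2  # proportion of shared variance, ranging from 0 to 1

print(f"r = {r:.2f}, R^2 = {r_squared:.2f}, p = {p_value:.3f}")
```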
19. How does effect size relate to issues of statistical power, sample size, and error?
Effect size vs. statistical power, sample size, and error:
• As effect size increases, statistical power also increases. This means that (1) you need a smaller sample size, and (2) you have a lower chance of making a Type II error (i.e. a “miss”). See the sketch below.
So, when possible, measure for a large effect size!
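A hedged sketch of that tradeoff using statsmodels’ power tools (the effect sizes here are Cohen’s d, a standardized mean difference, which the slides have not introduced by name): holding alpha at .05 and power at 80%, the required sample size per group shrinks as the effect size grows.

```python
# A hedged sketch: required sample size per group for a two-sample t-test
# at alpha = .05 and 80% power, across small/medium/large effect sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # conventional small, medium, large Cohen's d
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: ~{n_per_group:.0f} participants per group")
# d = 0.2 needs ~394 per group; d = 0.8 needs only ~26.
```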
20. An OR/RR is only as important as the confidence interval that comes with it!
What are they?
How do you interpret them?
How do they affect our conclusions?
CONFIDENCE INTERVALS
21. What are they?
• Confidence intervals provide, as the name suggests, the confidence in a particular inferential statistic.
• They provide the range of values within which we are confident the true population parameter (e.g. mean, proportion, etc.) exists.
• Usually set at 95%.
• They are calculated using:
– Standard error of measurement (Sm or SE)
– Point estimate for your sample (e.g. t statistic)
– Degrees of freedom for the sample
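A minimal sketch (with a made-up sample) of how those ingredients combine for the mean of a continuous variable: point estimate, standard error, and the t critical value for the sample’s degrees of freedom.

```python
# A hedged sketch: a 95% confidence interval for a sample mean.
import numpy as np
from scipy import stats

bmi = np.array([24.1, 26.3, 23.8, 27.0, 25.5, 22.9, 26.8, 24.4])  # hypothetical

mean = bmi.mean()                 # point estimate
se = stats.sem(bmi)               # standard error of the mean
df = len(bmi) - 1                 # degrees of freedom
t_crit = stats.t.ppf(0.975, df)   # two-sided 95% critical value

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"Mean = {mean:.1f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```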
22. What are they? An OR/RR example
95% confidence intervals are added to any OR/RR calculation to provide an estimate of the accuracy of the estimation.
• Size matters!
– Wide CI = weaker inference
– Narrow CI = stronger inference
– CI crosses 1.0 = non-significant
• Any 95% CI can instantly tell us:
1. Sample size
2. Accuracy of estimation
3. Statistical significance
[Figure: example ORs plotted with their 95% CIs against the 1.0 reference line]
23. Interpreting 95% Confidence Intervals
95% CI of an Odds or Risk Ratio
• What you read…
– OR = 4.5 (95% CI = 2.8 – 6.1)
• What you interpret…
– Lower bound: OR = 2.8
– Upper bound: OR = 6.1
• How you interpret…
– “We are 95% confident that the true odds of disease for exposed vs. unexposed lies between 2.8 and 6.1.”
Your Turn: Interpret these 95% CIs
1. OR 2.4 (95% CI 1.7 - 3.3)
2. OR 6.7 (95% CI 1.4 - 107.2)
3. OR 1.2 (95% CI .147 - 1.97)
4. OR .37 (95% CI .22 - .56)
5. OR .57 (95% CI .12 - .99)
6. OR .78 (95% CI .36 – 1.65)
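For completeness, one common way such an interval is computed (the log/Woolf method, which the slides do not show) from a hypothetical 2×2 table:

```python
# A hedged sketch: 95% CI for an odds ratio via the log (Woolf) method.
import math

a, b = 40, 60   # exposed: cases, non-cases (hypothetical counts)
c, d = 15, 85   # unexposed: cases, non-cases (hypothetical counts)

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # SE of ln(OR)

lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} - {upper:.2f})")
# This interval excludes 1.0, so the association is statistically significant.
```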
24. What counts as “statistically significant”?
Weaknesses of the p-value
The p-value fallacy
STATISTICAL SIGNIFICANCE
25. What counts as “statistically significant”?
• To be considered statistically significant, the probability of obtaining a value of the test statistic (e.g. t, z, F, or χ2) must be smaller than the probability of committing a Type I error.
• In other words, the probability (p) must be less than (<) what you have chosen for your alpha value (.05).
– So, in most cases we conclude that a relationship is statistically significant if the test returns p < .05.
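In code, that decision rule is a single comparison of the test’s p-value against alpha (the data below are purely illustrative):

```python
# A minimal sketch: comparing a test's p-value to the chosen alpha.
from scipy import stats

exposed = [1.2, 1.8, 2.5, 2.1, 1.9, 2.7]    # hypothetical outcome values
unexposed = [1.0, 1.1, 1.6, 1.3, 0.9, 1.4]

t_stat, p = stats.ttest_ind(exposed, unexposed)
alpha = 0.05

verdict = "statistically significant" if p < alpha else "not significant"
print(f"t = {t_stat:.2f}, p = {p:.3f} -> {verdict}")
```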
26. Interpretation & Practice
• If a statistically significant relationship is found, then we conclude that the observed relationship is too large to have occurred by chance alone.
• Which of the following are statistically significant results?
1. t(34)=5.89, p = .002
2. F(3, 285)=1.09, p = .101
3. χ2(4)=18.78, p = .04
4. t(68) = 4.25, p = .05
27. Weaknesses of p-values
• Not truly compatible with hypothesis testing
– Absence of evidence vs. evidence of absence
• Never meant to be the sole indicator of significance
– Average knowledge of statistical interpretation in evidence-based professions
• No consideration of effect size
• What influences p-values?
– Sample size
– Chance
– Effect size
– Statistical power
28. The “p-value fallacy”
P-values have become the “have your cake and eat it too” of the statistical world.
• You get the supposed accuracy of a single study (short term) while being able to simultaneously avoid errors in the long term.
• The fallacy comes from misinterpreting p-values as absolute indicators of the strength of a relationship; that is, seeing p = .03 as more significant than p = .04.
29. How to use multiple sources to become a better consumer of Epidemiologic Evidence
PUTTING IT ALL TOGETHER
30. Going beyond the p-value
• Measures of effect size provide a far more vivid description of the magnitude of the relationship.
– An OR of 4.30 is stronger than an OR of 1.50.
– A mean difference of 35 pts is larger than a mean difference of 20 pts.
– 65% of the variance is more than 20% of the variance.
• The 95% CI provides far more information on the accuracy of the inference.
– Which is more accurate?
• OR = 2.5 (95% CI = 1.2 – 10.0) vs. OR = 2.5 (95% CI = 1.2 – 3.1)
31. When reading an article…
Always consider:
1. What is the research question? Have the researchers used the correct null & alternative hypotheses?
2. How large is the…
− Sample? Subgroup? Etc.
− Effect size? (standardized or unstandardized)
− Confidence interval?
3. Finally, what is the p-value?
32. Just because a finding is not significant does not mean that it is not meaningful. You should always consider the effect size and the context of the research when making a decision about whether or not any finding is clinically relevant.
Editor’s note
Alternatively, the second example could be interpreted as: “Smoking increases your risk of lung cancer by 380% vs. non-smoking”