The Legend of the P Value

ECONOMICS, EDUCATION, AND HEALTH SYSTEMS RESEARCH
SECTION EDITOR
RONALD D. MILLER
EDITORIAL

The Legend of the P Value
Zeev N. Kain, MD, MBA
Center for the Advancement of Perioperative Health and Department of Anesthesiology & Pediatrics & Child Psychiatry,
Yale University School of Medicine, New Haven, Connecticut

Supported, in part, by National Institutes of Health grants NICHD, R01HD37007–02.
Accepted for publication June 16, 2005.
Address correspondence and reprint requests to Zeev N. Kain, MD, MBA, Department of Anesthesiology, Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06510. Address e-mail to zeev.kain@yale.edu.
DOI: 10.1213/01.ANE.0000181331.59738.66
©2005 by the International Anesthesia Research Society. Anesth Analg 2005;101:1454–6. 0003-2999/05

Although there is a growing body of literature criticizing the use of mere statistical significance as a measure of clinical impact, much of this literature remains out of the purview of the discipline of anesthesiology. Currently, the magical boundary of P < 0.05 is a major factor in determining whether a manuscript will be accepted for publication or a research grant will be funded. Similarly, the Food and Drug Administration does not currently consider the magnitude of an advantage that a new drug shows over placebo. As long as the difference is statistically significant, a drug can be advertised in the United States as “effective” whether clinical trials proved it to be 10% or 200% more effective than placebo. We submit that if a treatment is to be useful to our patients, it is not enough for treatment effects to be statistically significant; they also need to be large enough to be clinically meaningful.

Unfortunately, physicians often misinterpret statistically significant results as showing clinical significance as well. One should realize, however, that with a large sample it is quite possible to have a statistically significant result between groups despite a minimal impact of treatment (i.e., small effect size). Also, study outcomes with lower P values are typically misinterpreted by physicians as having stronger effects than those with higher P values. That is, most clinicians agree that a result with a P = 0.002 has a much greater treatment effect than a result of P = 0.045. Although this is true if the sample size is the same in both studies, it is not necessarily true if the sample size is larger in the study with the smaller P value. This is of particular concern when one realizes that most pharmaceutically funded studies have very large sample sizes and effect sizes are typically not reported in these types of studies. In the following editorial I highlight some of the issues related to this complex problem. Please note that a detailed discussion of the underlying statistics involved in this topic is beyond the scope of this editorial.

When examining the report of a clinical trial investigating a new treatment, clinicians should be interested in answering the following three basic questions:

1. Could the findings of the clinical trial be solely a result of a chance occurrence? (i.e., statistical significance)
2. How large is the difference between the primary end-points of the study groups? (i.e., impact of treatment, effect size)
3. Is the difference of primary end-points between groups meaningful to a patient? (i.e., clinical significance)

It was Sir Ronald A. Fisher, an extraordinarily influential British statistician, who first suggested the use of a boundary to accept or reject a null hypothesis, and he arbitrarily set this boundary at P < 0.05, where “P” stands for probability related to chance (1,2). That is, the level of statistical significance as defined by Fisher in 1925 and as used today refers to the probability that a difference at least as large as the one observed between two groups would have occurred solely by chance (i.e., probability of 5 in 100 is reported as P = 0.05). Fisher’s emphasis on significance testing and the arbitrary boundary of P < 0.05 has been widely criticized over the past 80 yr. This criticism was based on the rationale that focusing on the P value does not take into account the size and clinical significance of the observed effect. That is, a small effect in a study with a large sample size can have the same P value as a large effect in a study with a small sample size. Also, the P value is commonly misinterpreted when there are multiple comparisons, in which case a traditional level of statistical significance of P < 0.05 is no longer valid. Fisher himself indicated some 25 yr after his initial publication that “If P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it is below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05. . .” (3). Indeed, this
issue has been addressed in multiple recent review articles and editorials in the general medical and psychological literature (4–8).

In an attempt to address some of the limitations of the P value, the use of confidence intervals (CI) has been advocated by some clinicians (9). One should realize, however, that these two definitions of statistical significance are essentially reciprocal (10). That is, getting a P < 0.05 is the same as having a 95% CI for the difference that does not overlap zero. CIs can also, however, be used to estimate the size of difference between groups in addition to merely indicating the existence or absence of statistical significance (11). This latter approach, however, is not widely used in the medical and psychological literature, and today CIs are mostly used as surrogates for the hypothesis test rather than considering the full range of likely effect size.

The group of statistics called “effect sizes” designate indices that measure the magnitude of difference between groups, controlling for variation within the groups; effect sizes can be thought of as a standardized difference. In other words, although a P value denotes whether the difference between two groups in a particular study is likely to occur solely by chance, the effect size quantifies the amount of difference between the two groups. Quantification of effect size does not rely on sample size but instead relies on the strength of the intervention. There are a number of different types of effect sizes and a description of these various types and formulae is beyond the scope of this editorial. We refer the interested reader to review articles that describe the various types of effect sizes and their calculation methodology (12,13). Effect sizes of the d type are the most commonly used in the medical literature, as they are primarily used to compare two treatment groups. The d type effect size is defined as the magnitude of difference between two means, divided by the SD [(Mean of control group − Mean of treatment group)/SD of the control group]. Thus, the d effect size is dependent on variation within the control group and the differences between the control and intervention groups. Values of the d type effect sizes range from −∞ to +∞, where zero denotes no effect and values less than or more than zero are treated as absolute values when interpreting magnitude. Conventionally, d type effect sizes that are near 0.20 are interpreted as “small,” effect sizes near 0.50 are considered “medium,” and effect sizes in the range of 0.80 are considered “large” (14). However, interpretation of the magnitude of an effect size depends on the type of data gathered and the discipline involved. Effect sizes of another type, the risk potency type, include likelihood ratios such as odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics and it may be helpful to realize that likelihood statistics are a type of effect size.

Clinicians should be cautioned not to interpret magnitude of change (effect size) as an indication of clinical significance. The clinical significance of a treatment should be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant, depending on what is being studied. Indeed, there is a growing recognition that traditional methods used, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant changes. Although there is little consensus about the criteria for these efficacy standards, the most prominent definitions of clinically significant change include: 1) treated patients make a statistically reliable improvement in the change scores; 2) treated patients are empirically indistinguishable from a normal population after treatment; or 3) changes of at least one SD. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points (15). Using this method, change is considered reliable, or unlikely to be the product of measurement error, if the reliable change index (RCI) is more than 1.96. That is, when the individual's RCI is more than 1.96, one can reasonably assume that the individual has improved.

Unfortunately, most of the methods above are difficult to adopt in the perioperative arena, as comparison with a normal population is not an option in most trials, and the RCI, which controls for statistical issues involving the assessment tool, is a somewhat complicated and controversial technique. Thus, clinical significance in the perioperative arena may be best assessed by posing a particular question such as “is an 8.5% reduction in intraoperative bleeding clinically significant?” or “how many SD does this change represent?” Obviously, both of these questions have a subjective component in them and although it is traditionally agreed that at least a 1-SD change is generally needed for clinical significance, this boundary has no scientific underpinning. The validity of a clinical cutoff for these last two methods can be improved by establishing external validity (e.g., patient perspective) for the decision. For example, Flor et al. (16) have conducted a large meta-analysis that was aimed at evaluating the effectiveness of multidisciplinary rehabilitation for chronic pain. The investigators found that pain among the patients who received the intervention was indeed reduced by 25%. This reduction was certainly statistically significant and had an effect size of 0.7. Colvin et al. (17), however, reported earlier that patients would consider only a 50% improvement in their pain levels as a treatment “success.” Thus, in this example, a reduction of 25% in pain scores may be statistically, but not clinically, significant. Clearly this is a developing area that warrants further discussion.
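
The relationship among sample size, P value, CI, and effect size described in this editorial can be illustrated with a short numerical sketch. The Python code below is not part of the original editorial: the function names, the hypothetical pain scores, and the use of a large-sample z test with equal group sizes and a common SD are all assumptions made for illustration. It computes the d type effect size as defined in the text and shows how the P value and the 95% CI change when only the number of patients per group changes.

```python
import math


def cohens_d(mean_control, mean_treatment, sd_control):
    """d type effect size as defined in the text:
    (mean of control - mean of treatment) / SD of the control group."""
    return (mean_control - mean_treatment) / sd_control


def standard_error(sd, n_per_group):
    """Standard error of the difference between two means, assuming
    equal group sizes and a common SD (an assumption of this sketch)."""
    return sd * math.sqrt(2.0 / n_per_group)


def two_sided_p(mean_control, mean_treatment, sd, n_per_group):
    """Two-sided P value from a large-sample z test (normal approximation)."""
    z = (mean_control - mean_treatment) / standard_error(sd, n_per_group)
    return math.erfc(abs(z) / math.sqrt(2.0))


def ci95(mean_control, mean_treatment, sd, n_per_group):
    """Approximate 95% CI for the difference between the two means."""
    diff = mean_control - mean_treatment
    half_width = 1.96 * standard_error(sd, n_per_group)
    return diff - half_width, diff + half_width


# Hypothetical data: treatment lowers the mean score by 1 point, SD = 10.
control_mean, treatment_mean, sd = 50.0, 49.0, 10.0
d = cohens_d(control_mean, treatment_mean, sd)

for n in (50, 500, 5000):
    p = two_sided_p(control_mean, treatment_mean, sd, n)
    low, high = ci95(control_mean, treatment_mean, sd, n)
    print(f"n per group = {n:5d}  d = {d:.2f}  P = {p:.4g}  "
          f"95% CI = ({low:.2f}, {high:.2f})")
```

Under these assumptions the effect size stays at 0.10, below the conventional value for a small effect, whereas the P value moves from well above 0.05 with 50 patients per group to far below it with 5000 per group, and the 95% CI excludes zero only in the latter case.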

In conclusion, we suggest that reporting of perioperative medical research should go beyond reporting results consisting primarily of descriptive and statistically significant or nonsignificant findings. The interpretation of findings should occur in the context of the magnitude of change that occurred and the clinical significance of the findings.

References
1. Fisher RA. Statistical methods for research workers, 1st ed. Edinburgh: Oliver and Boyd, 1925. Reprinted by Oxford University Press.
2. Fisher RA. Design of experiments. 1st ed. Edinburgh: Oliver and Boyd, 1935. Reprinted by Oxford University Press.
3. Fisher RA. Statistical methods for research workers. London: Oliver and Boyd, 1950:80.
4. Borenstein M. Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 1997;78:5–11.
5. Matthey S. P < 0.05: but is it clinically significant? Practical examples for clinicians. Behav Change 1998;15:140–6.
6. Cummings P, Rivara FP. Reporting statistical information in medical journal articles. Arch Pediatr Adolesc Med 2003;157:321–4.
7. Greenstein G. Clinical versus statistical significance as they relate to the efficacy of periodontal therapy. J Am Dent Assoc 2003;134:1168–70.
8. Sterne JAC, Smith GD, Cox DR. Sifting the evidence: what’s wrong with significance tests? Another comment on the role of statistical methods. BMJ 2001;322:226–31.
9. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429–35.
10. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.
11. Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746–50.
12. Kirk R. Practical significance: A concept whose time has come. Educ Psychol Meas 1996;56:746–59.
13. Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exper Educ 1993;61:334–349.
14. Cohen J. Statistical power analysis for the behavioral sciences, 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum, 1988.
15. Jacobson NS, Truax P. Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. J Consult Clinic Psych 1991;59:12–9.
16. Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Clin J Pain 1992;49:221–30.
17. Colvin DF, Bettinger R, Knapp R, et al. Characteristics of patients with chronic pain. South Med J 1980;73:1020–3.
