1) The document discusses issues with relying solely on statistical significance (p-values) to determine clinical significance. While p-values indicate if results could be due to chance, they do not provide information on the size of the treatment effect or its clinical meaningfulness.
2) Effect sizes and confidence intervals provide a measure of the magnitude of the treatment effect but do not necessarily indicate clinical significance on their own.
3) The document argues that clinical significance should be determined based on external standards from patients and clinicians regarding what would constitute a meaningful improvement, rather than solely on statistical measures.
2. ANESTH ANALG EDITORIAL 1455
2005;101:1454 –6
issue has been addressed in multiple recent review articles and editorials in the general medical and psychological literature (4–8).

In an attempt to address some of the limitations of the P value, the use of confidence intervals (CI) has been advocated by some clinicians (9). One should realize, however, that these two definitions of statistical significance are essentially reciprocal (10). That is, getting a P < 0.05 is the same as having a 95% CI for the difference that does not include zero. CIs can also, however, be used to estimate the size of the difference between groups in addition to merely indicating the existence or absence of statistical significance (11). This latter approach, however, is not widely used in the medical and psychological literature, and today CIs are mostly used as surrogates for the hypothesis test rather than to consider the full range of likely effect sizes.

The group of statistics called "effect sizes" designates indices that measure the magnitude of difference between groups, controlling for variation within the groups; effect sizes can be thought of as a standardized difference. In other words, although a P value denotes whether the difference between two groups in a particular study is likely to occur solely by chance, the effect size quantifies the amount of difference between the two groups. Quantification of effect size does not rely on sample size but instead relies on the strength of the intervention. There are a number of different types of effect sizes, and a description of these various types and formulae is beyond the scope of this editorial. We refer the interested reader to review articles that describe the various types of effect sizes and their calculation methodology (12,13). Effect sizes of the d type are the most commonly used in the medical literature, as they are primarily used to compare two treatment groups. The d type effect size is defined as the magnitude of difference between two means, divided by the SD [(mean of control group − mean of treatment group)/SD of the control group]. Thus, the d effect size is dependent on variation within the control group and the difference between the control and intervention groups. Values of the d type effect sizes range from −∞ to +∞, where zero denotes no effect and values less than or more than zero are treated as absolute values when interpreting magnitude. Conventionally, d type effect sizes that are near 0.20 are interpreted as "small," effect sizes near 0.50 are considered "medium," and effect sizes in the range of 0.80 are considered "large" (14). However, interpretation of the magnitude of an effect size depends on the type of data gathered and the discipline involved. Effect sizes of another type, the risk potency type, include likelihood ratios such as odds ratio, risk ratio, risk difference, and relative risk reduction. Clinicians are probably more familiar with these less abstract statistics, and it may be helpful to realize that likelihood statistics are a type of effect size.

Clinicians should be cautioned not to interpret magnitude of change (effect size) as an indication of clinical significance. The clinical significance of a treatment should be based on external standards provided by patients and clinicians. That is, a small effect size may still be clinically significant and, likewise, a large effect size may not be clinically significant, depending on what is being studied. Indeed, there is a growing recognition that traditional methods, such as statistical significance tests and effect sizes, should be supplemented with methods for determining clinically significant changes. Although there is little consensus about the criteria for these efficacy standards, the most prominent definitions of clinically significant change include: 1) treated patients make a statistically reliable improvement in the change scores; 2) treated patients are empirically indistinguishable from a normal population after treatment; or 3) changes of at least one SD. The most frequently used method for evaluating the reliability of change scores is the Jacobson-Truax method in combination with clinical cutoff points (15). Using this method, change is considered reliable, or unlikely to be the product of measurement error, if the reliable change index (RCI) is more than 1.96. That is, when an individual's RCI exceeds 1.96, one can reasonably assume that the individual has improved.

Unfortunately, most of the methods above are difficult to adopt in the perioperative arena, as comparison with a normal population is not an option in most trials, and the RCI, which controls for statistical issues involving the assessment tool, is a somewhat complicated and controversial technique. Thus, clinical significance in the perioperative arena may be best assessed by posing a particular question such as "is an 8.5% reduction in intraoperative bleeding clinically significant?" or "how many SD does this change represent?" Obviously, both of these questions have a subjective component, and although it is traditionally agreed that at least a 1-SD change is generally needed for clinical significance, this boundary has no scientific underpinning. The validity of a clinical cutoff for these last two methods can be improved by establishing external validity (e.g., patient perspective) for the decision. For example, Flor et al. (16) conducted a large meta-analysis aimed at evaluating the effectiveness of multidisciplinary rehabilitation for chronic pain. The investigators found that pain among the patients who received the intervention was indeed reduced by 25%. This reduction was certainly statistically significant and had an effect size of 0.7. Colvin et al. (17), however, reported earlier that patients would consider only a 50% improvement in their pain levels as a treatment "success." Thus, in this example, a reduction of 25% in pain scores may be statistically, but not clinically, significant. Clearly this is a developing area that warrants further discussion.
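
The three calculations discussed on this page (the d type effect size, the reliable change index, and a comparison against a patient-defined standard) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any cited study: the function names and the sample numbers are hypothetical, and the RCI assumes the usual Jacobson-Truax form in which the change score is divided by the standard error of the difference, computed from a baseline SD and a test-retest reliability that the user must supply. Note also that dividing by the control-group SD, as in the definition given in the text, is often called Glass's delta; Cohen's d is more commonly computed with a pooled SD.

```python
import math
from statistics import mean, stdev


def d_effect_size(control, treatment):
    """(mean of control - mean of treatment) / SD of control, as defined in the text."""
    return (mean(control) - mean(treatment)) / stdev(control)


def label_magnitude(d):
    """Nearest conventional benchmark (0.20 small, 0.50 medium, 0.80 large) for |d|."""
    benchmarks = {"small": 0.20, "medium": 0.50, "large": 0.80}
    return min(benchmarks, key=lambda name: abs(abs(d) - benchmarks[name]))


def reliable_change_index(pre, post, sd_pre, reliability):
    """Assumed Jacobson-Truax form: change / standard error of the difference."""
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0 * se_measurement ** 2)
    return (post - pre) / s_diff


def is_clinically_significant(percent_reduction, patient_threshold_percent):
    """External standard: compare the observed change with what patients call a success."""
    return percent_reduction >= patient_threshold_percent


if __name__ == "__main__":
    # Hypothetical pain scores, invented purely for illustration.
    control = [6.0, 7.0, 8.0, 5.0, 9.0]
    treatment = [5.0, 6.0, 7.0, 4.0, 8.0]
    d = d_effect_size(control, treatment)
    print("d =", round(d, 3), "->", label_magnitude(d))

    # Hypothetical single patient: score falls from 40 to 30 on a scale
    # with baseline SD 7.5 and test-retest reliability 0.80.
    rci = reliable_change_index(40.0, 30.0, 7.5, 0.80)
    print("RCI =", round(rci, 3), "reliable:", abs(rci) > 1.96)

    # The example from the text: a 25% reduction against a 50% patient standard.
    print("clinically significant:", is_clinically_significant(25.0, 50.0))
```

The last function is deliberately trivial: the point made in the text is that the threshold itself must come from patients and clinicians, not from the statistics.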
In conclusion, we suggest that reporting of perioperative medical research should continue beyond reporting results consisting primarily of descriptive and statistically significant or nonsignificant findings. The interpretation of findings should occur in the context of the magnitude of change that occurred and the clinical significance of the findings.

References
1. Fisher RA. Statistical methods for research workers, 1st ed. Edinburgh: Oliver and Boyd, 1925. Reprinted by Oxford University Press.
2. Fisher RA. Design of experiments. 1st ed. Edinburgh: Oliver and Boyd, 1935. Reprinted by Oxford University Press.
3. Fisher RA. Statistical methods for research workers. London: Oliver and Boyd, 1950:80.
4. Borenstein M. Hypothesis testing and effect size estimation in clinical trials. Ann Allergy Asthma Immunol 1997;78:5–11.
5. Matthey S. P < 0.05: but is it clinically significant? Practical examples for clinicians. Behav Change 1998;15:140–6.
6. Cummings P, Rivara FP. Reporting statistical information in medical journal articles. Arch Pediatr Adolesc Med 2003;157:321–4.
7. Greenstein G. Clinical versus statistical significance as they relate to the efficacy of periodontal therapy. J Am Dent Assoc 2003;134:1168–70.
8. Sterne JAC, Smith GD, Cox DR. Sifting the evidence: what's wrong with significance tests? Another comment on the role of statistical methods. BMJ 2001;322:226–31.
9. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med 1986;105:429–35.
10. Feinstein AR. P-values and confidence intervals: two sides of the same unsatisfactory coin. J Clin Epidemiol 1998;51:355–60.
11. Gardner MG, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. BMJ 1986;292:746–50.
12. Kirk R. Practical significance: A concept whose time has come. Educ Psychol Meas 1996;56:746–59.
13. Snyder P, Lawson S. Evaluating results using corrected and uncorrected effect size estimates. J Exper Educ 1993;61:334–349.
14. Cohen J. Statistical power analysis for the behavioral sciences, 2nd ed. Mahwah, New Jersey: Lawrence Erlbaum, 1988.
15. Jacobson NS, Truax P. Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. J Consult Clinic Psych 1991;59:12–9.
16. Flor H, Fydrich T, Turk DC. Efficacy of multidisciplinary pain treatment centers: a meta-analytic review. Clin J Pain 1992;49:221–30.
17. Colvin DF, Bettinger R, Knapp R, et al. Characteristics of patients with chronic pain. South Med J 1980;73:1020–3.