Poor reproducibility of experimental results has become a systemic problem in biomedicine. One of its main causes is inadequate statistical analysis. Statistical analysis should be comprehensive, harmonizing statistical evidence and predictions as well as frequentist and Bayesian approaches. It is insufficient to carry out null hypothesis significance testing (NHST) and report P-values: statistical significance does not mean clinical importance.
Effect sizes with confidence and prediction intervals should be reported. Experiments and/or observations should be repeated many times and their agreement investigated.
The best approach is to repeat the experiments independently in different laboratories (in different countries).
1. International Life Sciences Workshop
“Decision-Making in Biomedical Science – Meet Experts”
September 12 – 16 | 2014
Potsdam | Germany
Harmonizing statistical evidences and
predictions
Nikita N. Khromov-Borisov
Pavlov First Saint Petersburg State Medical University
Saint Petersburg, Russia
Nikita.KhromovBorisov@gmail.com
+7 952-204-89-49; +7 921-449-29-05
http://independent.academia.edu/NikitaKhromovBorisov
https://www.researchgate.net/profile/Nikita_Khromov-Borisov?ev=hdr_xprf
1
2. Slides are freely available to all
Nikita N. Khromov-Borisov
Department of Physics, Mathematics and Informatics
Pavlov First Saint Petersburg State Medical University
Nikita.KhromovBorisov@gmail.com
+7-952-204-89-49; +7-921-449-29-05
http://independent.academia.edu/NikitaKhromovBorisov
2
3. The best way to discuss scientific issues is to
discuss them in a foreign language
Max Ludwig Henning Delbrück,
(September 4, 1906 – March 9, 1981)
Piotr Slonimski
(November 9, 1922 – April 25, 2009)
3
4. Second hand teaching
• The History of Science has suffered greatly from the use by
teachers of second-hand material, and the consequent
obliteration of the circumstances and the intellectual
atmosphere in which the great discoveries of the past were
made.
• A first-hand study is always instructive, and often . . . full of
surprises.
• Ronald A. Fisher, 1955
• Cited by: Ziliak S.T., McCloskey D.N. The Cult of Statistical
Significance: How the Standard Error Costs Us Jobs, Justice, and
Lives. The University of Michigan Press, Ann Arbor, 2008, 321 pp.
• http://stephentziliak.com/
4
6. The essences of science are
replication and reproducibility
• The essence of science is replication:
• a scientist should always be concerned about what would
happen if he or another scientist were to repeat his
experiment.
• Guttman L. What is not what in statistics. The Statistician,
1977; 26(2): 81-107.
• Scientists have elaborated a method of determining the
validity of their results.
• They have learned to ask the question: are they reproducible?
• Scherr G.H. Irreproducible Science: Editor’s Introduction.
• In: The Best of the Journal of Irreproducible Results,
Workman Publishing, New York, 1983.
• Reproducibility is like the ghost that will always come back
to haunt you.
• http://datapede.blogspot.ru/2014/03/part-1z-p-value-surviving-mosquito.html
6
7. Loscalzo J. Irreproducible Experimental Results:
Causes, (Mis)interpretations, and Consequences.
Circulation, 2012; 125: 1211-1214.
• In Science what is relevant is reproducible results.
• If an initial observation is found to be reproducible,
then it must be true.
• If an initial observation is found not to be
reproducible, then it must be false.
• Many readers of scientific journals—especially of
higher-impact journals—assume that if a study is of
sufficient quality to pass the scrutiny of rigorous
reviewers, it must be true.
• This assumption is based on the inferred equivalence
of reproducibility and truth.
7
8. • Long ago Fisher . . . recognised that . . . solid
knowledge came from a demonstrated ability to
repeat experiments . . .
• This is unhappy for the investigator who would
like to settle things once and for all, but
consistent with the best accounts . . . of the
scientific method . . .
• Tukey J.W. The philosophy of multiple
comparisons. Statistical Science, 1991; 6: 100-
116.
8
9. Tukey J.W. Analyzing data: Sanctification
or detective work? American Psychologist,
1969; 24: 83–91.
• Nothing learned is certain.
• We learn by taking chances.
• Every modern learning theorist expects learning to be by trial,
with some errors.
• This is as true for science as for the individual.
• Confirmation comes from repetition.
• Repetition is the basis for judging variability, significance and
confidence.
• Repetition of results, each significant, is the basis, according to
Fisher, of scientific truth.
• Certainty is an illusion.
• As an illusion, certainty can be wasteful, as well as misleading.
• Data analysis needs to be both exploratory and confirmatory.
9
10. From the history of epidemiological studies: Risk factors for cancer
[Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My!
JNCI, 1992; 84(24):1863]
• Using an electric razor: increased risk of developing leukemia.
• Distal forearm fractures in women: reduction in overall cancer
incidence, breast cancer incidence, and incidence of tumors.
• Fluorescent lighting: melanoma in males but not in females.
• Allergies and cancer: at first an inverse relationship; later, risks of
several types of cancer were elevated. However, ovarian cancer risk
decreased with increasing numbers of allergies.
• Breeding reindeer: Swedish Lapps who breed reindeer had decreased
risks for cancers of the colon, female breast, male genital tract,
kidneys and respiratory system, and for lymphomas. However,
increased risk for stomach cancer.
10
11. From the history of epidemiological studies: Risk factors for cancer
[Jenks S., Volkers N. Razors and Refrigerators and Reindeer — Oh My! JNCI,
1992; 84(24): 1863]
• Waiters in Norway: Decreased risk of stomach cancer but excess risks of
cancers of the liver, rectum, upper respiratory and digestive tracts, and
lung. Higher mortality rate from lung cancer.
• Owning a pet bird: Fourfold increase in lung cancer risk among pigeon
fanciers (more hazardous than living with a smoker). Owners of budgies,
canaries, finches, or parrots were OK.
• Height: Lower risks for some cancers in short men, particularly colorectal
cancer, and lower risks for this cancer and for breast cancer in short
women. But being tall may confer some advantage for certain cancers
(esophageal, endometrial and cervical), while tall men have only a
slightly elevated risk for prostate, kidney and colon cancers.
• Refrigerators: seem to protect everyone from stomach cancer.
11
12. • An extensive list of curious and questionable
medical observations about various risk
factors was given in:
• Buchanan A.V., Weiss K.M., Fullerton S.M.
Dissecting complex disease: the quest for the
Philosopher’s Stone? International Journal of
Epidemiology, 2006; 35: 562–571.
12
13. Table of irreproducible results?
• Hormone replacement therapy and heart
disease
• Hormone replacement therapy and cancer
• Stress and stomach ulcers
• Annual physical checkups and disease
prevention
• Behavioural disorders and their cause
• Diagnostic mammography and cancer
prevention
• Breast self-exam and cancer prevention
• Echinacea and colds
• Vitamin C and colds
• Baby aspirin and heart disease prevention
• Dietary salt and hypertension
• Dietary fat and heart disease
• Dietary calcium and bone strength
• Obesity and disease
• Dietary fibre and colon cancer
• The food pyramid and nutrient RDAs
• Cholesterol and heart disease
• Homocysteine and heart disease
• Inflammation and heart disease
• Olive oil and breast cancer
• Fidgeting and obesity
• Sun and cancer
• Mercury and autism
• Obstetric practice and schizophrenia
• Mothering patterns and schizophrenia
• Anything else and schizophrenia
• Red wine (but not white, and not grape juice)
and heart disease
• Syphilis and genes
• Mothering patterns and autism
• Breast feeding and asthma
• Bottle feeding and asthma
• Anything and asthma
• Power transformers and leukaemia
• Nuclear power plants and leukaemia
• Cell phones and brain tumours
• Vitamin antioxidants and cancer, aging
• HMOs and reduced health care cost
• HMOs and healthier Americans
• Genes and you name it!
13
14. ‘Blood group mythology’: myths about AB0
• The human AB0 blood group system can serve as a classic example of
unconfirmed associations with different conditions.
• Several incredible phenomena have been reported:
• Persons with A have more severe hangovers;
• Persons with B defecate the most;
• Persons with 0 have healthier teeth;
• Military personnel with 0 are spineless and those with B are more impulsive;
• Persons with B are more prone to crime;
• A strong connection between AB0 and nutrition;
• Persons with A2 have the highest IQ;
• A is significantly more common among members of the higher socio-economic
groups.
• None of these associations has been reproduced, and they are virtually forgotten.
14
15. • Large companies in Japan still use blood types
when advertising for, or evaluating, job
applicants.
• George Garratty
• Association of Blood Groups and Disease: Do
Blood Group Antigens and Antibodies Have a
Biological Role?
• History and Philosophy of the Life Sciences,
1996; Vol. 18, No. 3, The First Genetic Marker, p.
321-344.
15
16. • Only the associations between AB0 blood
groups and malignant neoplasms,
thrombosis, peptic ulcers, bleeding, and bacterial
and viral infections are still regarded as
statistically “proven“.
• Alas, these associations have no clinical
(practical) importance due to low values of
the odds ratio (OR), which do not exceed
OR = 1.5.
16
17. Associations between AB0 blood groups and diseases,
which are still considered to be statistically “proven”

Medical condition       A > 0   0 > A   B/AB > A/0   OR
Malignancy                X                          1.2–1.3
Thrombosis                X
Peptic ulcers                     X                  1.2–1.4
Bleeding                          X                  1.5
E. coli / Salmonella                      X

Note that here we meet the extremely important issue of the clinical (or
any other practical) importance (significance) of the observed
associations. Here clinical importance is demonstrated with one
of the measures of effect size, the odds ratio (OR).
17
18. Begley C.G., Ellis L.M. Raise standards for preclinical
cancer research. Nature, 2012; 483: 531-533.
• Recently Glenn Begley, former vice president of the
well-known biotech company Amgen, and his colleague
Lee Ellis published the results of their efforts to replicate findings
from recent publications in the clinical oncology literature.
• The data were disturbing.
• Of 53 papers, only 6 (11%) were reproducible.
• Begley and Ellis state that the
• poor reproducibility of the results becomes a systemic problem of
modern science.
• One study, cited more than 1,900 times within a short
period, could later not be reproduced even by the
authors themselves.
18
19. Increasing replication of un-reproducibility in science
• Gautam Naik: Scientists' Elusive Goal: Reproducing Study
Results. The Wall Street Journal, December 2, 2011.
• This is one of medicine’s dirty secrets:
• Most results, including those that appear in top-flight
peer-reviewed journals, can’t be reproduced.
19
20. Macleod M.R., Michie S., Roberts I., Dirnagl U., Chalmers I., Ioannidis J.P.A.,
Al-Shahi Salman R., Chan A.-W., Glasziou P. Biomedical research: increasing
value, reducing waste. The Lancet, 2014, 383(9912): 101-104
• Of 1575 reports about cancer prognostic markers
published in 2005, 1509 (96%) detailed at least one
significant prognostic variable.
• However, few identified biomarkers have been
confirmed by subsequent research and few have
entered routine clinical practice.
• This pattern — initially promising findings not leading
to improvements in health care — has been recorded
across biomedical research.
• So why is research that might transform health care
and reduce health problems not being successfully
produced?
20
21. Ioannidis J.P.A.
Why most published
research findings are false.
PLoS Med., 2005; 2(8): e124.
Cited by 2174
21
23. • PLOS ONE Launches Reproducibility Initiative
• http://validation.scienceexchange.com/#/
• Reproducibility Initiative receives $1.3M grant to validate 50 landmark
cancer studies
• Reproducibility Project: Psychology
• https://osf.io/ezcuj/wiki/home/
• Special Section on Replicability in Psychological Science.
Perspectives on Psychological Science, 2012; 7(6): 528–530.
23
24. • Journal of Negative Results in BioMedicine is
an open access, peer-reviewed, online
journal that provides a platform for the
publication and discussion of unexpected,
controversial, provocative and/or negative
results in the context of current tenets.
• Editor-in-Chief
• Bjorn R Olsen, Harvard Medical School
24
25. Challenges in irreproducible research
• No research paper can ever be considered to be the
final word, and the replication and corroboration of
research results is key to the scientific process.
• In studying complex entities, especially animals and
human beings, the complexity of the system and of
the techniques can all too easily lead to results that
seem robust in the lab, and valid to editors and
referees of journals, but which do not stand the test
of further studies.
• http://www.nature.com/nature/focus/reproducibility/index.html
25
26. Statistics
“A subject which most statisticians
find difficult but in which nearly all
physicians are expert.”
26
27. • Statistical flaws are a major cause of irreproducible
results in all types of biomedical experimentation.
• These include errors in trial design, data analysis, and
data interpretation.
• “If experimentation is the Queen of the sciences,
surely statistical methods must be regarded as the
Guardian of the Royal Virtue.”
• Myron Tribus
(Letter to Science)
27
28. Statistical Babel
• Unfortunately, statisticians speak different languages, and often
do not hear and/or do not understand each other.
• Two main approaches to statistical inference are developing:
• Bayesian and
• Frequentist
• Frequentist inference is subdivided into two main branches:
• Fisherian and
• Neyman–Pearsonian
• Users do not always differentiate them, which leads to serious
confusion.
• Two other approaches also exist: likelihood and fiducial
inference.
• http://en.wikipedia.org/wiki/Frequentist_inference
28
30. Fundamental statistics principles
• Random sampling is the main principle of statistics.
• Randomness and the Law of Large Numbers ensure
that a sample is representative.
• A sample is called representative if it correctly reflects
the distribution from which it is taken.
• The main objective of statistics is to analyze
random samples in order to draw conclusions about the
distributions from which they are drawn.
• Note that we do not need the term “population”,
which can be misleading.
30
31. Statistics with confidence
• Can we trust statistical conclusions?
• For instance, how can we check whether a die
is perfect (fair, ideal, symmetric) or not?
• The answer is provided by the Law of
Large Numbers.
31
32. Simulation of rolling a die: program SUStats
http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
A die was rolled 100 times in each of four independent simulations.
Please answer three questions:
1. Are the results of the rolling reproducible (i.e. are the histograms similar)?
- Yes
- No
2. What shape of the histogram and the underlying distribution do we expect
for the results of rolling a fair die?
- Unimodal, bell-shaped
- Triangular
- Uniform (rectangular)
3. Can we state that the die is fair?
- Yes
- No
32
33. Simulation of rolling a die: program SUStats
http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
A die was rolled 1 000 times in each of four independent simulations.
Please answer two questions:
1. Are the results of the rolling reproducible (are the histograms similar)?
- Yes
- No
2. Can we state that the die is certainly fair?
- Yes
- No
33
34. Simulation of rolling a die: program SUStats
http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
A die was rolled 10 000 times in each of four independent simulations.
Please answer two questions:
1. Are the results of the rolling reproducible (are the histograms similar)?
- Yes
- No
2. Can we state that the die is certainly fair (the histograms are certainly
rectangular and the entire distribution is uniform)?
- Yes
- No
34
35. Simulation of rolling a die: program SUStats
http://www.jsc.nildram.co.uk/examples/sustats/diescore/DieScoreApplet.html
Please keep in mind this last number, n = 10 000, which gives reliable
results. Such sample sizes are difficult to achieve in biomedicine, but they
are what makes results really reliable.
35
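The die-rolling simulations above are easy to reproduce in code; this is a minimal sketch in Python (not the SUStats applet itself), assuming a fair six-sided die:

```python
import random
from collections import Counter

def roll_frequencies(n_rolls, seed=0):
    """Roll a fair die n_rolls times; return the relative frequency of each face."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

# By the Law of Large Numbers, fluctuations around the expected
# probability 1/6 shrink as the number of rolls grows.
for n in (100, 1_000, 10_000):
    freqs = roll_frequencies(n)
    max_dev = max(abs(f - 1 / 6) for f in freqs.values())
    print(f"n = {n:>6}: max deviation from 1/6 = {max_dev:.4f}")
```

Repeating the loop with different seeds mimics the four independent simulations on the slides: the histograms only look reliably rectangular at the larger n.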
36. Lyrical digression
• On reflection, it is the
• Pauli exclusion principle
• that provides the variety of forms
• of matter at all levels,
• from atoms to living beings,
• e.g., genetic and phenotypic (biochemical,
physiological, morphological) variation.
36
37. Sample size
“She thought that a smaller sample
size makes for more accurate results”
37
38. Sample sizes in physics, chemistry, biology and
medicine
• Physicists and chemists work with samples of different
substances which contain 6×10²³ (the Avogadro constant)
particles (atoms or molecules) in 1 mole of a pure substance.
• Even 1 nanomole of a given substance contains about 10¹⁴ such
particles.
• These particles may be regarded as practically identical.
• However, we should not forget that even at the atomic level
there are several isotopes of a given chemical element.
• And some of them are radioactive.
• In medicine, researchers are limited by the size of the world
population, which is less than 10¹⁰, specifically about 7.257×10⁹.
• See real-time: http://www.worldometers.info/world-population/
• And the human population is extremely heterogeneous.
38
39. Principal contradiction
• All people are dissimilar, even monozygotic (“identical”)
twins.
• In such twins, differences in copy number variation
(CNV), immunoglobulins, and fingerprints are observed.
• Surely this fact is one of the main sources of the low
reproducibility and predictive ability of results in
biomedicine.
• Thus, the genetic and phenotypic uniqueness of each person
contradicts the statistical methodology,
which requires the analysis of large numbers (thousands or at
least hundreds) of identical persons to reach firm
conclusions.
39
40. What is the Law of Large Numbers?
• If the probability P(A) of an event A is constant in all trials, then the larger n -
the number of trials (experiments, sample size),
• the closer the observed (empirical, experimental) relative frequency, f(A), of
a given outcome (event) A converges to its expected (theoretical) probability
P(A):
• f(A) → P(A) as n → ∞ (convergence in probability)
• This means that the frequencies become more and more stable and their
fluctuations become smaller and smaller.
• Corollary:
• Thus, we may not know the probability of an event A, but by repeating the trial
as many times as possible, we can accept its observed frequency f(A) as a reliable
statistical estimate of the unknown probability P(A).
• Statistics helps us to know the unknown.
• In Probability Theory probabilities are known; Statistics estimates them.
40
41. “Reverse side” of the Law of Large Numbers
• Simultaneously with the convergence of the frequency
of an event A to its probability, the situation in which the
frequency coincides exactly with its
probability, f(A) = P(A),
• becomes less and less probable;
• i.e. the larger the number of trials, the closer the probability
of such an exact match converges to zero:
• Pr[f(A) = P(A)] → 0 as n → ∞
41
42. Probability of the exact coincidence of the frequency f(A) with
the probability P(A), e.g., fair coin tossing with P(A) = φ = 0.5

f(A)                    Pr[f(A) = P(A)]
5/10                    0.25
50/100                  0.080
500/1 000               0.025
5 000/10 000            0.0080
50 000/100 000          0.0025
500 000/1 000 000       0.00080

For the sake of clarity, the probability values are rounded to
two significant figures.
42
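The values in this table can be checked directly: the probability of exactly n/2 heads in n fair-coin tosses is C(n, n/2)/2ⁿ. A short sketch (for the larger n in the table, the exact computation gets slow, so only the first rows are recomputed here):

```python
from math import comb

def prob_exact_half(n):
    """Probability of exactly n/2 heads in n fair-coin tosses (n even)."""
    return comb(n, n // 2) / 2 ** n

# Recompute the first rows of the table; Python's big integers keep this exact.
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: Pr[f(A) = 1/2] = {prob_exact_half(n):.5f}")
```

Rounded to two significant figures these agree with the slide: 0.25, 0.080, 0.025, 0.0080. The values fall off like √(2/(πn)), which is exactly the "reverse side" of the Law of Large Numbers described above.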
43. Consequences of the Law of Large Numbers
(LLN)
• According to the Law of Large Numbers, the larger the
sample size n,
• the “better” (more accurately, more reliably) the sample data
reflect the distribution of the random variable from which the
sample is drawn.
• Consequently, the larger the sample size, the more
representative the sample.
• This is true, however, if and only if (iff) the sample data are
realizations of independent identically distributed
(iid) random variables.
43
45. What are the main objectives of statistics?
• Statistical Estimation (of the parameters)
• Point and interval estimations
• Statistical Inference
– Testing Statistical Hypotheses
– Comparison of Models
• Statistical Associations
• Correlation and Regression
45
46. What is Estimator and what is Estimate?
• An “Estimator“ is a statistic that is used to infer the value of
an unknown parameter in a statistical model.
• The parameter being estimated is sometimes called
the estimand.
• In other words, an estimator is a rule for calculating an
estimate of a given quantity based on observed data:
• thus the rule and its result (the estimate) are distinguished.
46
47. Two main kinds of Statistical Estimates
• Point Estimate – estimation by a single
number.
• Interval Estimate – estimation by an interval,
which covers the value of the estimated
parameter with a given probability called the
confidence level.
47
48. The main logic of Statistical Estimation: Point
Estimates
• Usually the parameter φ is unknown.
• The objective is to estimate it on the basis of observed statistical data
• x1, x2, …, xi, …, xn.
• These values are regarded as realizations of corresponding iid
random variables:
• X1, X2, …, Xi, …, Xn.
• An appropriate function of these random variables is chosen as an estimator
of the unknown parameter.
• Any such function is called a “statistic”, and it is itself a random variable.
• Calculated values of a chosen estimator are called estimates.
• An estimate is regarded as a realization of the given estimator.
48
49. Compression of statistical information
• One of the most widely used statistics is the sample
mean, which plays the role of the estimate of the
mean value of the underlying distribution.
• It is calculated as:
• M = (1/n) (x1 + x2 + … + xn)
• And it is generated by the estimator:
• M̃ = (1/n) (X̃1 + X̃2 + … + X̃n)
• Here the tilde “~” is a symbol of a random variable.
49
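The estimator/estimate distinction is easy to see in code: the function below is the estimator (the rule), and each value it returns for a concrete sample is an estimate (a realization). A minimal sketch with made-up data:

```python
def sample_mean(xs):
    """Estimator: the rule that maps observed data x1..xn to an estimate of the mean."""
    return sum(xs) / len(xs)

# Two different samples (realizations of the same iid random variables)
# yield two different estimates from the one estimator.
sample_1 = [38, 42, 58, 59]
sample_2 = [70, 71, 81, 86]
print(sample_mean(sample_1))  # one realization of the estimator
print(sample_mean(sample_2))  # another realization
```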
51. • Let us consider one of the most common
problems of statistical analysis: comparing two
independent samples.
51
52. IUGR – intrauterine growth restriction
(old name “intrauterine growth retardation”)
• Foetuses with birth weight below the 10th percentile of those born at the
same gestational age,
• or
• two standard deviations below the population mean, are considered
growth restricted.
• Note that the definition is based on statistical terms: the 10th percentile
and/or standard deviations.
• More strictly, IUGR should refer to foetuses that are small for gestational
age and display other signs of chronic hypoxia or failure to thrive.
• Approximately 3–5% of all pregnancies.
• IUGR is also known as SGA (small for gestational age).
52
56. Levels of induced production of IFN-α/β in 16 healthy mothers of
healthy newborns and in 20 mothers of newborns with IUGR
(intrauterine growth restriction) (Koroleva L.I.). Data are ranked.

Healthy (rank: IFN-α/β, IU/ml)
1: 38    2: 42    3: 58    4: 59    5: 70    6: 71    7: 81    8: 86
9: 92    10: 93   11: 94   12: 101  13: 103  14: 115  15: 159  16: 170

IUGR (rank: IFN-α/β, IU/ml)
1: 104   2: 121   3: 123   4: 123   5: 127   6: 130   7: 132   8: 134   9: 134   10: 140
11: 144  12: 146  13: 147  14: 149  15: 151  16: 153  17: 162  18: 168  19: 171  20: 173

Only three values in the healthy group (115, 159, 170) overlap with the values in
the IUGR group. The level of IFN-α/β in the IUGR group stochastically dominates that in the healthy group.
56
57. Exploratory and Pictorial Statistics.
Visualization of the initial data and their preliminary
statistical descriptions:
histograms, box plots, dominance diagrams, etc.
57
58. Comparisons of histograms for the levels of induced production of
IFN-α/β in 16 healthy mothers of healthy newborns and in 20
mothers of newborns with IUGR. Free program PAST
http://folk.uio.no/ohammer/past
58
59. Comparisons of histograms and cumulative sample distributions for the levels of
induced production of IFN-α/β in 16 healthy mothers of healthy newborns and in 20
mothers of newborns with IUGR.
Program XLSTAT http://www.xlstat.com
[Figure: cumulative distributions (Healthy / IUGR) and density histograms of
IFN-α/β (IU/mL), with fitted normal curves: Healthy – Normal(89.500, 36.471);
IUGR – Normal(141.600, 18.323).]
59
60. CDF – cumulative distribution functions and stochastic
dominance
Program XLSTAT http://www.xlstat.com
• The level of induced IFN-α/β in IUGR patients (green
line) stochastically dominates that for healthy
mothers (blue line):
• X2 > X1
• Stochastic – randomly determined; having a
random probability distribution or pattern that
may be analyzed statistically but may not be
predicted precisely.
[Figure: cumulative relative frequencies for the IUGR and Healthy groups.]
60
61. Box-and-Whisker plot
Q1 – first quartile, Q3 – third quartile, IQR – interquartile range, σ – standard deviation.
61
62. Box-and-whisker plot for the levels of induced production of
IFN-α/β in 16 healthy mothers of healthy newborns and in 20
mothers of newborns with IUGR. Free program: Instat+
http://www.reading.ac.uk/ssc/n/n_instat.htm
[Plot annotations: marks for outliers; medians; 95% confidence limits for medians.]
What did the Box Plot say to the outlier? "Don't you dare get close to my whisker!!"
62
63. What is an outlier?
• An outlier is an observation that is numerically distant from the rest of the
data.
• Outliers are often indicative of measurement (or registration) errors.
• For example, if an arterial blood pressure value of 1100 is
registered, this could be a misprint: either the 1 or a 0 is probably redundant.
• Removing outlier(s) is a controversial practice recommended in
several textbooks and manuals.
• However, the possibility should be considered that the underlying
distribution of the data is not approximately normal, having "fat (heavy)
tails“ or representing a mixture of two or more different distributions.
• A mixture may comprise two identical distributions shifted relative to
each other.
• Thus, removal of outlier(s) has to be based on extra-statistical
considerations.
• “I'm not an outlier; I just haven't found my distribution yet!”
63
64. Mixture analysis
Program PAST

Component proportion   Mean, M   Standard Deviation, SD
0.88                   78.8      22.5
0.12                   164.5     5.5

The data in the healthy group can be regarded as a
mixture of two normal distributions.
Their proportions are 88% and 12%.
The major component has sample mean
about M = 79 IU/mL and standard deviation
SD = 23 IU/mL.
The minor component has M = 165 IU/mL
and standard deviation SD = 5.5 IU/mL.
However, the sample size (n1 = 16) is too
small to draw firm conclusions.
64
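A two-component mixture like the one fitted above is easy to simulate; this sketch (not the PAST fitting procedure itself) draws from 0.88·N(78.8, 22.5) + 0.12·N(164.5, 5.5) and checks that the mixture mean, 0.88·78.8 + 0.12·164.5 ≈ 89.1, is close to the healthy-group sample mean of about 89.5 seen earlier:

```python
import random

def sample_mixture(n, seed=0):
    """Draw n values from the two-component normal mixture fitted to the healthy group."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < 0.88:                # major component, proportion 0.88
            values.append(rng.gauss(78.8, 22.5))
        else:                                  # minor component, proportion 0.12
            values.append(rng.gauss(164.5, 5.5))
    return values

data = sample_mixture(100_000)
print(sum(data) / len(data))   # close to the theoretical mixture mean of about 89.1
```

This also illustrates the slide on outliers: the few draws from the minor component look like "outliers" of the major one, although nothing is wrong with the data.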
66. • Recommendations for the Conduct, Reporting, Editing, and Publication
of Scholarly Work in Medical Journals. Updated December 2013.
• iii. Statistics
• Describe statistical methods with enough detail to enable a
knowledgeable reader with access to the original data to judge its
appropriateness for the study and to verify the reported results.
• When possible, quantify findings and present them with appropriate
indicators of measurement error or uncertainty (such as confidence
intervals).
• Avoid relying solely on statistical hypothesis testing, such as P values,
which fail to convey important information about effect size and
precision of estimates.
• http://www.icmje.org/recommendations/
• Prediction probabilities and prediction intervals should be added.
66
67. • Over 300 medical and biomedical journals
are guided by the ICMJE recommendations.
67
68. Effect Size, ES
• The question of the clinical (practical) importance of the observed
• Effect Size (ES)
• is key when interpreting results of biomedical investigations (e.g., clinical
trials).
• Effect size is defined as a quantitative reflection of the magnitude of some
phenomenon that is used for the purpose of addressing a question of
interest.
• Kelley K., Preacher K.J. On Effect Size. Psychological Methods, 2012; 17(2):
137–152.
• An ES can be the difference between mean values, various kinds of ratios, a
correlation, an association, etc.
• An ES can be expressed either in real measurement units, or
• as a standardized (nonmetric) quantity.
68
69. • By analyzing samples we draw conclusions about the
distributions from which they are drawn.
• In the case of comparing two independent
distributions, the simplest and most useful measure of
effect size is AUC (or AUROC) – the Area Under the (ROC)
Curve, which is related to the Mann–Whitney U-statistic.
• One of its representations is the so-called dominance
diagram.
69
71. Dominance diagram
Program XLSTAT http://www.xlstat.com
[Dominance diagram: Healthy vs. IUGR.]
71
Umin = 35 is the number of “plus” signs, and Umax = 285 is the number of “minus” signs,
and obviously: Umin + Umax = 35 + 285 = n1 × n2 = 16 × 20 = 320
72. • For two independent random variables X and Y,
• Θ = P(Y > X) + ½ P(Y = X)
• is advocated as a general measure of effect size to characterize the
degree of separation (or, conversely, overlap) of their distributions.
• It is estimated by the statistic
• θ̂ = AUC = Umax / (n1 × n2),
• derived by dividing the larger observed value Umax of the Mann–Whitney
statistic by the product of the two sample sizes.
• It is equivalent to the observed value of AUC – the area under the receiver
operating characteristic (ROC) curve.
• It has been termed the ‘probability of concordance’, ‘common language
effect size’ and ‘measure of stochastic superiority’.
72
73. AUC – area under the (ROC) curve
• In the given rectangular matrix, the total cell number
is the product of the two sample sizes:
• n1 × n2 = 20 × 16 = 320
• The observed maximum of the two additive
components of the Mann–Whitney U-statistic is the
number of yellow cells in the matrix:
• Umax = 285
• So the point estimate of AUC is:
• AUC = Umax / (n1 × n2) = 285/320 = 0.89
73
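The values Umax = 285 and AUC = 285/320 ≈ 0.89 can be recomputed directly from the ranked data of slide 56 by counting (healthy, IUGR) pairs; a minimal sketch:

```python
healthy = [38, 42, 58, 59, 70, 71, 81, 86, 92, 93, 94, 101, 103, 115, 159, 170]
iugr = [104, 121, 123, 123, 127, 130, 132, 134, 134, 140,
        144, 146, 147, 149, 151, 153, 162, 168, 171, 173]

# Mann-Whitney U: count the pairs in which the IUGR value exceeds the
# healthy value (ties would contribute 1/2 each; there are none here).
u_max = sum((y > x) + 0.5 * (y == x) for x in healthy for y in iugr)
u_min = len(healthy) * len(iugr) - u_max
auc = u_max / (len(healthy) * len(iugr))

print(u_max, u_min)   # 285.0 35.0
print(round(auc, 4))  # 0.8906
```

The same numbers follow from scipy.stats.mannwhitneyu; the pair-counting form is shown here because it makes the identity AUC = Umax/(n1 × n2) explicit.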
74. Interval estimation
Researchers should, wherever
possible, base discussion and
interpretation of results on point
and interval estimates.
74
75. What is a Confidence Interval?
• A frequentist Confidence Interval is a random
interval that covers the estimated (unknown)
value of a given parameter with a specified
probability.
• This probability is called the confidence level (or
confidence coefficient).
75
76. CI
• If the experiment is repeated several times, the observed
values of the limits of the Confidence Interval calculated
from the observations will vary from sample to sample.
• With probability (1 − α) the interval will include (cover)
the estimated unknown value of the parameter, but with
probability α it will inevitably miss the estimated value.
• How frequently the observed interval contains the
parameter is determined by the confidence level (or
confidence coefficient).
• The confidence level is chosen by the researcher in accordance
with his or her intuition.
76
78. The meaning of the Confidence Level
• The meaning of the term “confidence level” is that, if
confidence intervals are constructed across many separate
data analyses of repeated (and possibly different)
experiments, the proportion of such intervals that contain
the true value of the parameter will approximately match
the confidence level.
• So, e.g., the 95% does not attach to a single frequentist CI;
it attaches to “the proportion of such intervals”.
• When only a single CI is obtained, it is unknown whether it
covers the true value or not.
• Again, we come to the conclusion that the experiment
needs to be repeated many times.
78
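The frequentist meaning of the confidence level can be demonstrated by simulation; a sketch assuming samples drawn from a normal distribution with known σ, so each 95% CI is simply mean ± 1.96·σ/√n:

```python
import random
from math import sqrt

def coverage(n_experiments=10_000, n=50, mu=0.0, sigma=1.0, seed=0):
    """Fraction of 95% confidence intervals that cover the true mean mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / sqrt(n)   # known-sigma normal CI half-width
    hits = 0
    for _ in range(n_experiments):
        m = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if m - half_width <= mu <= m + half_width:
            hits += 1
    return hits / n_experiments

# The 95% attaches to the procedure across repeated experiments,
# not to any single interval.
print(coverage())
```

Any individual interval either covers μ or misses it; only the long-run proportion of intervals, close to 0.95 here, carries the confidence level.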
80. Significance Level α and
Confidence Level (1 – α)

Significance level, α   Confidence level, (1 – α)   Reliability
0.05                    95%                          Low
0.01                    99%                          Medium
0.001                   99.9%                        High

80
81. Confidence interval and statistical significance
Expected value θ; 100(1 – α)% CI for the unknown value θunkn:
• If the CI covers the expected value θ: the unknown value θunkn
estimated by the given interval does not differ statistically
from the expected value θ.
• If the CI lies entirely above θ: the unknown estimated value
θunkn is statistically significantly larger than the expected
value θ at the significance level α.
• If the CI lies entirely below θ: the unknown estimated value
θunkn is statistically significantly smaller than the expected
value θ at the significance level α.
81
82. Statistical significance and practical (clinical) importance
Comparing the CI with the expected “null” value and a clinically indifferent zone (reference interval), four cases arise:
• The estimated unknown difference is statistically nonsignificant and clinically unimportant.
• The CI is too wide; perhaps the sample size is too small.
• The estimated unknown difference is statistically significant, but clinically unimportant.
• The estimated unknown difference is statistically significant and clinically important.
82
83. Compact form for the joint presentation of point and interval estimates
• Example:
– AUC point estimate: 0.89
– Lower limit of the 95% CI: 0.72
– Upper limit of the 95% CI: 0.96
• Compact record (lower limit as subscript, upper limit as superscript):
• AUC θ = 0.72_0.89^0.96
• Louis T.A., Zeger S.L. Effective communication of standard errors and confidence intervals. Biostatistics, 2009; 10(1): 1–2.
• Newcombe’s spreadsheet: GENERALISEDMW.XLS http://medicine.cf.ac.uk/primary-care-public-health/resources/
83
84. Statistical inference
using confidence interval
• The obtained 95% confidence interval (CI) does not cover the indifferent value AUCindiff = 0.5.
• This means that the unknown value AUCunkn estimated by this interval differs statistically significantly from the indifferent value AUCindiff = 0.5 (at the significance level α = 0.05).
• Consequently, we can conclude that one of the two compared random variables stochastically dominates the other.
• When the shapes of both distributions are similar, we can interpret this result as a statistically significant deviation of the estimated Hodges–Lehmann shift parameter from its indifferent value ΔHLindiff = 0.
84
85. • Strictly speaking, the widespread interpretation of the Mann–Whitney U-statistic as a measure of the difference between the medians of two compared distributions is incorrect.
• The Mann–Whitney statistic is a measure of the stochastic dominance of one of two independent distributions (not of their medians).
• When the shapes of both distributions are similar, the Mann–Whitney statistic becomes the basis for estimating the Hodges–Lehmann shift parameter.
85
87. Applying the nonparametric confidence interval for the shift parameter to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR. Program StatXact http://www.cytel.com/software-solutions/statxact
• The resulting nonparametric Hodges–Lehmann point and interval estimates of the shift parameter are:
• ΔHL = 38_56^74 IU/mL (point estimate 56; 95% CI from 38 to 74)
• This 95% confidence interval doesn’t cover the indifferent value of the shift Δindiff = 0.
• So the unknown value of the shift Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
87
88. Applying the parametric confidence interval for the mean difference to the comparison of the induced production of IFN-α/β in the healthy group and the group with IUGR.
Free program ESCI JSMS.xls http://www.latrobe.edu.au/psy/esci/
• The parametric point and interval estimates of the difference of the two means are:
• Δ = 33_52^71 IU/mL (point estimate 52; 95% CI from 33 to 71)
• This 95% confidence interval doesn’t cover the indifferent value Δindiff = 0.
• So the unknown value of the difference Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
88
ES Δ = 33.1_52.1^71.0 IU/mL; dC = 1.87; Student t = 5.58
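For reference, the pooled-variance CI behind such a mean-difference computation can be sketched as follows (Python; the two samples are hypothetical, not the IFN-α/β data, and the t critical value is a table value for df = 10):

```python
import math
import statistics

# Pooled two-sample 95% CI for a difference of means (Student's method),
# the parametric analogue of the slide's ESCI computation.
def mean_diff_ci(x, y, t_crit):
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.fmean(x), statistics.fmean(y)
    s1, s2 = statistics.stdev(x), statistics.stdev(y)
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    d = m1 - m2
    return d - t_crit * se, d, d + t_crit * se

x = [62, 55, 71, 49, 58, 66]           # hypothetical "IUGR" group
y = [12, 20, 8, 15, 11, 17]            # hypothetical "healthy" group
lo, d, hi = mean_diff_ci(x, y, 2.228)  # t_0.975 for df = 10 (table value)
print(f"{lo:.1f} < {d:.1f} < {hi:.1f}")
```

If the whole interval lies above 0, the difference is statistically significant at α = 0.05, as on the slide.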
89. Visualization of the comparison of two means using the confidence interval for the mean difference. Free program ESCI JSMS.xls http://www.latrobe.edu.au/psy/esci/
• The presented 95% confidence interval (rose triangle and vertical segment) for the mean difference doesn’t cover the indifferent value Δindiff = 0.
• So the unknown value of the difference Δunkn estimated by this interval differs statistically significantly from 0 at the significance level α = 0.05.
• Therefore the induced production of IFN-α/β in the IUGR group is statistically significantly higher than in the healthy group.
89
Blue circles are observed values. Black dots and vertical segments are point and interval estimates of the unknown means. The rose triangle and vertical segment are estimates of their unknown difference.
90. Newcombe’s standardized
effect size: δN or StAUC
• When σ1 = σ2 = σ, θ reduces to
• Φ(δN /√2),
• where δN is the difference expressed in units of the standard deviation σ.
• Here Φ is the common notation for the CDF (cumulative distribution function) of the standard Gaussian (normal) distribution.
• θ is preferable to δN, as it depends less on distributional assumptions and is thus more satisfactory than the standardized difference.
90
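The relation θ = Φ(δN/√2) is easy to evaluate, e.g. with the error function from the standard library (a sketch; the value 1.9 is taken from the deck’s later summary, dC = 1.9, purely as an illustration):

```python
import math

# Under equal-variance Gaussian assumptions the AUC (theta) and the
# standardized difference delta are linked by theta = Phi(delta / sqrt(2)),
# where Phi is the standard normal CDF (expressed here via erf).
def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def auc_from_delta(delta):
    return normal_cdf(delta / math.sqrt(2.0))

print(round(auc_from_delta(0.0), 2))  # 0.5: no separation between the groups
print(round(auc_from_delta(1.9), 2))  # for a standardized difference of 1.9
```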
94. Verbal scale for the interpretation of the standardized Cohen’s effect size

Standardized Cohen’s effect size, dC | Interpretation
0 – 0.5   | Negligibly small (worthless)
0.5 – 1.0 | Small (weak)
1.0 – 1.5 | Moderate
1.5 – 2.0 | Large (strong)
2.0 – 3.0 | Very large (very strong)
3.0 –     | Extremely large
94
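A sketch of how the scale above could be applied in code (Python; the thresholds follow the slide’s table, and the input samples are purely illustrative):

```python
import math
import statistics

# Cohen's standardized effect size d_C = (M1 - M2) / s_pooled,
# mapped to the verbal scale from the slide.
def cohens_d(x, y):
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * statistics.variance(x) +
           (n2 - 1) * statistics.variance(y)) / (n1 + n2 - 2)
    return (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(sp2)

def interpret(d):
    d = abs(d)
    for bound, label in [(0.5, "negligibly small"), (1.0, "small"),
                         (1.5, "moderate"), (2.0, "large"), (3.0, "very large")]:
        if d < bound:
            return label
    return "extremely large"

d = cohens_d([5.1, 6.0, 5.5, 6.3], [3.0, 3.4, 2.8, 3.6])  # illustrative data
print(interpret(d))
```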
95. Once more: Statistical significance and
the Effect size
• An effect (difference, association, correlation, risk, benefit, etc.) can be statistically significant, and yet its practical (e.g., clinical) importance can turn out to be worthless.
• “Statistically significant” does not imply “substantial”, “practically important”, or “valuable”.
• Effects can be real, nonrandom, but nonetheless negligibly small.
95
96. Confidence interval for the Standardized
Cohen’s Effect Size dC. Free Program LePrep
http://www.univ-rouen.fr/LMRS/Persopage/Lecoutre/PAC.htm
96
97. Results: point estimates and 95% confidence
intervals for the three main effect sizes
• AUC – area under the ROC curve:
• AUC = 0.72_0.89^0.96
• StAUC – Newcombe’s standardized AUC:
• StAUC = δN = 0.8_1.7^2.5
• StES – Cohen’s standardized difference of means:
• StES = dC = 1.1_1.9^2.7
• (Compact records: lower 95% limit _ point estimate ^ upper 95% limit.)
• Verbal interpretation:
• with probability 95% the estimated unknown effect sizes can be interpreted as ranging from medium to very large (strong).
97
99. Repeat!
• It is often believed that obtaining a “statistically significant” result removes the need to repeat the experiment.
• Testing the significance of statistical hypotheses is a method to detect rare events which deserve further investigation.
• Fisher
99
100. Cumming G. The New Statistics:
Why and How. Psychological Science,
2014; 25(1): 7 –29.
• Three problems are central:
• published research is a biased selection of all research;
• data analysis and reporting are often selective and biased; and
• in many research fields, studies are rarely replicated, so false conclusions persist.
100
101. Replication
• A single study is rarely, if ever, definitive; additional related evidence is required.
• Such evidence may come from a close replication, which, with meta-analysis, should give more reliable estimates than the original study.
• A more general replication may increase reliability and also provide evidence of the generality or robustness of the original finding.
• We need increased recognition of the value of both close and more general replications, and greater opportunities to report them.
101
102. Reproducibility and predictive ability of P-values and
confidence intervals (n = 32). CI dance.
Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/.
Cumming G. Replication and p intervals: p values predict the future only vaguely, but
confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.
102
103. • Thus, it is risky to reach a definite conclusion from a single experiment only.
• Any scientific investigation should be repeated many times.
• And the reproducibility of the results must be studied.
103
104. Gigerenzer G. We need statistical thinking,
not rituals. Behavioral and Brain Sciences,
1998; 21(2): 199-200
• A researcher cannot be unconcerned about:
• “what would happen if additional subjects were included in the experiment?”,
• “what would be the conclusion for the data of these future subjects?”,
• “what would be the conclusion for the whole data set?”, or
• “what would happen if this experiment were repeated?”
• Asking and answering such questions goes beyond the ritualized statistical procedures, and is likely to influence the way the authors of scientific papers interpret experimental findings and conduct their experiments.
• Prediction probabilities are an unavoidable part of statistical thinking, and the time has come to take them seriously.
104
105. Prediction and confidence intervals.
Program Instat+ http://www.reading.ac.uk/ssc/n/n_instat.htm
105
106. Reproducibility of the absolute effect size ES for the healthy
and IUGR groups at α = 0.05 and (1 – α) = 0.95
106
The 95% confidence interval for ES Δ is from 33 to 71 IU/mL;
the 95% prediction interval for it is wider: from 25 to 78 IU/mL.
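The CI-vs-PI contrast can be illustrated in the simplest single-sample setting, where the 95% prediction interval for the mean of a future sample of the same size is wider than the CI by exactly √2 (a simplified sketch, not the effect-size intervals computed by Instat+; the data are illustrative):

```python
import math
import statistics

# For a sample of size n from a normal population, the 95% CI for the mean
# has half-width t * s / sqrt(n), while the 95% prediction interval for the
# mean of a FUTURE sample of the same size has half-width
# t * s * sqrt(1/n + 1/n) -- wider by a factor of sqrt(2).
data = [48, 55, 60, 41, 52, 58, 49, 62, 45, 57,
        53, 47, 59, 50, 56, 44, 61, 51, 54, 46]  # illustrative values
n = len(data)
m = statistics.fmean(data)
s = statistics.stdev(data)
t = 2.093  # 97.5th percentile of Student's t with df = n - 1 = 19 (table value)

ci_half = t * s / math.sqrt(n)
pi_half = t * s * math.sqrt(1 / n + 1 / n)
print(f"CI: {m:.1f} +/- {ci_half:.1f};  PI for a future mean: {m:.1f} +/- {pi_half:.1f}")
print(round(pi_half / ci_half, 3))  # 1.414, i.e. sqrt(2)
```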
107. 10-fold increase of the sample size
107
If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.
108. Prediction interval versus confidence interval
• Note that under 10-fold repetition of the experiment the 95% prediction interval becomes closer to the observed 95% confidence interval.
• This demonstrates the meaning of the confidence interval as the one which covers the estimated effect size under manifold (in the limit, infinite) repetitions of the experiment.
108
109. Reproducibility of the standardized Cohen’s effect size dC for
the healthy and IUGR groups at α = 0.05 and (1 – α) = 0.95
109
The 95% confidence interval for StES dC is from 1.1 to 2.7;
the 95% prediction interval for it is wider: from 0.8 to 3.1.
110. 10-fold increase of the sample size
110
If we repeat the experiment 10 times independently, the prediction interval becomes narrower and closer to the confidence interval.
111. Prediction probabilities, Prep, Psrep and Preprep
111
The probability of a same-sign effect is Prep = 1.0; of a same-sign effect significant at α = 0.05, Psrep = 0.99; and of a same-sign effect with Prep = 0.99, Preprep = 0.98.
112. Reproducibility of the P-value when comparing healthy and IUGR
groups at α = 0.05 and (1 – α) = 0.95
112
The observed Pval = 3·10⁻⁶. The 95% prediction interval for it runs from the extremely small 3·10⁻¹¹ to the moderate 0.01.
113. Probabilities of replication and prediction intervals
• Thus, it is predicted that when our experiment is repeated, the probability of obtaining the same sign for the mean difference (expressed as the absolute effect size ES as well as Cohen’s standardized effect size dC) will be
• Prep = 1.00.
• And the probability of obtaining a difference of the same sign that is statistically significant at the level α = 0.05 will be
• Psrep = 0.99.
• Moreover, it is predicted that in a future repetition of the experiment, the P-value could lie in a very wide 95% prediction interval, from very low to rather moderate:
• from Pval = 3·10⁻¹¹ to Pval = 0.01.
113
114. Main statistical tools and their purposes
• Bayes factor (BF) → comparing statistical models and/or hypotheses
• P-value → statistical hypothesis testing
• Effect size (ES) → practical (clinical) importance
• Confidence intervals (CI) → visualization of both the estimates and the hypothesis testing
• Prediction intervals (PI) → prediction of future repetitions
114
115. Bayes theorem in action:
connecting prior and posterior
probabilities
115
117. Bayes Factor
• The Bayes factor differs fundamentally from the P-value (Pval).
• The Bayes factor is not a probability in itself, but a ratio of probabilities, and it can vary from zero to infinity:
• BF01 = P(Dobs|H0) / P(Dobs|H1)
• BF10 = P(Dobs|H1) / P(Dobs|H0)
• This means that using the Bayes factor provides not only a test of the significance of the null hypothesis, but a comparison of the probabilities of obtaining the observed data under both hypotheses.
• However, for this we should have a better idea of the alternative hypothesis.
117
119. What are the odds?
• The odds (in favor) of an event A are the ratio of the probability that the event will happen, P(A), to the probability that it will not happen, P(Ā):
• O(A) = P(A) : P(Ā) = P(A) : [1 – P(A)]
• Conversely, the odds against an event A are the opposite ratio.
• Such a representation of probability is familiar to geneticists.
• Mendel’s famous ratio of 3 : 1 is a representation of the probabilities 3/4 and 1/4 in terms of odds.
119
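The odds–probability conversion above is a one-liner in code; a small sketch using Mendel’s 3 : 1 ratio:

```python
# Converting between probability and odds, as in the Mendelian
# 3 : 1 ratio (P = 3/4 in favor, 1/4 against).
def odds(p):
    return p / (1.0 - p)

def prob(o):
    return o / (1.0 + o)

print(odds(0.75))  # 3.0  -> odds of 3 : 1
print(prob(3.0))   # 0.75 -> back to P(A) = 3/4
```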
120. Bayes factor BF in terms of odds
• The Bayes factor not only shows how many times the probability P(Dobs|H0) differs from the probability P(Dobs|H1).
• It also shows how many times the posterior odds in favor of one hypothesis against the other (alternative) differ from their prior odds:
• BF01 = P(Dobs|H0) / P(Dobs|H1) = [P(H0|Dobs) : P(H1|Dobs)] / [P(H0) : P(H1)]
• Conversely, BF01 = 1/BF10.
• Thus, we observe an amazing property of the Bayes factor:
• without knowing the prior and posterior probabilities of both hypotheses, we can quantitatively compare their odds.
120
121. Interpretation of the credibility of Bayes factors BF10 and BF01

BF01 (respectively BF10) | Evidence in favor of hypothesis H0 against H1 (respectively H1 against H0)
>10 000     | Convincing
100 – 1 000 | Very strong
30 – 100    | Strong
10 – 30     | Moderate
3 – 10      | Weak
1 – 3       | Negligible
121
124. Comparison of the frequentist and Bayesian results
• Testing the homogeneity (independence) of the Arbuthnot data results in:
• Pval ≈ 10⁻⁸
• BF01 = 8·10¹¹⁷
• From the frequentist point of view the heterogeneity of the Arbuthnot data is statistically highly significant.
• From the Bayesian point of view the conclusion is diametrically opposite:
• obtaining such data is 8·10¹¹⁷ times more likely under the hypothesis H0 of their homogeneity than under the alternative hypothesis H1 of their heterogeneity.
• Or:
• the posterior odds in favor of the null hypothesis against the alternative hypothesis are 8·10¹¹⁷ times higher than their prior odds.
124
125. Bayes Factor, online program Bayes Factor Calculators
http://pcl.missouri.edu/bayesfactor
125
126. Output
• BF01 = 0.00018 and
• BF10 = 1/BF01 = 5555.5
• It is 5555 times more likely to obtain the value of the Student t-test statistic t = 5.58 with df = 34 under H1: Δ ≠ 0 than under H0: Δ = 0.
• According to the verbal scale, such a value of BF10 is interpreted as convincing evidence in favor of H1 against H0.
126
127. Summary
Statistical evidence (compact records: lower 95% limit _ point estimate ^ upper 95% limit)
• AUC θ = 0.72_0.89^0.96
• StAUC δN = 0.8_1.7^2.5
• StES dC = 1.1_1.9^2.7
• ΔHL = 38_56^74 IU/mL
• Δ = 33_52^71 IU/mL
• BF10 = 5555
• Pval = 3·10⁻⁶
Statistical predictions
• 95% prediction intervals:
• from 0.8 to 3.1 (StES dC)
• from 25 to 79 IU/mL (Δ)
• from 3·10⁻¹¹ to 0.010 (Pval)
• Probability of replication:
• Psrep = 0.99
127
129. Castoldi E., Rosing J. Thrombin generation tests. Thrombosis
Research, 2011; 127(Suppl. 3): S21–S25
• Parameters of the
thrombin generation curve:
• LT – lag time, min
• TTP – time to peak, min
• PT – peak thrombin, nM
• ETP – endogenous
thrombin potential,
nM∙min
• V – maximum velocity of
thrombin generation,
V = PT / (TTP – LT), nM/min
129
130. Estimation of the parameters of TGT, results of traditional NHST and effect sizes. n1 = 40, n2 = 53

Compact records: lower 95% CI limit _ point estimate ^ upper 95% CI limit.

          | LT, min          | ETP, nM·min    | TTP, min          | PT, nM          | V, nM/min
RI        | 8.0 – 27.4       | 1290 – 2480    | 17 – 41           | 85 – 192        | 5.3 – 25.4
M1        | 14_16^17         | 1820_1900^1990 | 25_27^28          | 125_134^144     | 11_13^15
M2        | 15_17^19         | 1640_1740^1830 | 29_31^33          | 100_106^113     | 7.1_7.9^8.7
Pval      | 0.37             | 0.015          | 0.0012            | 3·10⁻⁶          | 10⁻⁸
Effect sizes
ΔHL       | -3.3_-1.0^1.2    | 52_188^323     | -7.3_-4.6^-1.8    | 14_28^40        | 3.3_4.6^6.0
ES Δ      | -3.4_-1.3^0.7    | 43_167^294     | -7.1_-4.5^-2.1    | 17_28^39        | 3.4_5.1^6.7
AUC θ     | 0.44_0.55^0.67   | 0.55_0.67^0.77 | 0.68_0.70^0.79    | 0.66_0.77^0.85  | 0.73_0.83^0.90
StAUC δN  | -0.61_-0.20^0.22 | 0.19_0.63^1.04 | -1.13_-0.72^-0.28 | 0.53_1.06^1.48  | 0.89_1.36^1.80
StES dC   | -0.66_-0.25^0.16 | 0.10_0.52^0.94 | -1.15_-0.73^-0.30 | 0.65_1.09^1.53  | 0.89_1.35^1.80

n1 and n2 – sample sizes of the control and CAD groups; RI – nonparametric reference interval; М1 and М2 – sample means; Pval – P-value; ΔHL – Hodges–Lehmann shift estimate; Δ = М1 – М2 – effect size in real units; θ – area under the ROC curve; δN and dC – Newcombe’s and Cohen’s standardized effect sizes.
Programs: Reference Value Advisor, PAST, StatXact, GENERALIZED.xls, ESCI-JSMS.xls, LePrep.
130
131. Informativeness of the TGT parameters
53 CHD patients and 40 people without clinical manifestations of coronary heart disease (data by Berezovskaya G.A.)
dC – standardized Cohen’s effect size; Pval – P-value; BF10 – Bayes factor for comparison of odds in favor of H1 versus H0; Psrep – probability of a statistically significant effect of the same sign (direction) in a replication; Power – “achieved” power; n1 = n2 – minimum sample sizes for replication. Programs: ESCI-JSMS.xls, Online BF Calculator (http://pcl.missouri.edu/bayesfactor), LePrep, G*Power
131
132. Syndrome of statistical leniency and
credulity
Fallacies and Confusions of Null
Hypothesis Significance Testing
(NHST) and P-value
“What does a statistician call it when the
heads of 10 rats are cut off and 1 survives?
- Nonsignificant.”
132
133. P-value
• P-value is the most controversial concept in statistics.
• Many textbook authors and the majority of experimenters do not
understand what its final product – a P-value – actually means
(Gigerenzer, 1988).
• The concept of a P-value lies so far from the intuitive
understanding that no ordinary person can hold it in memory.
• “We rely too much on P values, and most of us really don’t have a clue what they mean.”
• Lai J., Fidler F., Cumming G. Subjective p intervals: Researchers
underestimate the variability of p values over replication.
Methodology: European Journal of Research Methods for the
Behavioral and Social Sciences, 2012; 8: 51-62.
133
134. What is P-value? What is null hypothesis H0?
• A P-value is the probability of observing data as or more extreme than the actual outcome when the null hypothesis is true.
• When testing a null hypothesis we transform the data into a test statistic.
• Then the P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
• Usually the null hypothesis is a statement of “no effect” or “no difference”.
• The null hypothesis is often denoted H0 (read “H-nought”).
• The Null Hypothesis is often denoted H0 (read “H-nought”)
134
135. Null Hypothesis Significance Testing Waltz
• The P value is at the heart of the most common approach to data
analysis – Null Hypothesis Significance Testing (NHST).
• Think of NHST as a waltz with three steps:
• (i) State a null hypothesis: that is, there is no effect.
• (ii) Calculate the p value, which is the probability of getting results like ours, or more extreme, if the null hypothesis is true.
• (iii) If Pval is sufficiently small, reject the null hypothesis and sound the trumpets: our effect is not zero, it’s statistically significant!
• Generations of students have been inducted into the rituals of .05 meaning “significant” and .01 “highly significant”.
135
136. P-value, Pval
• Thus, by definition, the P-value (Pval) is the conditional probability of obtaining the observed value of the difference (dobs) and all other larger or less probable values, given that the null hypothesis is true:
• Pval = P(D ≥ dobs | H0).
• In terms of statistical hypothesis testing, the P-value is:
• the probability of obtaining the modulus of the observed value |tobs| of the test statistic T and all other larger or less probable values (i.e., the values deviating even more from the expected one),
• under the assumption that the null hypothesis H0 is true:
• Pval = P(|T| ≥ |tobs| | H0).
• Note that the “less probable values” are not observed.
• We infer them out of all possible values in the frame of the chosen (null) model.
136
137. • A P-value is usually interpreted as a measure of how much evidence we have against the null hypothesis – how strongly the observed data contradict the null hypothesis.
• The null hypothesis, traditionally represented by the symbol H0, represents the hypothesis of no change or no effect.
• The smaller the P-value, the more (stronger) evidence we have against H0.
137
138. What is a Test Statistic?
• A test statistic is a statistic used for testing a given null hypothesis.
• Example: the Student t-test statistic:
• t̃ = (M̃1 – M̃2) / s̃(M̃1 – M̃2),  df = n1 + n2 – 2
• In such a case, testing the null hypothesis H0 of the equality of two independent means (H0: M1 – M2 = 0) is reduced to testing the null hypothesis that t = 0.
• When this hypothesis is true, the distribution of the t-statistic is known.
• Namely, it is the Student t-distribution.
• This distribution has a single parameter called the degrees of freedom, df.
138
139. William Sealy Gosset (June 13, 1876–October 16, 1937) is famous as
a statistician, best known by his pen name Student and for his work
on Student's t-distribution.
139
140. n1 = 5, n2 = 7, df = 10, t = 1.5
P = 0.16 – the difference is statistically nonsignificant
140
http://ftparmy.com/103097-decision-visualizer.html
141. n1 = 5, n2 = 7, df = 10, t = 3.0
P = 0.013 – the difference is statistically significant at the significance level α = 0.05, but not at 0.01
141
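The P-values on these two slides can be checked by brute force: simulate many experiments of the same sizes under H0 and count how often |t| reaches the observed value (a Monte Carlo sketch, not the exact t-distribution calculation the visualizer performs):

```python
import math
import random

# Monte Carlo check of the two-sided P-value for Student's t with
# n1 = 5, n2 = 7 (df = 10): simulate experiments under H0 and count
# how often |t| is at least as extreme as the observed t = 3.0.
def t_stat(x, y):
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    sp2 = (ss1 + ss2) / (n1 + n2 - 2)  # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

random.seed(7)
t_obs, n1, n2, sims = 3.0, 5, 7, 50_000
extreme = sum(
    abs(t_stat([random.gauss(0, 1) for _ in range(n1)],
               [random.gauss(0, 1) for _ in range(n2)])) >= t_obs
    for _ in range(sims)
)
print(extreme / sims)  # close to the slide's exact value P = 0.013
```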
142. Searching the threshold for the P-value: is it possible?
• When a small P-value is observed, an intuitive (extra-statistical) temptation appears to reject the null hypothesis H0.
• However, there is no statistical rule for which P-value should be regarded as sufficiently small to reject H0 safely.
• Once again, such a decision is extra-statistical.
• In practice, the decision to reject or accept H0 must depend on circumstances.
• In each specific (concrete) situation the researcher should make the choice for himself or herself.
142
143. Traditional interpretation of the P-values (Pval) (and their Michelin star scale)

P-value (Pval) | Statistical significance | Michelin stars
> 0.05         | Nonsignificant           |
0.05 – 0.01    | Moderately significant   | *
0.01 – 0.001   | Significant              | **
0.001 – 0.0001 | Highly significant       | ***
< 0.0001       | Extremely significant    | ****

The four-star value 0.0001 was introduced recently by Harvey J. Motulsky:
http://www.graphpad.com/guides/prism/6/statistics/index.htm?interpreting_a_small_p_value_from_an_unpaired_t_test.htm
143
144. Tyranny and/or hypnosis of the figures
0.05 and 95%
• Unfortunately, the significance level α = 0.05 is most commonly used as a threshold.
• Too often, overcoming this threshold level (Pval < 0.05) in just a single experiment is regarded as sufficient for the decision to reject the null hypothesis and to conclude that the observed effect is statistically significant.
144
145. Andrey Nikolaevich Kolmogorov
(25 April 1903 – 20 October 1987)
• In statistics, the recommended
significance level varies from
0.05 for preliminary orientation
experiments to 0.001 for
important ultimate conclusions,
but the attainable reliability of
probability conclusions is often
much higher.
• Thus, the principal conclusions of
statistical physics are based on
the neglect of probabilities of an
order less than 10−10.
• (1951)
145
http://www.encyclopediaofmath.org/index.php/Probability
146. Sterne J.A.C., Davey Smith G.
Sifting the evidence –
what’s wrong with significance tests?
BMJ, 2001; 322: 227-231. Cited by 763
• Presently, several other authors echo Kolmogorov:
• a P-value close to 0.05 is not strong evidence against the null hypothesis;
• only Pval < 0.001 should be regarded as strong evidence against H0;
• in addition to P-values, it is strongly recommended to present confidence intervals for the effect size.
146
147. “Flexible” P-values
• In fact no scientific worker has a fixed level of
significance at which from year to year, and
in all circumstances, he rejects hypotheses;
• he rather gives his mind to each particular
case in the light of his evidence and his ideas.
•
• Fisher R. A. Statistical Methods and Scientific Inference,
1956, pages 41-42.
147
149. Warning
• Usually the P-value is interpreted as a measure of the evidence given by the available data against the null hypothesis.
• Strictly speaking, however, it is not a measure in the mathematical sense.
• It does not possess the additivity property, and moreover,
• it does not satisfy two of the most important principles of statistical theory – the Likelihood Principle and the P-postulate.
149
150. Likelihood Principle
• Loosely stated, the Likelihood Principle says that statistical analysis must operate with those and only those data which are actually obtained in the experiment.
• However, for the calculation of the P-value (as follows from its definition), not only the observed experimental data are used, but also all other, less probable data, which were not in fact observed.
150
151. P-postulate
• To serve as a real and adequate measure of the statistical evidence, the P-value should satisfy the simple rule (postulate) according to which the same P-values have to present equal evidence against the null hypothesis.
• This rule is called the “P-postulate”.
• Obviously, even this minimal requirement is not met.
• Wagenmakers E.-J. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 2007; 14(5): 779-804.
151
152. P-postulate
• Intuitively one can recognize that Pval = 0.01 in an experiment with 10 observations does not demonstrate the same evidential strength as Pval = 0.01 in an experiment with 300 observations.
• Equally, Pval = 0.001 obtained in one experiment and Pval = 0.01 in another does not imply that the effect observed in the first experiment is 10 times more evidential than in the second.
152
153. P-value is the realization of corresponding
random variable P*
• The P-value is an observed value of the corresponding random variable P*.
• When the null hypothesis H0 is true, Pval has the so-called (continuous) standard uniform distribution, that is, the uniform distribution on the interval [0; 1]:
• P* ~ Uni[0; 1].
153
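The uniformity of P* under H0 can be demonstrated by simulation (a sketch using a one-sample z-test with known σ = 1; the sample size and number of runs are arbitrary illustrative choices):

```python
import random
import statistics

# Under H0 the P-value is itself a random variable, distributed uniformly
# on [0, 1]. Here: a one-sample two-sided z-test on N(0, 1) data, repeated
# many times; about 5% of the P-values fall below 0.05, 10% below 0.10, etc.
nd = statistics.NormalDist()
random.seed(42)
n, runs = 30, 4000
pvals = []
for _ in range(runs):
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = statistics.fmean(sample) * n ** 0.5  # standard error = 1/sqrt(n)
    pvals.append(2 * (1 - nd.cdf(abs(z))))

print(round(sum(p < 0.05 for p in pvals) / runs, 3))  # near 0.05
```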
154. P-value distributions
Pike N., free spreadsheet FDR.xls http://www.webcitation.org/5rxSzU7qL
Δ = μ1 – μ2 = 0: χ² = 390.6, df = 400, Pval = 0.62
Δ = μ1 – μ2 = 10: χ² = 1348.8, df = 400, Pval = 4·10⁻¹⁰¹
[Two histograms of the frequency distribution of p-values (observed vs. expected frequency, bins of width 0.05 from 0 to 1), one for each value of Δ.]
These are histograms obtained with 200 simulations.
154
155. Reproducibility and predictive ability of P-values and 95%
confidence intervals (n = 32). Dance of Pval
Free program “ESCI PPS p intervals” http://www.latrobe.edu.au/psy/esci/.
Cumming G. Replication and p intervals: p values predict the future only vaguely, but
confidence intervals do much better. Persp. Psychol. Sci., 2008; 3: 286-300.
155
157. Reproducibility of the P-value when comparing healthy and IUGR
groups at α = 0.05 and (1 – α) = 0.95
157
The observed Pval = 3·10⁻⁶. The 95% prediction interval for it runs from the extremely small 3·10⁻¹¹ to the moderate 0.01.
158. Popular temptation
• It is conventional to interpret the quintessence of traditional (frequentist) conclusions from statistical hypothesis testing as:
• the smaller the P-value, the stronger the evidence (presented by the data) against the null hypothesis H0, and the bigger the reason to doubt H0.
• Hence, whether intentionally or not (and seemingly rather naturally), the temptation appears to interpret the P-value as a probability of the null hypothesis.
158
159. Popular delusion
• The P-value is not a probability of the null hypothesis!
• The P-value is calculated under the assumption that the null hypothesis H0 is true:
• Pval = P(|D| ≥ |dobs| | H0).
• Hence, the P-value cannot be a probability of the null hypothesis:
• P{D|H0} ≠ P{H0|D}
• For a collection of other fallacies about the P-value see, e.g.:
• http://en.wikipedia.org/wiki/P-value
• Goodman S. A dirty dozen: Twelve P-value misconceptions. Semin. Hematol., 2008; 45: 135-140
159
160. Calibration of P-values
• Vovk V. G. A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society. Series B (Methodological), 1993; 55(2): 317-351.
• Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71. Cited by 321
• When Pval < 1/e:
• BF01 ≥ –e · Pval · ln(Pval)
• P(H0|Dobs) ≥ BF01 / (1 + BF01)
• – the lower bound for the (posterior) probability of the null hypothesis H0.
160
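The Sellke–Bayarri–Berger calibration is easy to evaluate; the sketch below reproduces the percentages quoted on the next slide (≥ 29%, ≥ 11%, ≥ 1.8%), assuming equal prior odds for H0 and H1:

```python
import math

# Sellke-Bayarri-Berger calibration: for Pval < 1/e, the Bayes factor in
# favor of H0 is bounded below by -e * p * ln(p), which in turn bounds
# the posterior probability of H0 (with equal prior odds).
def bf01_lower_bound(p):
    return -math.e * p * math.log(p)

def p_h0_lower_bound(p):
    bf = bf01_lower_bound(p)
    return bf / (1.0 + bf)

for p in (0.05, 0.01, 0.001):
    print(p, round(p_h0_lower_bound(p), 3))
# 0.05 -> 0.289, 0.01 -> 0.111, 0.001 -> 0.018
```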
161. The “price” of P-values

Observed P-value | Upper limit of the 80% interval for Pval | Lower limit for the probability of the null hypothesis, P(H0) | Upper limit for the probability of replication, Prepr
0.05  | 0.44 | ≥ 29%  | < 50%
0.01  | 0.22 | ≥ 11%  | < 73%
0.001 | 0.07 | ≥ 1.8% | < 90%

Sellke T., Bayarri M.J., Berger J.O. Calibration of p values for testing precise null hypotheses. The American Statistician, 2001; 55(1): 62-71.
Goodman S.N. A comment on replication, p-values and evidence. Statistics in Medicine, 1992; 11: 875-879.
Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 2008; 3(4): 286-300.
161
162. The problem with p values: how significant are they, really?
November 12th, 2013 Geoff Cumming
http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html
A p value of 0.05 has been the default ‘significance’ threshold for nearly 90
years … but is that standard too weak? Martin_Heigan
162
163. Funny metaphor
• “Perhaps p values are like mosquitos.
• They have an evolutionary niche somewhere
and no amount of scratching, swatting, or
spraying will dislodge them”.
• Campbell J.P. Editorial: Some remarks from
the outgoing editor. Journal of Applied
Psychology, 1982; 67: 691-700
163
164. • The usefulness of P-values is quite limited,
and we continue to suggest that these
procedures be euthanized.
• Anderson D.R., Burnham K.P. Avoiding pitfalls
when using information-theoretic methods.
The Journal of Wildlife Management, 2002;
66(3): 912-918.
164
165. On seduction:
• Yes, the P-value can seduce.
• It is sexy and we can be blinded.
• A significant P-value can perplex our thinking: we simply get too excited and forget to look at the actual effect size.
• Does that P < 0.05 really matter when the effect size is small?
• The study which concluded that the “internet is changing the dynamics and outcomes of marriage itself” can serve as an example.
• This study showed that those who meet their spouses online are less likely to divorce and more likely to have high marital satisfaction (of course with very significant P-values).
• However, the effect size was very small: happiness, for example, barely moved from 5.48 to 5.64.
• So, do not sign up for match.com thinking that you may be happier with your spouse.
165
170. Revised standards for statistical evidence
• A simple strategy for improving the replicability of scientific
research includes the following steps:
• (i) Associate statistically significant test results with P values
that are less than 0.005.
• (ii) Associate highly significant test results with P values that
are less than 0.001 (cf. Kolmogorov) and even 0.0001.
• (iii) Report the Bayes factor in favor of the alternative
hypothesis and the default alternative hypothesis that was
tested.
170
171. Revised standards for statistical evidence
• (iv) BF10 > 30 or even > 100 should be
considered strong and convincing evidence
in favor of the alternative hypothesis H1.
• The proposed modifications of common
standards of evidence are intended to reduce
the rate of non-reproducibility of scientific
results by a factor of 5 or greater.
• Larger sample sizes will certainly be
required.
171
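Step (iii) can be illustrated with a minimal, self-contained example. The data (65 successes in 100 trials) and the uniform default prior are assumptions chosen for illustration; for this simple binomial model the Bayes factor has a closed form.

```python
# Sketch of reporting a Bayes factor next to a P-value. Hypothetical data:
# k successes in n Bernoulli trials; H0: p = 0.5 vs H1: p ~ Uniform(0, 1).
from math import comb

def bf10_binomial(k, n):
    """Exact Bayes factor BF10 for a binomial test with a uniform prior."""
    m1 = 1 / (n + 1)            # marginal likelihood of the data under H1
    m0 = comb(n, k) * 0.5 ** n  # likelihood of the data under H0
    return m1 / m0

print(bf10_binomial(65, 100))
```

For 65/100 the two-sided P-value is already below the revised 0.005 threshold, yet BF10 comes out near 11: noticeable evidence, but still short of the BF10 > 30 standard proposed above.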
172. Minimum sizes for two independent samples with non-overlapping
values required to achieve the lower confidence
limits for two measures of the effect size: AUCL and SESL

                      Confidence level
  AUCL    SESL     0.95     0.99     0.999
  0.80    1.2        10       17        27
  0.90    1.8        21       35        56
  0.95    2.3        40       69       111
  0.99    3.3       194      334       545
  0.999   4.4      1923     3320      5418
Extrapolated using Newcombe’s free spreadsheet VISUALISETHETA.xls
http://medicine.cf.ac.uk/primary-care-public-health/resources/
172
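Assuming two normal populations with equal variance, the two effect-size measures in the table above are linked by AUC = Φ(SES/√2), where Φ is the standard-normal CDF. The sketch below reproduces the standardized-effect-size column from the AUC column.

```python
# Convert an AUC lower limit into the corresponding standardized effect
# size (Cohen's d), assuming the binormal equal-variance model.
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def ses_from_auc(auc):
    """Standardized effect size d such that AUC = Phi(d / sqrt(2))."""
    return sqrt(2) * std_normal.inv_cdf(auc)

for auc in (0.80, 0.90, 0.95, 0.99, 0.999):
    print(f"AUCL = {auc}: SESL = {ses_from_auc(auc):.1f}")
```

Rounded to one decimal this gives 1.2, 1.8, 2.3, 3.3 and 4.4, matching the effect-size values in the table.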
173. John Wilder Tukey (April 16, 1915 – July 26, 2000)
• Any research should have at
least two stages.
• First stage – an exploratory
(preliminary, pilot,
hypothesis-generating)
study.
• Second stage – a confirmatory
study.
• The second stage is designed
on the basis of the results
obtained at the first stage.
173
174. Conclusions
• Poor reproducibility of experimental results has
become a systemic problem in biomedicine.
• One of the main reasons for this is inadequate
statistical analysis.
• Statistical analysis should be comprehensive,
harmonizing statistical evidence and predictions as
well as frequentist and Bayesian approaches.
• It is insufficient to carry out null hypothesis
significance testing (NHST) and report only P-values.
174
175. Conclusions (continued)
• Statistical significance does not mean clinical
importance.
• Effect sizes with confidence and prediction intervals
should be reported.
• Experiments and/or observations should be repeated
many times and their agreement should be
investigated.
• The best way is to repeat the experiments
independently in different laboratories (in different
countries).
175
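The recommendation to report both interval types can be sketched as follows. The data are hypothetical, and the normal quantile is used as an approximation (with a sample this small, t-quantiles would be more appropriate):

```python
# Report a confidence interval for the mean AND a prediction interval for a
# single future observation. Hypothetical data; normal approximation.
from math import sqrt
from statistics import NormalDist, mean, stdev

data = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.4, 5.2]  # hypothetical measurements
n, m, s = len(data), mean(data), stdev(data)
z = NormalDist().inv_cdf(0.975)                  # two-sided 95% quantile

ci = (m - z * s / sqrt(n), m + z * s / sqrt(n))  # 95% CI for the mean
pi = (m - z * s * sqrt(1 + 1 / n),               # 95% PI for a new observation
      m + z * s * sqrt(1 + 1 / n))
print(f"mean = {m:.2f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is necessarily wider than the confidence interval: the former bounds a single future observation, the latter only the mean.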
176. Editorial policies
• Journal editors and reviewers should not accept papers for
publication if they report the results of a single experiment
without results of an independent replication.
• Experts in statistics should be included on editorial
boards.
• Reviewers should be obliged to re-examine all the
calculations.
• For this reason, free access to the initial ("raw") data
should be ensured.
• Transparency and openness are cornerstones of the
scientific method.
176
177. Francis Galton, 1901
• “I have begun to think that no one ought to
publish biometric results, without lodging a
well-arranged and well-bound copy of his data in some
place where it should be accessible, under reasonable
restrictions, to those who desire to verify his work.”
• Galton F. Biometry. Biometrika, 1901; 1(1): 7-10.
• Galton's suggestion of a data store was later
revived by Professor Julian Huxley, with a
proposal to store the measurements
in the British Museum of Natural History.
177
178. • One of the most common
temptations, and the one
that leads to the greatest
disasters, is the
temptation that comes
with the words: "Everybody
does it."
• Leo Tolstoy
178
180. 180
Lesaffre E., Lawson A. Bayesian
Biostatistics. 2012. Wiley, 534 p.
Broemeling L.D. Bayesian Biostatistics
and Diagnostic Medicine. 2007. CRC
Press, 216 p.
181. 181
Kruschke J. Doing Bayesian Data Analysis. 2010. Academic Press, 672 p.
182. Downey A.B. Think Bayes: Bayesian Statistics
Made Simple. Version 1.0.1, 2012. Green Tea
Press: Needham, Massachusetts, 195 p.
182
Albert J. Bayesian Computation with R.
Series: Use R! 2nd ed. 2009, Springer,
299 p.
184. Commercial Software
• StatXact http://www.cytel.com/software-solutions/statxact
• XLStat http://www.xlstat.com
• MedCalc https://www.medcalc.org/
• GraphPad Prism http://www.graphpad.com/
• StatsDirect http://www.statsdirect.com/
• Expensive monsters:
• SAS http://www.sas.com/en_us/home.html
• IBM SPSS http://www-01.ibm.com/software/analytics/spss/
• STATISTICA http://www.statsoft.com/
• John C. Pezzullo’s comprehensive list of statistical software:
http://statpages.org/
184
185. Thank you for your attention
Slides are freely available to all
Nikita N. Khromov-Borisov
Department of Physics, Mathematics and Informatics
Pavlov First Saint Petersburg State Medical University
Nikita.KhromovBorisov@gmail.com
+7-952-204-89-49; +7-921-449-29-05
http://independent.academia.edu/NikitaKhromovBorisov
185