Statistics for Anaesthesiologists covers basic to intermediate statistics for researchers, focusing on the study designs and tests commonly used in anaesthesiology research.
2. Recommended Software
• RStudio (GUI) with R, R Commander, and R Commander plugins like EZR (free, cross-platform, powerful programming paradigm)
• G*Power (Free, for power analysis)
• SPSS (Commercial, expensive)
• SOFA (Free, basic)
• GraphPad (graphpad.com)
• Spreadsheet software like MS Excel for initial data entry (export as CSV file format)
3. Data Types
• Nominal or Categorical data
• Ordinal data
• Interval data
• Ratio data
4. Data Types
Nominal: categorical data and numbers used simply as identifiers or names. Ex: social security (Aadhar) number
Ordinal: an ordered series of relationships or rank order. Ex: first, second, or third place in a contest; Likert scale
Interval: a scale that represents quantity and has equal units, but for which zero is simply an additional point of measurement. Ex: Fahrenheit scale
Ratio: similar to the interval scale, but also has an absolute zero (no numbers exist below zero). Ex: height, weight
7. Reporting data types
OK to compute                     | Nominal | Ordinal | Interval | Ratio
Frequency distribution            | Yes     | Yes     | Yes      | Yes
Median, percentiles               | No      | Yes     | Yes      | Yes
Mean, SD, SE of mean              | No      | No      | Yes      | Yes
Ratio or coefficient of variation | No      | No      | No       | Yes
8. Tests for normality of data
• Kolmogorov-Smirnov Test – inferior to the others; relies on goodness of fit of a sample to a normal distribution curve; avoid its use!
• Shapiro-Wilk Test – better, more specific, more powerful especially with small sample sizes; available in R Commander and SPSS (under menu Analyze > Descriptive Statistics > Explore)
9. Tests for normality of data
• D'Agostino-Pearson test
• Anderson-Darling test
• Q-Q (Quantile-Quantile) Plot – visual guide
• Histogram – inferior, look for Skew or Kurtosis
• Density Plot – better, look for Skew or Kurtosis
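As a minimal sketch of the Shapiro-Wilk test using Python's SciPy (the deck recommends R/EZR and SPSS, where the equivalent is shapiro.test or the Explore menu; the sample data here are simulated, not from the deck):

```python
# Shapiro-Wilk normality check on a simulated normal and a clearly
# skewed (exponential) sample; p < 0.05 rejects the null hypothesis
# that the data were sampled from a Gaussian distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=100, scale=15, size=25)  # simulated normal data
skewed_sample = rng.exponential(scale=2.0, size=25)     # clearly non-normal

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    w, p = stats.shapiro(sample)
    print(f"{name}: W = {w:.3f}, p = {p:.4f}")
```

For small samples (n < 30) it is worth pairing the test with a Q-Q or density plot, as the slides advise, rather than relying on the p-value alone.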
10. Choosing a statistical test
• Make sure you have adequate sample size (power)
to reject null hypothesis (Ho)
• Check whether it is a one-tailed (only < or > μ, one direction) or two-tailed comparison (≠ μ, test significance on both sides) – in general use two-tailed
• Look at your data types – ordinal, interval etc
• Do descriptive statistics testing
11. Choosing a statistical test
• Test normality of data – tests and visual
comparison (especially when n<30)
• Decide between parametric and non-parametric tests
• Look at the number of groups (2 or more) – t-test (if n < 30), z-test (n > 30), or ANOVA (F-test), or their non-parametric equivalents
• For 2 or more groups, check whether the data are paired or independent
14. What is p-value?
• The p-value is a probability computed from the test statistic's sampling distribution under the null hypothesis (the null distribution – we first assume Ho is true!)
• The (left-tailed) p-value is the quantile of the value of the
test statistic, the right-tailed p-value is one minus the
quantile, while the two-tailed p-value is twice whichever of
these is smaller.
• The p-value is NOT the probability that the null hypothesis
is true, nor is it the probability that the alternative
hypothesis is false
15. What is p-value?
• p-value is NOT the same as α !
• p-value is NOT the probability of rejecting the null hypothesis (we reject Ho when the p-value is less than the significance level, α)
• p-value is computed while α is set by experimental
design
• If Ho is true, α is the probability of rejecting null
hypothesis
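The last point can be checked with a small simulation, sketched here in Python/SciPy (the trial counts and sample sizes are arbitrary choices for illustration): when Ho is true, roughly a fraction α of experiments reject it.

```python
# Simulate many experiments where the null hypothesis is TRUE
# (both groups drawn from the same distribution) and count how
# often an unpaired t-test rejects at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, rejections = 0.05, 2000, 0

for _ in range(trials):
    a = rng.normal(size=20)
    b = rng.normal(size=20)  # same distribution: Ho is true
    _, p = stats.ttest_ind(a, b)
    rejections += p < alpha

print(f"Rejection rate under Ho: {rejections / trials:.3f} (close to {alpha})")
```

This is exactly the sense in which α, not the p-value, is the probability of rejecting Ho when Ho is true.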
21. CHI SQUARE OR FISHER’S EXACT TEST?
• In the days before computers were readily available, people analyzed contingency tables by hand, or using a calculator, using chi-square tests
• The test works by computing the expected values for each cell if the relative risk (or odds ratio) were 1.0. It then combines the discrepancies between observed and expected values into a chi-square statistic from which a p-value is computed
25. CHI SQUARE OR FISHER’S EXACT TEST?
• The chi-square test is only an approximation!
• Yates' continuity correction is designed to make it better, but it overcorrects and so gives a p-value that is too large (too 'conservative')
• With large sample sizes, Yates' correction makes
little difference, and the chi-square test works very
well. With small sample sizes, chi-square is not
accurate, with or without Yates' correction
26. CHI SQUARE OR FISHER’S EXACT TEST?
• Fisher's exact test, as its name implies, always gives an
exact P value and works fine with small sample sizes
• Fisher's test (unlike chi-square) is very hard to calculate by hand (so manual use is generally limited to 2 x 2 or 2 x n tables), but it is easy to compute with a computer
• Advisable to use when any cell of the table has expected
value < 5
27. CHI SQUARE OR FISHER’S EXACT TEST?
• Most statistics books advise using it instead of the chi-square test (especially for small samples; chi-square becomes acceptable for large sample sizes)
• Fisher's exact test can be used for an m x n table
• Some have criticized it as the exact answer to the wrong question!
30. ANOVA (ANALYSIS OF VARIANCE)
• The one-way analysis of variance (ANOVA) is used to
determine whether there are any significant differences
between the means of two or more independent
(unrelated) groups
• For example, to understand if exam performance (dependent variable) differed based on test anxiety levels amongst students, dividing students into three independent groups (e.g., low-, medium- and high-stressed students)
33. ANOVA (ANALYSIS OF VARIANCE)
• It is an omnibus test statistic and cannot tell you which
specific groups were significantly different from each
other; it only tells you that at least two groups were
different.
• Since you may have ≥3 groups in your study design,
determining which of these groups differ from each other
is done using a Post-hoc test (Tukey’s test is preferred)
which gives a Multiple comparisons table.
34. ANOVA (ANALYSIS OF VARIANCE)
• To apply ANOVA, 6 assumptions must be met:
• Assumption #1: Your dependent variable should be measured at the interval or ratio level (i.e., continuous)
• Assumption #2: Your independent variable should consist of two or more categorical, independent groups; it can be used for just two groups (but an independent-samples t-test is more commonly used for two groups)
35. ANOVA (ANALYSIS OF VARIANCE)
• Assumption #3: You should have independence of observations,
which means that there is no relationship between the
observations in each group or between the groups themselves.
• Assumption #4: There should be no significant outliers.
• Assumption #5: Your dependent variable should be approximately
normally distributed for each category of the independent variable
(but it is quite "robust" to violations of normality)
• Assumption #6: There needs to be homogeneity of variances. (in
SPSS using Levene's test for homogeneity of variances)
36. ANOVA (ANALYSIS OF VARIANCE) METHOD
• ANOVA calculates the mean for each of the groups - the
Group Means.
• It calculates the mean for all the groups combined - the
Overall Mean.
• Then it calculates, within each group, the total deviation of each individual's score from the Group Mean – the Within-Group (Error) Variation.
37. ANOVA (ANALYSIS OF VARIANCE) METHOD
• Next, it calculates the deviation of each Group Mean
from the Overall Mean - Between Group Variation.
• Finally, ANOVA produces the F statistic, which is the ratio of the Between-Group Variation to the Within-Group (Error) Variation.
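The method above reduces, in software, to a single call. Here is a minimal sketch with SciPy on three invented groups (e.g., the low/medium/high anxiety example from earlier):

```python
# One-way ANOVA on three illustrative groups (values invented for the sketch).
# F = Between-Group Variation / Within-Group (Error) Variation.
from scipy import stats

low    = [85, 88, 90, 79, 92]
medium = [78, 74, 80, 82, 76]
high   = [65, 70, 60, 72, 68]

f_stat, p_value = stats.f_oneway(low, medium, high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

Remember the test is omnibus: a significant F only says at least two group means differ, so a post-hoc test (e.g., Tukey's) is still needed to locate the difference.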
39. TWO-WAY ANOVA DESIGN
Treatment/Conditi
on (Independent)
Levels (Independent Variable)
Group3
S6 DV
S11 DV
S2 DV
S7 DV
S12 DV
S3 DV
S8 DV
S13 DV
S4 DV
S9 DV
S14 DV
S5 DV
S10 DV
S15 DV
S16 DV
S21 DV
S26DV
S17 DV
CONDITION2
Group2
S1 DV
CONDITION1
Group1
S22 DV
S27 DV
S18 DV
S23 DV
S28 DV
S19 DV
S24 DV
S29 DV
S20 DV
S25 DV
S30 DV
40. ANCOVA (ANALYSIS OF COVARIANCE)
• An extension of the one-way ANOVA used to determine whether
there are any significant differences between the means of two or
more independent (unrelated) groups (specifically, the adjusted
means) by adjusting for a third or confounding variable
• The third variable (known as a "covariate" or "confounding variable") is one that you want to statistically control because it may be affecting the results of the ANOVA
• In each of the two groups we can compute the correlation coefficient between the third variable and the dependent variable
41. REPEATED MEASURES ANOVA
• A repeated measures ANOVA is used when you have a
single group on which you have measured something a few
times
• For example, you may have a test of understanding of
Classes. You give this test at the beginning of the topic, at
the end of the topic and then at the end of the subject
• You would use a one-way repeated measures ANOVA to see
if student performance on the test changed over time
42. REPEATED MEASURES ANOVA
• Repeated measures ANOVA is the equivalent of the one-way
ANOVA, but for related, not independent groups, and is the
extension of the dependent t-test
• A repeated measures ANOVA is also referred to as a within-subjects ANOVA or an ANOVA for correlated samples
• The major advantage with running a repeated measures ANOVA
over an independent ANOVA is that the test is generally much
more powerful. This particular advantage is achieved by the
reduction in variability (due to differences between subjects) during
the performance of the test
46. Variable type & CHOOSING A Test
Explanatory Variable | Response Variable | Methods
Categorical          | Categorical       | Contingency Tables
Categorical          | Quantitative      | ANOVA
Quantitative         | Quantitative      | Regression
47. ANOVA – WHY NOT JUST USE t-TESTS?
• Multiple t-tests are not the answer because as the number of groups grows, the number of needed pair comparisons grows quickly. For example, with 7 groups there are 21 pairs. If we test 21 pairs we should not be surprised to observe things that happen only 5% of the time. Thus in 21 pairings, a p-value = 0.05 for one pair cannot be considered significant.
• Our level of significance α has to be divided for multiple comparisons (Ex: for the above it becomes α/21)
• ANOVA puts all the data into one number (F) and gives us one p-value for the null hypothesis.
48. ANOVA – WHY NOT JUST USE t-TESTS?
From eBook: Research skills for Psychology Majors by
William Gabrenya
50. Likert ITEM & LIKERT Scale
• A Likert scale consists of multiple Likert-type items
• Likert-type scales (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?")
• These represent ordinal data (order and rank, but no real distance)
51. Likert ITEM & LIKERT Scale
• Fundamentally, these scales do not represent a
measurable quantity
• An individual may respond 8 and be in less pain
than someone else who responded 5
• A person responding 4 may not be in exactly half as much pain as a person responding 8
• Visual Analog Scale is a Likert scale but often
(wrongly) analyzed as if it were continuous data
52. COMPOSITE SCORE & LIKERT Scale
• Composite scores combine multiple Likert item
scales into a single scale
• Composite scores must first be analyzed for
internal consistency and inter-item correlation for
each item and reported (ex: using Cronbach’s
alpha – scale reliability analysis)
• These scores represent ordinal data, so non-parametric tests and descriptive statistics must be used
53. Cronbach’s Alpha For scales
• Checks the internal consistency and overall validity of a multiple Likert-type item scale
• Check α with each item deleted, one at a time
• Based on the number of items and a comparison of their variances
54. Cronbach’s Alpha For scales
• Values of α range from 0 to 1
• Ideally overall α and α for each item (when
deleted from scale) must be > 0.7 to 0.8
• Clinical scores need higher α > 0.8 to 0.9
(Bland-Altman)
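The standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), is easy to compute directly. A minimal Python/NumPy sketch (the 5-respondent, 4-item scores are invented; statistics packages such as SPSS or R's psych package report the same quantity in their reliability analyses):

```python
# Cronbach's alpha from first principles:
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = respondents, columns = Likert items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented 5-respondent, 4-item scale for illustration
scores = np.array([[4, 5, 4, 4],
                   [3, 3, 2, 3],
                   [5, 5, 5, 4],
                   [2, 2, 3, 2],
                   [4, 4, 4, 5]])
print(f"alpha = {cronbach_alpha(scores):.3f}")  # aim for > 0.7 to 0.8
```

Perfectly correlated items give alpha = 1; uncorrelated items drive it toward 0, which matches the 0-to-1 range described above.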
55. Power analysis & effect size
• To calculate sample size (n) we must know the type
of statistical test involved in our primary outcome
measure
• We must also know:
• Desired α error (usually taken as 0.05)
• Power (1-β) usually taken 0.8 (80%) or greater
• Two or one-tailed comparison
• Effect size
56. Power analysis & effect size
• Power is the fraction of experiments that you expect to yield a "statistically significant" p-value (e.g., 80% of such experiments would yield a significant p-value)
• Effect size (Cohen’s d for mean) depends on study design,
it is calculated by data from pilot studies or reference
studies
• Effect size depends on a clinically defined level of
significance (ex: more than 20% difference between 2
groups, with difference for proportion or mean ± SD data
etc)
57. Power analysis & effect size
• Cohen's d is usually calculated from pilot studies, but if the effect size is unknown, Jacob Cohen provided 3 rough rule-of-thumb effect sizes (the values vary slightly for different statistical tests):
1. Small effect: d around 0.2 (requires large sample sizes)
2. Medium effect: d around 0.5 (seen with careful observation; use when in doubt)
3. Large effect: d greater than 0.8 (if large, it is obvious)
• This usage of d has been criticized as "T-shirt" effect sizes
58. Power analysis & effect size
• Calculation of the required sample size, with a set target for power, before starting the final study is called a priori analysis (before the fact) – the accepted method, especially important to avoid incorrectly being "blind" to a real difference in a negative study (due to a large β error)
• Calculation of the required sample size at the end of the final study is called post hoc analysis (after the fact) – incorrect, as the computed power is a simple reflection of the p-value!
• G*Power software is a free useful resource
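As a rough cross-check on what G*Power reports, the per-group n for a two-tailed, two-group comparison can be approximated with the normal-approximation formula n ≈ 2 * ((z(1-α/2) + z(1-β)) / d)². This Python/SciPy sketch uses that approximation only (G*Power additionally applies small-sample t-distribution corrections, so its answers are slightly larger):

```python
# A priori sample-size sketch for a two-group, two-tailed comparison
# using the normal approximation: n per group = 2 * ((z_a + z_b) / d)^2
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """d is Cohen's d; returns subjects needed per group (rounded up)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # approx. 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # approx. 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):  # Cohen's small / medium / large
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

Note how strongly n depends on d: a small effect (d = 0.2) needs many times the sample of a large one (d = 0.8), which is why pilot-study effect sizes matter so much.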
Editor's notes
Nominal: Categorical data and numbers that are simply used as identifiers or names represent a nominal scale of measurement. Numbers on the back of a football jersey and your social security (Aadhar) number are examples.
Ordinal: An ordinal scale of measurement represents an ordered series of relationships or rank order. Individuals competing in a contest may be fortunate to achieve first, second, or third place.
Likert-type scales (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data. Fundamentally, these scales do not represent a measurable quantity. An individual may respond 8 to this question and be in less pain than someone else who responded 5. A person may not be in exactly half as much pain if they responded 4 than if they responded 8.
Interval: A scale that represents quantity and has equal units but for which zero represents simply an additional point of measurement is an interval scale. The Fahrenheit scale is a clear example of the interval scale of measurement. Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit represent interval data. Zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above and below it (for example, -10 degrees Fahrenheit).
Ratio: The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equality of units. However, this scale also has an absolute zero (no numbers exist below zero). Very often, physical measures will represent ratio data (for example, height and weight). If one is measuring the length of a piece of wood in centimeters, there is quantity, equal units, and that measure cannot go below zero centimeters.
Parametric means that it meets certain requirements with respect to parameters of the population (for example, the data will be normal--the distribution parallels the normal or bell curve). In addition, it means that numbers can be added, subtracted, multiplied, and divided. Parametric data are analyzed using statistical techniques identified as Parametric Statistics. As a rule, there are more statistical technique options for the analysis of parametric data and parametric statistics are considered more powerful than nonparametric statistics.
Nonparametric data are lacking those same parameters and cannot be added, subtracted, multiplied, and divided. For example, it does not make sense to add Social Security numbers to get a third person. Nonparametric data are analyzed by using Nonparametric Statistics.
The normality tests all report a P value. To understand any P value, you need to know the null hypothesis. In this case, the null hypothesis is that all the values were sampled from a Gaussian distribution. The P value answers the question:
If that null hypothesis were true, what is the chance that a random sample of data would deviate from the Gaussian ideal as much as these data do?
If n>30 Z-test is better
Because the one-tailed test provides more power to detect an effect, you may be tempted to use a one-tailed test whenever you have a hypothesis about the direction of an effect. Before doing so, consider the consequences of missing an effect in the other direction. Imagine you have developed a new drug that you believe is an improvement over an existing drug. You wish to maximize your ability to detect the improvement, so you opt for a one-tailed test. In doing so, you fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.
Degrees of freedom:
For the z-test, degrees of freedom are not required, since z-scores of 1.96 and 2.58 are used for 5% and 1% respectively. For the equal-variance t-test, df = (n1 + n2) - 2 (Welch's unequal-variance t-test uses an adjusted df). For the paired-sample t-test, df = number of pairs - 1.
In the approach of Ronald Fisher, the null hypothesis H0 will be rejected when the p-value of the test statistic is sufficiently extreme (vis-a-vis the test statistic's sampling distribution) and thus judged unlikely to be the result of chance. In a one-tailed test, "extreme" is decided beforehand as either meaning "sufficiently small" or meaning "sufficiently large" – values in the other direction are considered not significant. In a two-tailed test, "extreme" means "either sufficiently small or sufficiently large", and values in either direction are considered significant
The p-value of Karl Pearson's chi-squared test is computed differently (from the chi-square distribution).
When we test quantitative data, we need to see whether the data is normally distributed or non-normally distributed. Use Spearman's correlation for ordinal or non-normally distributed data, and Pearson's correlation for normally distributed data.
When we have a normal distribution, look at the number of groups and whether the data is paired. Paired data is where each patient in one group is matched against a similar patient in the other group. If we have two groups and the data is unpaired, use Student's t-test. If there are two groups and the data is paired, use the paired Student's t-test. If there are > 2 groups, use ANOVA for paired or unpaired data as appropriate.
In a non-normal distribution, if there are two groups and the data is unpaired, use the Mann-Whitney U test, and if the data is paired, use the Wilcoxon Signed Rank Sum test. If there are > 2 groups and the data is unpaired, use the Kruskal-Wallis test, and if the data is paired, use Friedman's test.
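These decision rules map directly onto SciPy calls; a sketch with invented example data (in R these would be wilcox.test, kruskal.test, and friedman.test):

```python
# Non-parametric test choices for non-normal data, per the rules above.
from scipy import stats

# Two unpaired groups -> Mann-Whitney U test
drug = [3, 5, 4, 6, 2, 5]
placebo = [7, 8, 6, 9, 7, 8]
u, p_u = stats.mannwhitneyu(drug, placebo, alternative="two-sided")

# Two paired groups -> Wilcoxon signed-rank test
before = [5, 7, 6, 8, 7, 9]
after  = [4, 6, 5, 7, 5, 8]
w, p_w = stats.wilcoxon(before, after)

# > 2 unpaired groups -> Kruskal-Wallis test
h, p_h = stats.kruskal([1, 2, 3], [4, 5, 6], [7, 8, 9])

print(f"Mann-Whitney p = {p_u:.4f}, Wilcoxon p = {p_w:.4f}, Kruskal-Wallis p = {p_h:.4f}")
```

For > 2 paired groups, stats.friedmanchisquare covers the Friedman test in the same style.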
Chi square distribution for different degree of freedom df = (Row-1) x (Column-1)
Fisher's exact test uses binomial coefficients and factorials to compute the exact probability of the observed table.
If there is more than one factor (more than one-way) for testing means between groups, it is called Factorial ANOVA.
Null hypothesis in ANOVA is means of all the groups are equal
Omnibus (one for all)
Dependent variable depends on the Independent variable (it’s the effect) while Independent variable (the cause being tested) doesn’t depend on any other variable in the experiment but is directly controlled by the researcher.
If the Between Group Variation is significantly greater than the Within Group Variation, then it is likely that there is a statistically significant difference between the groups.
From http://web.utah.edu/stat/introstats/anovaflash.html
S = SUBJECT DV = DEPENDENT VARIABLE
If multiple covariates are present we compute multiple correlation coefficients (Multiple Regression)
If there are two factors & repeated measures are done then a two-way ANOVA of repeated measures is done
The important point is that the same people are being measured more than once on the same dependent variable (i.e., why it is called repeated measures).
Unfortunately, repeated measures ANOVAs are particularly susceptible to violating the assumption of sphericity, which causes the test to become too liberal (i.e., leads to an increase in the Type I error rate; that is, the likelihood of detecting a statistically significant result when there isn't one). Fortunately, SPSS makes it easy to test whether your data has met or failed this assumption with Mauchly's Test of Sphericity
Once an overall significant difference in means is detected we have to do a pairwise comparison with Post hoc Bonferroni test to discover specific means that differ
Dependent variable measured in same subjects at different times (independent variable)
Dependent variable measured in same subjects at different times (independent variable) compared for two factors
There is a within-subjects factor (Time) and a between-subjects factor (group)
There will be increased Type 1 errors
In ANOVA effect size is by Partial Eta squared
Cohen’s d is mean difference divided by pooled standard deviation
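That definition is simple to compute directly; a plain-Python sketch (the two "pilot study" groups are invented numbers for illustration):

```python
# Cohen's d: mean difference divided by the pooled standard deviation.
import math

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Sample variances (ddof = 1)
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Invented pilot-study data
pilot_a = [12.1, 11.4, 13.0, 12.6, 11.9]
pilot_b = [10.2, 10.9, 9.8, 10.5, 11.1]
print(f"Cohen's d = {cohens_d(pilot_a, pilot_b):.2f}")
```

The resulting d is what feeds the a priori power analysis described in slides 55-58.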