2. Contents
• Definitions
• Milestone in Statistics
• Chi square test
• Chi Square test Goodness of Fit
• Chi square test for homogeneity of Proportion
• Chi Square Independent test
• Limitation of Chi square
• Fisher Exact test
• Continuity correction
• Overuse of chi square
3. Definitions
• Statistics is defined as the science that deals with the collection, presentation, analysis and interpretation of data.
• Biostatistics is defined as the application of statistical methods to medical, biological and public health related problems.
5. Introduction
• Data: A collection of facts from which conclusions can be made.
• Observations made on the subjects one after the other are called raw data
– Raw data become useful when they are arranged and organized in a manner that lets us extract information from the data and communicate it to others.
7. Definitions
• A variable is any characteristic, number, or quantity that can be measured or counted.
– Independent variable: is not changed by the other variables, e.g. age
– Dependent variable: depends on other factors, e.g. test score depends on time studied
• Parameter: any numerical quantity that characterizes a given population or some aspect of it, e.g. mean
9. Qualitative Data
• Qualitative variables
• Example: gender (male, female)
• Frequency in category
• Nominal or ordinal scale
• Examples
– Do you have a disease? - nominal
– What is the socioeconomic status? - ordinal
10. MILESTONE IN STATISTICS
• "Karl Pearson's famous chi-square paper appeared in the spring of 1900, an auspicious beginning to a wonderful century for the field of statistics." (published in the Philosophical Magazine)
11. Chi Square Test
• Simplest & most widely used non-parametric
test in statistical work.
12. Logic of the chi-square
• The total number of observations in each
column and the total number of observations in
each row are considered to be given or fixed.
• If we assume that columns and rows are independent, we can calculate the expected frequencies.
13. Logic of Chi square
• Compares the observed frequency in each cell with the expected frequency.
• If no relationship exists between the column and row variable, the observed frequencies will be very close to the expected frequencies; they will differ only by small amounts, and the value of the chi-square statistic will be small.
• If a relationship (or dependency) does occur, the observed frequencies will vary from the expected frequencies, and the value of the chi-square statistic will be large.
14. Steps for Chi square test
Define Null and alternative hypothesis
State alpha
Calculate degree of freedom
State decision rule
Calculate test statistics
State and Interpret results
15. Hypothesis Testing
• Tests a claim about a parameter using evidence (data in a sample)
• Indicates whether an association is statistically supported; on its own it does not establish a causal relationship
Steps
1. Formulate Hypothesis about the population
2. Random sample
3. Summarizing the information (descriptive statistic)
4. Does the information given by the sample support the
hypothesis? Are we making any error? (inferential stat.)
• Decision rule: Convert the research question to null and
alternative hypothesis
16. Null Hypothesis
• H0 = No difference between observed and
expected observations
• H1 = difference is present between observed
and expected observations
17. What is statistical significance?
• A statistical concept indicating that the result is very unlikely to be due to chance and, therefore, likely represents a true relationship between the variables.
• Statistical significance is usually indicated by the p value (probability value), which should be smaller than the chosen significance level (alpha).
18. State alpha value
• Alpha is the probability of a type I error, that is, rejecting a true null hypothesis
• For the majority of studies alpha is 0.05
• Meaning: the investigator has set 5% as the maximum chance of incorrectly rejecting the null hypothesis
19. Degree of freedom
It is a positive whole number that indicates the lack of restrictions in calculations.
The degree of freedom is the number of values in a calculation that can vary.
Calculation
• For Goodness of Fit = Number of levels (outcome) - 1
• For independent variables / Homogeneity of proportion: (No. of columns - 1)(No. of rows - 1)
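The two degree-of-freedom rules on this slide can be written as small helper functions. This is only an illustrative sketch; the function names are hypothetical and not part of the original slides.

```python
def df_goodness_of_fit(num_levels: int) -> int:
    """Degrees of freedom for a goodness-of-fit test: number of levels - 1."""
    return num_levels - 1


def df_contingency(num_rows: int, num_cols: int) -> int:
    """Degrees of freedom for independence / homogeneity: (rows - 1)(columns - 1)."""
    return (num_rows - 1) * (num_cols - 1)


print(df_goodness_of_fit(4))  # four categories, e.g. four seasons
print(df_contingency(2, 3))   # a table with 2 rows and 3 columns
```

Both helpers only encode the counting rules; they do not check that the table itself is valid.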
20. The Chi-Square Distribution
• No negative values
• Mean is equal to the
degrees of freedom
• The standard deviation
increases as degrees of
freedom increase, so the
chi-square curve spreads
out more as the degrees of
freedom increase.
• As the degrees of freedom become
very large, the shape becomes
more like the normal distribution.
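The properties listed above can be checked with a small simulation. The sketch below assumes the standard fact that a chi-square variable with df degrees of freedom is a sum of df squared standard normal values, with mean df and standard deviation sqrt(2 × df); the function names, the seed and the number of draws are arbitrary choices for illustration.

```python
import math
import random


def simulate_chi_square(df, draws=20000, seed=0):
    """Draw chi-square values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df)) for _ in range(draws)]


def mean_and_sd(values):
    """Sample mean and sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)


for df in (1, 5, 20):
    sample = simulate_chi_square(df)
    mean, sd = mean_and_sd(sample)
    print(f"df={df:2d}  mean={mean:.2f} (theory {df})  "
          f"sd={sd:.2f} (theory {math.sqrt(2 * df):.2f})")
```

The simulated means should sit close to the degrees of freedom, and the spread should grow as the degrees of freedom increase.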
22. The Chi-Square Distribution
• The chi-square distribution is different for each value of the degrees of freedom, so different critical values correspond to different degrees of freedom.
• We find the critical value that separates the area defined by α from that defined by 1 – α.
23. Finding Critical Value
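Q. What is the critical χ2 value if df = 2 and α = 0.05?
[Figure: chi-square curve for df = 2. The upper-tail area α = 0.05 beyond the critical value is the "Reject H0" region; the area below it is "Do not reject H0". If ni = E(ni), χ2 = 0.]
χ2 Table (Portion), columns are upper-tail areas (significance level)
DF    0.995   …   0.95    …   0.05
1     ...     …   0.004   …   3.841
2     0.010   …   0.103   …   5.991
Critical value: χ2 = 5.991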
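The table entries for df = 1 and df = 2 can be reproduced without a printed table. The sketch below relies on two standard facts that are not stated on the slide: with df = 1 the chi-square variable is the square of a standard normal value, and with df = 2 the upper-tail area is exp(-x/2). The function names are hypothetical, and other degrees of freedom would need a statistics library.

```python
import math
from statistics import NormalDist


def critical_value_df1(alpha):
    """Upper-tail critical value for df = 1: square of the two-sided z value."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2


def critical_value_df2(alpha):
    """Upper-tail critical value for df = 2: solve exp(-x / 2) = alpha."""
    return -2.0 * math.log(alpha)


print(round(critical_value_df1(0.05), 3))  # df = 1, alpha = 0.05
print(round(critical_value_df2(0.05), 3))  # df = 2, alpha = 0.05
```

The printed values should agree with the 0.05 column of the table portion (3.841 and 5.991).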
24. State decision rule
If the calculated chi-square value is greater than the critical value of chi square, the null hypothesis will be rejected
25. Expected Value
Calculate test statistics
• Calculated using the formula
χ² = ∑ (O - E)² / E
O = observed frequencies
E = expected frequencies
Question >>> How to find the Expected value
• Chi square for goodness of fit: the expected value comes from
– a theory
– a previous study
– comparison groups
– a standard
• Chi square for independent variables / Homogeneity of proportion:
Expected Value = (Row total × Column total) / Table total
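The expected-value rule for a contingency table (row total × column total / table total) can be sketched as follows. The 2 x 2 counts used here are made-up illustration values, not data from the slides, and the function name is hypothetical.

```python
def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical observed counts, for illustration only.
observed = [
    [10, 20],
    [30, 40],
]

for row in expected_counts(observed):
    print([round(e, 2) for e in row])
```

The expected counts keep the same row and column totals as the observed table, which is why those totals are treated as fixed.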
26. State and interpret results
• See whether the value of chi square is more than or less than the critical value
If the value of chi square is less than the critical value, we fail to reject the null hypothesis
If the value of chi square is more than the critical value, the null hypothesis can be rejected
27. Chi square test
• Goodness of fit
• For homogeneity of Proportions
• For 2 independent groups
– Cohort Study
– Case control study
– Matched case control Study
• For > 2 independent groups
28. Goodness of fit
Q. How "close" are the observed values to those which would be expected in a study?
Q. Whether a variable has a frequency distribution comparable to the one expected.
Expected frequency can be based on
• theory
• previous experience
OR
• comparison groups
Chi-square goodness of fit test
29. Goodness of fit
• A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
• It is a test of the agreement or conformity
between the observed frequencies (Oi) and the
expected frequencies (Ei) for several classes or
categories (i)
30. Example: Is Sudden Infant Death Syndrome seasonal?
Null Hypothesis: The proportion of deaths due to SIDS in winter, summer, autumn and spring is equal = ¼ = 25%
Alternative: Not all probabilities stated in the null hypothesis are correct
SIDS cases   Observed   Expected = 322 × 1/4
Summer       78         80.5
Spring       71         80.5
Autumn       87         80.5
Winter       86         80.5
Total        322
Degree of freedom = k - 1 = 4 - 1 = 3
For α = 0.05 and df = 3, critical value X2 = 7.81
X2 = (78 - 80.5)²/80.5 + (71 - 80.5)²/80.5 + (87 - 80.5)²/80.5 + (86 - 80.5)²/80.5 = 2.10
Conclusion: As the calculated X2 value is less than the critical value, we fail to reject the null hypothesis and state that deaths due to SIDS across seasons are not statistically different from what is expected by chance (i.e. all seasons being equal)
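The SIDS calculation can be reproduced in a few lines. The sketch below uses the observed counts from the slide and the equal-proportion null hypothesis; the variable names are illustrative, and the critical value 7.81 is taken from the slide rather than computed.

```python
# Observed SIDS cases per season, as given in the example.
observed = {"Summer": 78, "Spring": 71, "Autumn": 87, "Winter": 86}

total = sum(observed.values())
expected = total / len(observed)          # equal proportions: 322 * 1/4

# Chi-square statistic: sum of (O - E)^2 / E over the categories.
chi_square = sum((o - expected) ** 2 / expected for o in observed.values())

df = len(observed) - 1
critical_value = 7.81                     # alpha = 0.05, df = 3 (from the slide)

print(f"expected per season = {expected}")
print(f"chi-square = {chi_square:.3f}, df = {df}")
print("reject H0" if chi_square > critical_value else "do not reject H0")
```

Because the statistic is well below the critical value, the script reaches the same conclusion as the slide.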
32. Homogeneity of proportions
• In a chi-square test for homogeneity of
proportions, we test the claim that different
populations have the same proportion of
individuals with some characteristic.
EXAMPLE: Is there evidence to indicate that the perception of the effects of vaccination was the same in 2013 as in 2000?
Q. What is the effect of vaccination on health?
Answers: Good, No, Bad
Null hypothesis: Ho = No difference between the two populations
H1 = There is a difference between the two populations
33. State alpha = 0.05, find df = (3-1)(2-1) = 2
Critical value from the chi-square distribution: X2 = 5.991
34. Observed and expected frequencies
Observed
               Good effect   No effect   Bad effect   Row total
2000           656           283         50           989
2013           726           222         50           998
Column total   1382          505         100          1987
Expected frequency
               Good effect                 No effect                  Bad effect
2000           (989)(1382)/1987 = 687.87   (989)(505)/1987 = 251.36   (989)(100)/1987 = 49.77
2013           (998)(1382)/1987 = 694.13   (998)(505)/1987 = 253.64   (998)(100)/1987 = 50.23
35. Homogeneity of proportions
• χ² value = ∑ (O - E)²/E
Calculated χ² = 10.871
Results: as 10.871 > 5.991 we reject the null hypothesis at 0.05 significance.
> There is a statistically significant difference in the perception of the effects of vaccination between 2000 and 2013
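The vaccination example can be checked numerically. The sketch below uses the observed counts from the example (Good, No, Bad effect for 2000 and 2013); the function name is hypothetical, and the p value line relies on the assumption that for df = 2 the upper-tail area of the chi-square distribution is exp(-x/2).

```python
import math

# Observed counts: columns are Good effect, No effect, Bad effect.
observed = [
    [656, 283, 50],  # year 2000
    [726, 222, 50],  # year 2013
]


def chi_square_statistic(table):
    """Sum of (O - E)^2 / E with E = row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            statistic += (o - e) ** 2 / e
    return statistic


statistic = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-statistic / 2)  # valid for df = 2 only

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.4f}")
print("reject H0" if statistic > 5.991 else "do not reject H0")
```

The statistic comes out at about 10.87, in line with the slide's value up to rounding of the expected frequencies.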
37. Chi square Independence test
• It is used to find out whether there is an
association between a row variable and column
variable in a contingency table constructed
from sample data.
38. Assumption
• The observations should be independent.
• All expected frequencies are greater than or equal to 1 (i.e., E ≥ 1)
• No more than 20% of the expected frequencies are less than 5
Calculated as
χ² value = ∑ (O - E)²/E
39. Expected Count
Exposure     Disease present   Disease neg.   Total
Present      a                 b              a+b
Negative     c                 d              c+d
Total        a+c               b+d            tt
Marginal probability (exposure present) = (a+b)/tt
Marginal probability (disease present) = (a+c)/tt
Joint probability (assuming independence) = (a+b)/tt × (a+c)/tt
Expected count = sample size (tt) × (a+b)/tt × (a+c)/tt
43. Application in various studies
• Cohort study
• Case control study
• Matched case control study
44. Cohort Study
Assumptions:
• The two samples are independent
• Let a+b = number of people exposed to the risk factor
• Let c+d = number of people not exposed to the risk factor
Assess whether there is association between exposure and disease
by calculating the relative risk (RR)
45. Example: To test the association in a cohort study between smoking and Lung CA
Null hypothesis: Ho = No association between smoking and Lung CA (RR = 1)
H1 = Association present b/w smoking and Lung CA
We can define the relative risk of disease:
p1 = (Incidence of disease in exposure present)
p2 = (Incidence of disease in exposure absent)
Relative risk RR = p1/p2
Hence for these studies RR = (a/(a+b))/(c/(c+d))
Smoking   Lung CA present   Lung CA absent   Total
YES       84                2914             3000
NO        87                4913             5000
TOTAL     171               7827             8000
RR = (84/3000)/(87/5000) = 1.61
We can test the hypothesis that RR = 1 by calculating the chi-square test statistic
Alpha value = 0.05 and df = 1
CONCLUSION: As the X2 > 3.84 we reject the null hypothesis of RR = 1 at 0.05 significance.
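The relative risk and the chi-square statistic for the cohort example can be sketched as below. The slide does not show the chi-square value itself, so the number printed here is only what the stated counts imply. One assumption is made and should be kept in mind: the non-diseased counts are derived from the row totals (3000 and 5000), because the printed cell 2914 does not add up with 84 to the stated total of 3000. The function names are hypothetical, and the chi-square uses the usual 2 x 2 shortcut formula n(ad - bc)² / [(a+b)(c+d)(a+c)(b+d)].

```python
def relative_risk(a, b, c, d):
    """RR = incidence in exposed / incidence in unexposed."""
    return (a / (a + b)) / (c / (c + d))


def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Cases and row totals as stated on the slide.
a, exposed_total = 84, 3000
c, unexposed_total = 87, 5000
# ASSUMPTION: non-diseased counts derived from the stated row totals.
b = exposed_total - a
d = unexposed_total - c

rr = relative_risk(a, b, c, d)
chi_square = chi_square_2x2(a, b, c, d)

print(f"RR = {rr:.2f}")
print(f"chi-square = {chi_square:.2f} (critical value 3.84 at df = 1)")
```

With these counts the statistic exceeds 3.84, which matches the slide's decision to reject RR = 1.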
46. Case control study
Assumptions
• The samples are independent
• Cases = diseased individuals = a+c
• Controls = non-diseased individuals = b+d
Assess whether there is association between exposure and
disease by calculating the odds ratio (OR)
47. Example: To test the association in a case control study between CHD and smoking
Null hypothesis Ho: No association between CHD and smoking (OR = 1)
H1 = Association exists between CHD and smoking (OR > 1 or < 1)
• Odds Ratio = odds of exposure amongst diseased group / odds of exposure amongst non diseased
• Odds of exposure amongst diseased = (a/(a+c))/(c/(a+c)) = a/c
• Odds of exposure amongst non diseased = (b/(b+d))/(d/(b+d)) = b/d
• Odds Ratio = ad/bc
• Odds Ratio = (112 × 224)/(88 × 176) = 1.62
We can test whether OR = 1 by calculating the chi-square
Alpha value = 0.05 and df = 1
Conclusion: we reject the null hypothesis that odds ratio = 1 at 0.05 significance as X2 > 3.84
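The odds ratio calculation and the accompanying chi-square test can be sketched as follows. The 2 x 2 table is not shown on the slide, so the cell assignment (a = 112, b = 88, c = 176, d = 224) is an assumption inferred from the arithmetic ad/bc = 112*224/88*176; the function names are hypothetical. The p value uses the standard identity that, for df = 1, the upper-tail area equals erfc(sqrt(x/2)).

```python
import math


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad / bc."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a 2 x 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# ASSUMPTION: cell layout inferred from the odds ratio arithmetic on the slide.
a, b, c, d = 112, 88, 176, 224

or_estimate = odds_ratio(a, b, c, d)
chi_square = chi_square_2x2(a, b, c, d)
p_value = math.erfc(math.sqrt(chi_square / 2))  # upper tail, df = 1 only

print(f"OR = {or_estimate:.2f}")
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

Swapping b and c would leave both the odds ratio and the chi-square value unchanged, so the conclusion does not depend on that part of the assumption.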
48. Matched case control study
• Case-control pairs are matched on characteristics such as age, race, sex
Assumptions
• Samples are not independent
• The discordant pairs are case-control pairs with different exposure histories
• The matched odds ratio is estimated by bb/cc
Pairs in which cases exposed but controls not = bb
Pairs in which controls exposed but cases not = cc
Assess whether there is association between exposure and disease by calculating
the matched odds ratio (OR)
49. To test association of smoking exposure and CHD in a matched case control study
Null hypothesis: No association of smoking exposure and CHD (OR = 1)
Alternative Hypothesis: Association exists between smoking exposure and CHD (OR > 1 or < 1)
                                        CHD absent
                                        Smoking history present   Smoking history absent
CHD present   Smoking history present   20                        40 (bb)
              Smoking history absent    10 (cc)                   30
• Test whether OR = 1 by calculating McNemar's statistic
Alpha value = 0.05 and df = 1
OR = 40/10 = 4
X2 = [|40 - 10| - 1]²/(40 + 10) = 841/50 = 16.82
Conclusion: We reject the Null Hypothesis that OR = 1 as calculated X2 > 3.84
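McNemar's statistic for the matched pairs above uses only the discordant pairs. A minimal sketch, with hypothetical function names and the counts bb = 40 and cc = 10 from the example:

```python
def matched_odds_ratio(bb, cc):
    """Matched OR: discordant pairs with exposed case / with exposed control."""
    return bb / cc


def mcnemar_statistic(bb, cc, continuity=True):
    """McNemar chi-square: (|bb - cc| - 1)^2 / (bb + cc) with correction."""
    diff = abs(bb - cc)
    if continuity:
        diff = max(diff - 1, 0)
    return diff ** 2 / (bb + cc)


bb, cc = 40, 10  # discordant pairs from the example

print(f"matched OR = {matched_odds_ratio(bb, cc):.1f}")
print(f"McNemar chi-square = {mcnemar_statistic(bb, cc):.2f} (critical value 3.84)")
```

The concordant pairs (20 and 30) do not enter the calculation, because they carry no information about a difference in exposure within pairs.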
50. Chi square for > 2 independent variables
• The chi-square test is used regardless of whether the research question is phrased in terms of proportions or frequencies
• Contingency tables can have any number of rows and columns.
• The sample size needs to increase as the number of categories increases to keep the expected values of an acceptable size.
51. Limitation of Chi square test
• Conditions under which the chi-square approximation is adequate:
– No expected frequency should be <2
– No more than 20% of the cells should have an expected frequency <5
Question: What to do when these assumptions are not met?
Fisher Exact test
52. Fisher Exact test
• Gives exact probability of the occurrence of
the observed frequencies
• Fisher's exact test is especially appropriate
with
– small sample sizes (Total number of cases is
<20 )
or
– if expected number of cases in any cell is <2
or
– If more than 20% of the cells have expected
frequencies <5
Ronald A. Fisher (1890–1962)
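The "exact probability" in Fisher's test comes from the hypergeometric distribution with the row and column totals held fixed. The sketch below is one common way to compute a two-sided p value (summing the probabilities of all tables that are no more likely than the observed one); the function names and the small 2 x 2 table are hypothetical and not taken from the slides.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2 x 2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Sum the probabilities of all tables at most as likely as the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    p_observed = table_probability(a, b, c, d)
    p_value = 0.0
    # x is the top-left cell; its range is limited by the fixed margins.
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        p_x = table_probability(x, row1 - x, col1 - x, row2 - (col1 - x))
        if p_x <= p_observed * (1 + 1e-9):  # small tolerance for float ties
            p_value += p_x
    return p_value


# Hypothetical small-sample table, for illustration only.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))
```

With only eight observations every expected count is below 5, which is the kind of situation where the exact test is preferred over the chi-square approximation.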
53. Continuity correction
• It subtracts ½ from the absolute difference between observed and expected frequencies in the numerator of χ2 before squaring;
• It makes the value for χ2 smaller >>>> the null hypothesis is rejected less often >> decreased type I error
• In the shortcut formula, n/2 is subtracted from the absolute value of ad – bc prior to squaring.
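The effect of the continuity correction on the shortcut formula can be seen by computing both versions side by side. This sketch reuses the counts from the case control example's odds ratio arithmetic (112, 88, 176, 224); the assignment of those numbers to cells a, b, c, d is an assumption, and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d, continuity=False):
    """Shortcut chi-square for a 2 x 2 table, optionally with n/2 subtracted."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(diff - n / 2, 0)  # continuity correction before squaring
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# ASSUMPTION: cell layout inferred from the odds ratio example.
a, b, c, d = 112, 88, 176, 224

uncorrected = chi_square_2x2(a, b, c, d)
corrected = chi_square_2x2(a, b, c, d, continuity=True)

print(f"uncorrected chi-square = {uncorrected:.3f}")
print(f"corrected chi-square   = {corrected:.3f}")
```

The corrected value is always the smaller of the two, which is why the corrected test is the more conservative one.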
54. Overuse of Chi square
When two groups are being analyzed and the characteristic of
interest is measured on a numerical scale.
Instead of correctly using the t test, researchers convert the
numerical scale to an ordinal or even binary scale and then use
chi-square
When numerical variables are analyzed with methods designed
for ordinal or categorical variables, the greater specificity or
detail of the numerical measurement is wasted.
Categorize a numerical variable, such as age, but only after
investigating whether the categories are appropriate
55. Take Home Message
• The chi square test is applied to qualitative data, whether nominal or ordinal.
• Before applying the chi square test, check that all assumptions are met
• If the value of chi square is large >>> there is a high probability of rejecting the null hypothesis
• If the value of chi square is small >>> there is less probability of rejecting the null hypothesis
Editor's Notes
Data are measurements or observations that are collected as a source of information. A data unit is one entity (such as a person or business) in the population being studied, about which data are collected. An observation is an occurrence of a specific data item that is recorded about a data unit. A dataset is a complete collection of all observations.
History of statistics can be said to start around 1749 although, over time, there have been changes to the interpretation of the word statistics. Philosophical Magazine, Series 5, Vol 50, pp. 157-175. The Philosophical Magazine is one of the oldest scientific journals published in English. It was established by Alexander Tilloch in 1798.
(These column and row totals are also called marginal frequencies.) The expected frequencies are the number of observations expected to occur by chance.
If no relationship exists between the column and row variable, the observed frequencies will be close to the expected frequencies. If a relationship (or dependency) does occur, the observed frequencies will vary quite a bit from the expected frequencies, and the value of the chi-square statistic will be large.
Calculate degree of freedom
If β is set at 0.10, then the investigator has decided that he is willing to accept a 10% chance of missing an association of a given effect size between Tamiflu and psychosis. This represents a power of 0.90, i.e., a 90% chance of finding an association of that size. For example, suppose that there really would be a 30% increase in psychosis incidence if the entire population took Tamiflu. Then 90 times out of 100, the investigator would observe an effect of that size or larger in his study. This does not mean, however, that the investigator will be absolutely unable to detect a smaller effect; just that he will have less than 90% likelihood of doing so. Many studies set alpha at 0.05 and beta at 0.20 (a power of 0.80). These are somewhat arbitrary values, and others are sometimes used; the conventional range for alpha is between 0.01 and 0.10; and for beta, between 0.05 and 0.20. In general the investigator should choose a low value of alpha when the research question makes it particularly important to avoid a type I (false-positive) error, and he should choose a low value of beta when it is especially important to avoid a type II error.
The mean of the chi-square distribution is equal to the degrees of freedom; therefore, as the degrees of freedom increase, the mean moves more to the right. A one-sided test or directional test is one in which the direction of departure from the null hypothesis has been specified in advance. This is only applicable in the case of a single degree of freedom test, since with more than one degree of freedom, there is more than just a single direction by which the results can depart from the null hypothesis. Thus all tests involving multiple degrees of freedom for an hypothesis are going to be two-sided or nondirectional. If the distribution being used in the test is symmetric, then one-sided corresponds with one-tailed. However, with distributions such as chi-squares and Fs, which are not symmetric, the standard tests use only one tail, but are two-sided, or nondirectional. The chi-square distribution with 1 df is the same as the distribution of the square of a Z or standard normal distribution. The F distribution with 1 numerator df is the same as the distribution of the square of a t with the same denominator df. Compare the critical values in tables or that you get from SPSS to verify these relationships. They show that a one-tailed chi-square test gives the same results as a two-tailed Z-test, and that a one-tailed F test gives the same results as a two-tailed t test; all are two-sided or nondirectional.
Only upper-tailed values for χ2 because these are the values generally used in hypothesis testing
Sum of probability is equal to 1
McNemar: used for a repeated measures design or a matched case control study
A smaller value for χ2 means that the null hypothesis will not be rejected as often as it is with the larger, uncorrected chi-square; that is, it is more conservative. Thus, the risk of a type I error (rejecting the null hypothesis when it is true) is smaller; however, the risk of a type II error (not rejecting the null hypothesis when it is false and should be rejected) then increases.