Chi Square Test Guide

CHI SQUARE TEST

DR HAR ASHISH
JINDAL
JR

Contents
• Definitions
• Milestone in Statistics
• Chi square test
• Chi Square test Goodness of Fit
• Chi square test for homogeneity of Proportion
• Chi Square Independent test
• Limitation of Chi square
• Fisher Exact test
• Continuity correction
• Overuse of chi square

Definitions
• Statistics is defined as the science which deals
with the collection, presentation, analysis and
interpretation of data.

• Biostatistics is defined as the application of
statistical methods to medical, biological and
public health related problems.

Statistics

Descriptive
• Collecting
• Organizing
• Summarizing
• Presenting data

Inferential
• Making inference
• Hypothesis testing (e.g. Chi Square Test)
• Determining relationships
• Making predictions

Introduction
• Data: a collection of facts from which
conclusions can be made.
• Observations made on the subjects one after
the other are called raw data
– Raw data become useful when they are arranged and
organized in a manner that lets us extract information
from the data and communicate it to others.

Definitions
• A variable is any characteristic, number, or
quantity that can be measured or counted.
– Independent variable: is not changed by the other
variables, e.g. age
– Dependent variable: depends on other factors, e.g. test
score depends on time studied

• Parameter: any numerical quantity that
characterizes a given population or some aspect
of it, e.g. the mean

Data Types
Data
• QUANTITATIVE
– Discrete or continuous
– Interval data or ratio data
• QUALITATIVE
– Nominal
– Ordinal

Qualitative Data
• Qualitative variables
– Example: gender (male, female)
• Frequency in category
• Nominal or ordinal scale
• Examples
– Do you have a disease? – nominal
– What is the socio-economic status? – ordinal

MILESTONE IN STATISTICS
• "Karl Pearson's famous chi-square paper
appeared in the spring of 1900, an auspicious
beginning to a wonderful century for the field
of statistics." (published in the Philosophical
Magazine)

Chi Square Test

• Simplest & most widely used non-parametric
test in statistical work.

Logic of the chi-square
• The total number of observations in each
column and the total number of observations in
each row are considered to be given or fixed.

• If we assume that columns and rows are
independent, we can calculate the expected
frequencies.

Logic of Chi square
• Compares the observed frequency in each cell
with the expected frequency.

If no relationship exists between the
column and row variable
• the observed frequencies will be very close
to the expected frequencies
• they will differ only by small amounts
• the value of the chi-square statistic will be
small

If a relationship (or dependency) does occur
• the observed frequencies will vary from the
expected frequencies
• the value of the chi-square statistic will be
large

Steps for Chi square test
Define Null and alternative hypothesis

State alpha
Calculate degree of freedom
State decision rule

Calculate test statistics
State and Interpret results

Hypothesis Testing
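• Tests a claim about a parameter using evidence (data in
a sample)
• Helps assess whether a relationship between variables is real
or due to chance; a significant result alone does not prove causation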

Steps
1. Formulate Hypothesis about the population
2. Random sample
3. Summarizing the information (descriptive statistic)
4. Does the information given by the sample support the
hypothesis? Are we making any error? (inferential stat.)
• Decision rule: Convert the research question to null and
alternative hypothesis

Null Hypothesis
• H0 = No difference between observed and
expected observations
• H1 = difference is present between observed
and expected observations

What is statistical significance?
• A statistical concept indicating that the result is
very unlikely to be due to chance
and, therefore, likely represents a true
relationship between the variables.
• Statistical significance is usually indicated by
the p value (probability value), which
should be smaller than the chosen significance
level (alpha).

State alpha value
• Alpha is the probability of a type I error, that is,
rejecting a true null hypothesis

• For the majority of studies alpha is 0.05
• Meaning: the investigator has set 5% as the
maximum chance of incorrectly rejecting the
null hypothesis

Degree of freedom
The degree of freedom is the number of values in
a calculation that can vary.
It is a positive whole number that indicates the
lack of restrictions in calculations.
Calculation
• For Goodness of Fit = Number of levels (outcome) − 1
• For independent variables / Homogeneity of
proportion: (No. of columns − 1) × (No. of rows − 1)
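The two degree-of-freedom rules can be written as a short Python sketch. The function names are illustrative and do not come from the slides.

```python
def df_goodness_of_fit(levels: int) -> int:
    """Degrees of freedom for goodness of fit: number of levels (outcomes) minus 1."""
    return levels - 1


def df_contingency(rows: int, columns: int) -> int:
    """Degrees of freedom for independence / homogeneity: (rows - 1) * (columns - 1)."""
    return (rows - 1) * (columns - 1)


print(df_goodness_of_fit(4))  # four outcome categories
print(df_contingency(3, 2))   # table with 3 rows and 2 columns
```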

The Chi-Square Distribution
• No negative values
• Mean is equal to the
degrees of freedom
• The standard deviation
increases as degrees of
freedom increase, so the
chi-square curve spreads
out more as the degrees of
freedom increase.
• As the degrees of freedom become
very large, the shape becomes
more like the normal distribution.

The Chi-Square Distribution
• The chi-square distribution is different for each
value of the degrees of freedom, so different
critical values correspond to different degrees of
freedom.
• We find the critical value that separates the
area defined by α from that defined by 1 – α.

Finding Critical Value
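Q. What is the critical χ² value if df = 2 and α = 0.05?

[Figure: chi-square curve for df = 2 with α = 0.05 in the upper tail. The axis starts at 0 (if ni = E(ni), χ² = 0). "Do not reject H0" lies below the critical value 5.991 and "Reject H0" above it.]

χ² Table (Portion)

        Significance level
DF      0.995    …    0.95     …    0.05
1       ...      …    0.004    …    3.841
2       0.010    …    0.103    …    5.991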
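As a rough cross-check of the tabulated values, the critical values for df = 1 and df = 2 can be reproduced with the Python standard library. This sketch relies on two closed forms that hold only for these degrees of freedom; the function names are illustrative, and other df values still need a table or a statistics package.

```python
import math
from statistics import NormalDist


def critical_value_df1(alpha: float) -> float:
    """Upper-tail critical value for df = 1.

    A chi-square variable with 1 degree of freedom is the square of a
    standard normal variable, so the cutoff is the squared two-sided z value.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


def critical_value_df2(alpha: float) -> float:
    """Upper-tail critical value for df = 2.

    With 2 degrees of freedom the area above x is exp(-x / 2),
    so solving exp(-x / 2) = alpha gives x = -2 * ln(alpha).
    """
    return -2 * math.log(alpha)


print(round(critical_value_df1(0.05), 3))
print(round(critical_value_df2(0.05), 3))
```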

State decision rule
If the value obtained is greater than the
critical value of chi square , the null
hypothesis will be rejected

Calculate test statistics
• Calculated using the formula
χ² = ∑ (O − E)² / E
O = observed frequencies
E = expected frequencies

Question >>> How to find the Expected value

Expected Value
• Chi square for goodness of fit: based on
– a theory
– a previous study
– comparison groups
• Chi square for homogeneity of proportion: based on
– a previous study
– a standard
• Chi square for independent variables:
Expected Value = Row total × Column total / Table total
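The row total × column total / table total rule can be sketched in Python as follows. The 2 × 2 counts are made up purely for illustration and the function name is hypothetical.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in column_totals] for r in row_totals]


# Illustrative counts only (not taken from the slides).
table = [[10, 20],
         [30, 40]]
print(expected_counts(table))
```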

State and interpret results
• See whether the value of chi square is more
than or less than the critical value

If the value of chi square is less than the critical
value, we fail to reject the null hypothesis

If the value of chi square is more than the
critical value, the null hypothesis can be
rejected

Chi square test
• Goodness of fit
• For homogeneity of Proportions
• For 2 independent groups
– Cohort Study
– Case control study
– Matched case control Study

• For > 2 independent groups

Goodness of fit
Q. How "close" are the observed values to those
which would be expected in a study?
• Expected frequency can be based on
– theory
– previous experience
OR
– comparison groups
Q. Does a variable have a frequency distribution
comparable to the one expected?

Chi-square goodness of fit test

Goodness of fit
• A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
• It is a test of the agreement or conformity
between the observed frequencies (Oi) and the
expected frequencies (Ei) for several classes or
categories (i)

Example: Is Sudden Infant Death Syndrome (SIDS) seasonal?
Null hypothesis: The proportion of deaths due to SIDS in winter, summer, autumn and spring
is equal = ¼ = 25%
Alternative: Not all probabilities stated in the null hypothesis are correct
SIDS cases    Observed    Expected = 322 × 1/4
Summer        78          80.5
Spring        71          80.5
Autumn        87          80.5
Winter        86          80.5
Total         322

Degree of freedom = k − 1 = 4 − 1 = 3
For α = 0.05 and df = 3, critical value χ² = 7.81
χ² = (78 − 80.5)²/80.5 + (71 − 80.5)²/80.5 + (87 − 80.5)²/80.5 + (86 − 80.5)²/80.5 = 2.10

Conclusion: As the calculated χ² value is less than the
critical value, we fail to reject the null hypothesis
and state that deaths due to SIDS across seasons
are not statistically different from what is
expected by chance (i.e. all seasons being equal)
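The SIDS calculation can be reproduced with a few lines of Python. This is a minimal sketch: the function name is illustrative, and the critical value 7.81 is typed in from the table rather than computed.

```python
def chi_square_goodness_of_fit(observed, expected_proportions):
    """Return the chi-square statistic and the expected counts."""
    total = sum(observed)
    expected = [total * p for p in expected_proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


observed = [78, 71, 87, 86]   # summer, spring, autumn, winter
statistic, expected = chi_square_goodness_of_fit(observed, [0.25] * 4)

critical_value = 7.81         # alpha = 0.05, df = 3 (from the table)
print(f"chi-square = {statistic:.2f}")
print("reject H0" if statistic > critical_value else "do not reject H0")
```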

Homogeneity of proportions
• In a chi-square test for homogeneity of
proportions, we test the claim that different
populations have the same proportion of
individuals with some characteristic.
EXAMPLE: Is there evidence to indicate that the perception of the effects of vaccination
was the same in 2013 as in 2000?
Q. What is the effect of vaccination on health?
Answers: Good, No, Bad

Null hypothesis: H0 = No difference between the two populations
H1 = There is a difference between the two populations

State alpha = 0.05; find df = (3 − 1)(2 − 1) = 2
Critical value χ² = 5.991

[Figure: chi square distribution, critical value χ² = 5.991]

Observed
                2000    2013    Row total
Good effect     656     726     1382
No effect       283     222     505
Bad effect      50      50      100
Column total    989     998     1987

Expected frequency
                2000                          2013
Good effect     (989)(1382)/1987 = 687.87     (998)(1382)/1987 = 694.13
No effect       (989)(505)/1987 = 251.36      (998)(505)/1987 = 253.64
Bad effect      (989)(100)/1987 = 49.77       (998)(100)/1987 = 50.23

Homogeneity of proportions
• χ² value = ∑ (O − E)²/E
Calculated χ² = 10.871
Results: As 10.871 > 5.991, we reject the null
hypothesis at the 0.05 significance level.
> There is a statistically significant difference in
the perception of the effects of vaccination between
2000 and 2013
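The vaccination example can be checked with a short Python sketch that works for any r × c table. The function name is illustrative. Because the expected counts are not rounded here, the statistic comes out at about 10.87 and may differ from the slide in the third decimal place.

```python
def chi_square_contingency(observed):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    table_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * column_totals[j] / table_total  # expected count
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: good, no, bad effect; columns: 2000, 2013.
observed = [[656, 726],
            [283, 222],
            [50, 50]]
statistic, df = chi_square_contingency(observed)
print(f"chi-square = {statistic:.2f}, df = {df}, reject H0: {statistic > 5.991}")
```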

Chi square test
• Goodness of fit
• For homogeneity of Proportions
• For 2 independent groups
– Cohort Study
– case control study
– Matched case control Study

• For > 2 independent groups

Chi square Independence test

• It is used to find out whether there is an
association between a row variable and column
variable in a contingency table constructed
from sample data.

Assumption
• The observations should be independent.
• All expected frequencies are greater than or
equal to 1 (i.e., E ≥ 1)
• No more than 20% of the expected frequencies
are less than 5

Calculated as
χ2 value = ∑ (O-E)2/E

Expected Count
                     Disease present    Disease neg.    Total
Exposure present     a                  b               a+b
Exposure negative    c                  d               c+d
Total                a+c                b+d             tt

For cell a:
Marginal probability (row) = (a+b)/tt
Marginal probability (column) = (a+c)/tt
Joint probability = (a+b)/tt × (a+c)/tt
Expected count = sample size (tt) × (a+b)/tt × (a+c)/tt

Short cut of Chi Square
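Observed values: a = 37, b = 13, c = 17, d = 53 (n = 120)
Expected values: 22.5, 27.5, 31.5, 38.5

=> (37 − 22.5)²/22.5 + (13 − 27.5)²/27.5 + (17 − 31.5)²/31.5 + (53 − 38.5)²/38.5 = 29.1

Short cut: χ² = n(ad − bc)² / [(a+b)(c+d)(a+c)(b+d)]
= 120[(37)(53) − (13)(17)]² / [(54)(66)(50)(70)]
= 29.1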
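A small Python sketch can confirm that the shortcut and the cell-by-cell calculation agree for a 2 × 2 table. The function names are illustrative; the cell labels a, b, c, d follow the 2 × 2 layout used for the expected count.

```python
def chi_square_2x2_shortcut(a, b, c, d):
    """Shortcut: n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_2x2_long(a, b, c, d):
    """Cell-by-cell sum of (O - E)^2 / E for the same 2 x 2 table."""
    n = a + b + c + d
    rows, columns = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    return sum(
        (observed[i][j] - rows[i] * columns[j] / n) ** 2 / (rows[i] * columns[j] / n)
        for i in range(2)
        for j in range(2)
    )


shortcut = chi_square_2x2_shortcut(37, 13, 17, 53)
long_form = chi_square_2x2_long(37, 13, 17, 53)
print(f"shortcut = {shortcut:.1f}, long form = {long_form:.1f}")
```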

Application in various studies
• Cohort study
• Case control study
• Matched case control study

Cohort Study

Assumptions:
• The two samples are independent
• Let a+b = number of people exposed to the risk factor
• Let c+d = number of people not exposed to the risk factor
Assess whether there is association between exposure and disease
by calculating the relative risk (RR)

We can define the relative risk of disease:
p1 = incidence of disease when exposure is present
p2 = incidence of disease when exposure is absent
Relative risk RR = p1/p2
Hence for these studies
RR = (a/(a+b)) / (c/(c+d))

Example: To test the association in a cohort study between smoking and lung CA
Null hypothesis H0: No association between smoking and lung CA (RR = 1)
H1: Association present between smoking and lung CA

Smoking    Lung CA present    Lung CA absent    Total
YES        84                 2914              3000
NO         87                 4913              5000
TOTAL      171                7827              8000

RR = (84/3000)/(87/5000) = 1.61

We can test the hypothesis that RR = 1 by calculating the
chi-square test statistic
Alpha value = 0.05 and df = 1
CONCLUSION: As χ² > 3.84, we reject the null hypothesis
of RR = 1 at 0.05 significance.
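The cohort example can be sketched in Python as below. One assumption is made loudly: the non-diseased counts are derived as row total minus cases (3000 − 84 and 5000 − 87) instead of being read from the "absent" column. The function names are illustrative. With these inputs the ratio works out to about 1.61.

```python
def relative_risk(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    """RR = p1 / p2: incidence in the exposed divided by incidence in the unexposed."""
    p1 = cases_exposed / total_exposed
    p2 = cases_unexposed / total_unexposed
    return p1 / p2


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (df = 1) for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


exposed_total, unexposed_total = 3000, 5000
a, c = 84, 87              # lung CA cases among smokers / non-smokers
b = exposed_total - a      # assumption: non-cases derived from the row total
d = unexposed_total - c    # assumption: non-cases derived from the row total

rr = relative_risk(a, exposed_total, c, unexposed_total)
chi2 = chi_square_2x2(a, b, c, d)
print(f"RR = {rr:.2f}, chi-square = {chi2:.2f}, reject H0: {chi2 > 3.84}")
```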

Case control study

Assumptions
• The samples are independent
• Cases = diseased individuals = a+c
• Controls = non-diseased individuals = b+d
Assess whether there is association between exposure and
disease by calculating the odds ratio (OR)

Example: To test the association in a case control study between CHD and
smoking
Null hypothesis Ho: No association between CHD and smoking(OR=1)
H1= Association exists between CHD and Smoking(OR>1 or<1)

• Odds ratio = odds of exposure amongst the
diseased group / odds of exposure amongst the
non-diseased group
• Odds of exposure amongst diseased =
(a/(a+c))/(c/(a+c)) = a/c
• Odds of exposure amongst non-diseased =
(b/(b+d))/(d/(b+d)) = b/d
• Odds ratio = ad/bc
• Odds ratio = (112 × 224)/(88 × 176) = 1.62
We can test whether OR = 1 by calculating the
chi-square statistic
Alpha value = 0.05 and df = 1

Conclusion: We reject the null
hypothesis that the odds ratio = 1 at
0.05 significance as χ² > 3.84
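The odds ratio and its chi-square test can be sketched in Python. The slides give only the products 112 × 224 and 88 × 176, so assigning b = 88 and c = 176 is an assumption; swapping them changes neither the odds ratio nor the statistic. The function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """OR = ad / bc for a 2 x 2 table laid out as [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (df = 1) using the 2 x 2 shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# a, d come from the numerator and b, c from the denominator of the slide's
# odds ratio; which count is b and which is c is an assumption.
a, b, c, d = 112, 88, 176, 224

oratio = odds_ratio(a, b, c, d)
chi2 = chi_square_2x2(a, b, c, d)
print(f"OR = {oratio:.2f}, chi-square = {chi2:.2f}, reject H0: {chi2 > 3.84}")
```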

Matched case control study

• Case-control pairs are matched on characteristics such as age, race, sex
Assumptions
• Samples are not independent
• The discordant pairs are case-control pairs with different exposure histories
• The matched odds ratio is estimated by bb/cc
Pairs in which cases exposed but controls not = bb
Pairs in which controls exposed but cases not = cc
Assess whether there is association between exposure and disease by calculating
the matched odds ratio (OR)

Example: To test the association of smoking exposure and CHD in a matched case control
study
Null hypothesis: No association of smoking exposure and CHD (OR = 1)
Alternative hypothesis: Association exists between smoking exposure and CHD (OR > 1 or < 1)

                                          CHD absent
                                          Smoking history present    Smoking history absent
CHD present, smoking history present      20                         40 (bb)
CHD present, smoking history absent       10 (cc)                    30

• Test whether OR = 1 by calculating McNemar's statistic
Alpha value = 0.05 and df = 1

OR = 40/10 = 4

χ² = (|40 − 10| − 1)²/(40 + 10) = 841/50 = 16.82
Conclusion: We reject the null hypothesis
that OR = 1 as the calculated χ² > 3.84
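McNemar's statistic uses only the discordant pairs, which a short Python sketch makes explicit. The function names are illustrative, and the statistic is the continuity-corrected form used in the worked example.

```python
def matched_odds_ratio(bb, cc):
    """Matched OR: pairs with only the case exposed / pairs with only the control exposed."""
    return bb / cc


def mcnemar_statistic(bb, cc):
    """Continuity-corrected McNemar statistic: (|bb - cc| - 1)^2 / (bb + cc), df = 1."""
    return (abs(bb - cc) - 1) ** 2 / (bb + cc)


bb, cc = 40, 10   # discordant pairs from the smoking / CHD table
oratio = matched_odds_ratio(bb, cc)
chi2 = mcnemar_statistic(bb, cc)
print(f"OR = {oratio:.0f}, chi-square = {chi2:.2f}, reject H0: {chi2 > 3.84}")
```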

Chi square for > 2 independent
variables
• The chi-square test is used regardless of
whether the research question is framed in terms of
proportions or frequencies
• Contingency tables can have any number of
rows and columns.
• The sample size needs to increase as the
number of categories increases to keep the
expected values of an acceptable size.

Limitation of Chi square test
• Conditions under which the chi-square approximation is
adequate:
– No expected frequency should be < 2
– No more than 20% of the cells should have an
expected frequency < 5

Question: What to do when these assumptions are not met?
Fisher Exact test

Fisher Exact test
• Gives the exact probability of the occurrence of
the observed frequencies
• Fisher's exact test is especially appropriate
with
– small sample sizes (total number of cases is < 20)
or
– if the expected number of cases in any cell is < 2
or
– if more than 20% of the cells have expected
frequencies < 5

[Portrait: Ronald A. Fisher (1890–1962)]
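For a 2 × 2 table with fixed margins, the exact probability of each possible table follows the hypergeometric distribution. The Python sketch below rests on several assumptions: the counts are hypothetical, the function name is illustrative, and the two-sided p value sums every table that is no more probable than the observed one, which is one common convention among several.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        # when all row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    return sum(prob(k) for k in range(low, high + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


# Hypothetical small-sample table (total of 8 subjects), for illustration only.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.4f}")
```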

Continuity correction
• It subtracts ½ from the absolute difference between
observed and expected frequencies in the
numerator of χ² before squaring.
• It makes the value of χ² smaller, so the null
hypothesis is rejected less often, which decreases
type I error.
• In the shortcut formula, n/2 is subtracted from
the absolute value of ad – bc prior to squaring.
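The corrected shortcut formula can be sketched in Python, reusing the 2 × 2 counts 37, 13, 17, 53 from the shortcut example. The function names are illustrative, and the guard that stops the correction from overshooting past zero is an added safeguard, not something stated in the slides.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected shortcut: n * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def chi_square_2x2_corrected(a, b, c, d):
    """Continuity-corrected shortcut: subtract n/2 from |ad - bc| before squaring."""
    n = a + b + c + d
    # Added safeguard (assumption): never let the correction go below zero.
    numerator = max(abs(a * d - b * c) - n / 2, 0)
    return n * numerator ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


plain = chi_square_2x2(37, 13, 17, 53)
corrected = chi_square_2x2_corrected(37, 13, 17, 53)
print(f"uncorrected = {plain:.2f}, corrected = {corrected:.2f}")
```

The corrected value is always the smaller of the two, which is why the correction makes rejection of the null hypothesis less likely.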

Overuse of Chi square
• Chi square is overused when two groups are being analyzed and the
characteristic of interest is measured on a numerical scale.
• Instead of correctly using the t test, researchers convert the
numerical scale to an ordinal or even binary scale and then use
chi-square.
• When numerical variables are analyzed with methods designed
for ordinal or categorical variables, the greater specificity or
detail of the numerical measurement is wasted.
• Categorize a numerical variable, such as age, only after
investigating whether the categories are appropriate.

Take Home Message
• The chi square test is applied to qualitative data, whether
nominal or ordinal.
• Before applying the chi square test, check that all
assumptions are met
• If the value of chi square is large >>> there is a high
probability of rejecting the null hypothesis
• If the value of chi square is small >>> there is less
probability of rejecting the null hypothesis

