Chi squared test

PRESENTED BY:
Dhruv J. Patel
M. Pharm., 1st sem. (2017-2018)
1DHRUV J PATEL

• A chi-square test, also written as χ² test, is
any statistical hypothesis test in which the sampling
distribution of the test statistic is a chi-squared
distribution when the null hypothesis is true.
• The chi-square test is one of the most important
tests of significance developed by statisticians.
• It was developed by Karl Pearson in 1900.

• The chi-square test is a non-parametric test: it does
not assume that the variable follows any particular
distribution.
• The chi-squared test is used to determine whether
there is a significant difference between the
expected frequencies and the observed frequencies
in one or more categories.
• The test statistic follows a specific distribution
known as the chi-square distribution.
• In general, the test we use to measure the
differences between what is observed and what is
expected according to an assumed hypothesis is
called the chi-square test.

• The term "non-parametric" refers to the
fact that the chi-square tests do not require
assumptions about population parameters
nor do they test hypotheses about
population parameters.
• Other hypothesis tests, such as the t tests and
ANOVA, are parametric tests: they include
assumptions about parameters and hypotheses
about parameters.

• The most obvious difference between the
chi-square tests and the other hypothesis tests
we have considered (t and ANOVA) is the
nature of the data.
• For chi-square, the data are frequencies
rather than numerical scores.

• 1) PARAMETRIC TEST: A test that uses population
constants such as the mean, standard deviation,
standard error, correlation coefficient or proportion,
and in which the data are assumed to follow an
established distribution such as the normal,
binomial or Poisson.
• 2) NON-PARAMETRIC TEST: A test in which no
population constant is used. The data are not
assumed to follow any specific distribution.
E.g. to classify good, better and best we just allocate
arbitrary numbers or marks to each category.

• 3) HYPOTHESIS: A definite statement about
the population parameters.
• 4) NULL HYPOTHESIS (H0): States that no
association exists between the two cross-tabulated
variables in the population, and therefore that the
variables are statistically independent. E.g. if we
want to compare two methods, A and B, for
superiority, the assumption that both methods are
equally good is the NULL HYPOTHESIS.

• 5) ALTERNATIVE HYPOTHESIS (H1): Proposes that
the two variables are related in the population. If
we assume that, of the two methods, method A is
superior to method B, this assumption is the
ALTERNATIVE HYPOTHESIS.
• 6) DEGREES OF FREEDOM: The extent of
independence (freedom) enjoyed by a given set of
observed frequencies. Suppose we are given a set of
n observed frequencies which are subjected to k
independent constraints (restrictions); then
d.f. = (number of frequencies) – (number of
independent constraints on them) = n – k.
For a contingency table, d.f. = (r – 1)(c – 1), where
r = the number of rows and c = the number of columns.
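
The two degrees-of-freedom rules can be sketched as small helper functions. This is only an illustration; the function names are hypothetical, and the default of one constraint assumes the only restriction is that the expected frequencies add up to the observed total.

```python
def df_goodness_of_fit(n_frequencies: int, n_constraints: int = 1) -> int:
    """d.f. = (number of frequencies) - (number of independent constraints)."""
    return n_frequencies - n_constraints


def df_contingency(rows: int, cols: int) -> int:
    """d.f. = (r - 1)(c - 1) for an r x c contingency table."""
    return (rows - 1) * (cols - 1)


# Four categories with one constraint (totals must match).
print(df_goodness_of_fit(4))
# A table with 3 rows and 2 columns.
print(df_contingency(3, 2))
```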

• 7) CONTINGENCY TABLE: A table prepared by
enumerating qualitative data and entering the
actual frequencies, which represents the occurrence
of two sets of events, is called a contingency table
(Latin con-, together; tangere, to touch). It is also
called an association table.

• This test (as a non-parametric test) is based on
frequencies and not on parameters like the mean
and standard deviation.
• The test is used for testing hypotheses and is
not useful for estimation.
• This test can also be applied to a complex
contingency table with several classes and as such
is a very useful test in research work.
• This test is an important non-parametric test: no
rigid assumptions are necessary about the type of
population, no parameter values are needed, and
relatively little mathematical detail is involved.

1. Although the test is conducted in terms of frequencies, it
can best be viewed conceptually as a test about
proportions.
2. The χ² test is used in testing hypotheses and is not useful for
estimation.
3. The chi-square test can be applied to a complex contingency
table with several classes.
4. The chi-square test has a very useful property, the
‘additive property’. If a number of independent sample studies
are conducted in the same field, the results can be pooled
together: the χ² values can be added, and so can their
degrees of freedom.

1. Quantitative data.
2. One or more categories.
3. Independent observations.
4. Adequate sample size (at least 10).
5. Simple random sample.
6. Data in frequency form.
7. All observations must be used.

1) Goodness of fit of distributions
2) Test of independence of attributes
3) Test of homogeneity

• This is used when you have one independent
variable, and you want to compare an observed
frequency distribution to a theoretical expected
frequency distribution.
• For the example described above, there is a
single independent variable (in this example
“age group”) with a number of different levels
(17-20, 21-30, 31-40, 41-50, 51-60 and over 60).

• The statistical question is: do the frequencies
you actually observe differ from the expected
frequencies by more than chance alone? In this
case, we want to know whether traffic accidents
occur equally frequently in the different age groups
(so that our theoretical frequency distribution
contains the same number of individuals in each
of the age bands).
• This test enables us to see how well an
assumed theoretical distribution (such as the
binomial, Poisson or normal distribution) fits
the observed data.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:
O is the observed frequency, and
E is the expected frequency.
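
As a minimal sketch, the statistic built from O and E (the sum over categories of (O − E)² / E) can be computed as follows. The function name and the three-category counts are hypothetical and serve only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: three categories with equal expected frequencies.
observed = [18, 22, 20]
expected = [20, 20, 20]
stat = chi_square_statistic(observed, expected)
print(round(stat, 3))
```

A larger value means the observed frequencies lie further from the expected ones; the value is then compared with a tabulated critical value for the appropriate degrees of freedom.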

• The chi-square test for goodness-of-fit uses
frequency data from a sample to test
hypotheses about the shape or proportions
of a population.
• Each individual in the sample is classified
into one category on the scale of
measurement.
• The data, called observed frequencies,
simply count how many individuals from
the sample are in each category.

• This test can also be used to test whether the
occurrence of events is uniform, e.g. whether the
admission of patients to a government hospital is
uniform across the days of the week.
• If χ² (calculated) < χ² (tabulated), the null
hypothesis is not rejected, and it can be concluded
that the data are consistent with uniformity in the
occurrence of the events (uniform admission of
patients throughout the week).

• Remember, qualitative data are data collected
on individuals in the form of categories or
names.
• You then count how many of the individuals
have particular qualities. For example, suppose
there is a theory that there is a relationship
between breastfeeding and autism.
• To determine whether there is a relationship,
researchers could record how long each mother
breastfed her child and whether that child was
diagnosed with autism.

• Then you would have a table containing this
information. Now you want to know whether the
two classifications (rows and columns) are
independent of each other.
• Remember, independence says that one event
does not affect another event. Here it means that
having autism is independent of being breastfed.
What you really want is to see if they are not
independent.
• In other words, does one affect the other? In a
hypothesis test, this is your alternative hypothesis,
and the null hypothesis is that they are independent.

• There is a hypothesis test for this, called the
Chi-Square Test for Independence.
• Technically it should be called the Chi-Square
Test for Dependence, but for historical
reasons it is known as the test for
independence. As with previous hypothesis
tests, all the steps are the same except for the
assumptions and the test statistic.

• In probability theory and statistics, the chi-squared
distribution (also chi-square or χ²-distribution)
with k degrees of freedom is the distribution of a
sum of the squares of k independent standard
normal random variables.
• The chi-square distribution is a special case of
the gamma distribution and is one of the most
widely used probability distributions in inferential
statistics, e.g. in hypothesis testing or in the
construction of confidence intervals.
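
The definition as a sum of k squared standard normal variables can be checked with a small simulation. This is a sketch only: the function name, the seed and the number of draws are arbitrary choices, and the comparison relies on the standard facts that a chi-squared distribution with k degrees of freedom has mean k and variance 2k.

```python
import random


def simulate_chi_square(k, n_draws, seed=0):
    """Draw n_draws values, each a sum of k squared standard normal variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n_draws)]


k = 3
draws = simulate_chi_square(k, 20000)
mean = sum(draws) / len(draws)
variance = sum((x - mean) ** 2 for x in draws) / (len(draws) - 1)

print(f"sample mean     = {mean:.2f} (theory: {k})")
print(f"sample variance = {variance:.2f} (theory: {2 * k})")
```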

• When it is being distinguished from the more
general noncentral chi-squared distribution, this
distribution is sometimes called the central
chi-squared distribution.
• The chi-squared distribution is used in the
common chi-squared tests for goodness of fit of an
observed distribution to a theoretical one and for
the independence of two criteria of classification
of qualitative data, and in confidence interval
estimation for a population standard deviation of
a normal distribution from a sample standard
deviation.
• Many other statistical tests also use this distribution,
such as Friedman's analysis of variance by ranks.

Various chi-square tests deal with cases
involving frequency data:
• Enumeration data
• Categorical data
• Qualitative data
• Contingency table

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where,
– O = observed data in each category
– E = expected data in each category, based on
the experimenter’s hypothesis
– Σ = sum of the calculations for each category

1 Draw a chi-square table.
Each column will be whom you voted for, giving us
two columns for Obama and Romney. Each row
will be where you live, giving us three rows:
rural, suburban and urban.

2 Calculate totals for each row and
column.
The purpose of the first column total is to find
out how many votes Obama got from all areas.
Similarly, the purpose of the first row total is
to find out how many rural votes were cast for
either candidate.

3 Calculate probabilities for each
row and column.
These will be the individual probabilities of
voting Obama, voting Romney, living in the
country, etc. For example, the Obama column
total tells us that 54 out of 100 people polled
voted Obama, so the probability of voting Obama
is 0.54.

4 Calculate the joint probabilities of
belonging to each category.
For example, the probability of being rural and an Obama voter is found by
multiplying the probability of voting Obama (0.54) by the probability of
living in the country (0.13): 0.54 × 0.13 = 0.0702. So a person has a 0.0702
chance of being a rural Obama voter.
In doing so, we assume that where you live and whom you voted for are
independent. This assumption, called the null hypothesis, may well be
wrong, and we will test it later by testing the joint probabilities it yielded.

5 Based on these joint
probabilities, how many people do
we expect to belong to each
category?
We multiply the joint probability for
each category by 100, the number of
people. For example, we expect
0.0702 × 100 = 7.02 rural Obama voters.
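
Steps 3 to 5 can be sketched in code as follows. Only the Obama total (54 of 100) and the rural share (0.13) come from the slides; the suburban and urban row totals below are hypothetical placeholders, and the function name is illustrative.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence: P(row) * P(column) * N."""
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must add up to the same N")
    return [[(r / n) * (c / n) * n for c in col_totals] for r in row_totals]


col_totals = [54, 46]       # Obama (from the slides), Romney (100 - 54)
row_totals = [13, 52, 35]   # rural (0.13 x 100); suburban and urban are hypothetical
table = expected_counts(row_totals, col_totals)

for area, row in zip(["rural", "suburban", "urban"], table):
    print(area, [round(e, 2) for e in row])
```

The first cell reproduces the 0.54 × 0.13 × 100 calculation for rural Obama voters.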

6 These expected numbers are based on the
assumption (hypothesis) that whom you voted
for and where you live are independent. We
can test this hypothesis by comparing these
expected numbers with the actual numbers
we have.
First, we need our chi-square value:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Basically, the equation asks that, for
each category, you find the discrepancy
between the observed number and the
expected number, square it, and then
divide it by the expected number.
Finally, add up the figures for each
category.
I got 0.769 as my chi-square value.

7 Look at a chi-square table.
Note that the degrees of freedom in a chi-square
test of independence are (r − 1)(c − 1). In our case,
with 3 rows and 2 columns, we get (3 − 1)(2 − 1) = 2
degrees of freedom.
For a 0.05 level of significance and 2 degrees of
freedom, we get a threshold (minimum) chi-square
value of 5.991. Since our chi-square value of
0.769 is smaller than this minimum, we cannot
reject the null hypothesis that where you
live and whom you voted for are independent.
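
The table lookup for 2 degrees of freedom can be reproduced without a table, because in that special case the upper-tail probability of the chi-squared distribution is simply exp(−x/2). The sketch below relies on that closed form, so it applies to df = 2 only; the function names are illustrative.

```python
import math


def chi2_upper_tail_df2(x):
    """P(chi-square with 2 d.f. > x); closed form valid for df = 2 only."""
    return math.exp(-x / 2.0)


def chi2_critical_df2(alpha):
    """Value exceeded with probability alpha when df = 2."""
    return -2.0 * math.log(alpha)


stat = 0.769                           # chi-square value from the worked example
critical = chi2_critical_df2(0.05)     # threshold at the 0.05 level
p_value = chi2_upper_tail_df2(stat)
reject_null = stat > critical

print(f"critical value = {critical:.3f}")
print(f"p-value        = {p_value:.3f}")
print(f"reject H0      = {reject_null}")
```

Comparing the statistic with the critical value and comparing the p-value with 0.05 are two views of the same decision.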

1. N, the total frequency, should be reasonably
large, say greater than 50.
2. The sample observations should be independent.
This implies that no individual item should be
included twice or more in the sample.
3. The constraints on the cell frequencies, if any,
should be linear (i.e., they should not involve
square and higher powers of the frequencies)
such as ∑fo = ∑fe = N.

4. No theoretical frequency should be small. Small
is a relative term.
• Preferably each theoretical frequency should be
larger than 10, but in any case not less than 5.
• If any theoretical frequency is less than 5, then we
cannot apply the χ² test as such.
• In that case we use the technique of “pooling”,
which consists of adding the frequencies that
are less than 5 to the preceding or succeeding
frequency (frequencies) so that the resulting sum
is at least 5, and adjusting the degrees of
freedom accordingly.
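
One possible way to carry out pooling is sketched below. It is an assumption-laden illustration, not a standard routine: classes are merged left to right with the succeeding class until the pooled expected frequency reaches 5, and any small class left at the end is merged into the preceding one. The counts are hypothetical.

```python
def pool_small_expected(observed, expected, minimum=5.0):
    """Merge adjacent classes until each pooled expected frequency >= minimum."""
    pooled_obs, pooled_exp = [], []
    acc_obs, acc_exp = 0, 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= minimum:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
            acc_obs, acc_exp = 0, 0.0
    if acc_exp > 0:
        if pooled_exp:
            # Leftover small class at the end: add it to the preceding class.
            pooled_obs[-1] += acc_obs
            pooled_exp[-1] += acc_exp
        else:
            pooled_obs.append(acc_obs)
            pooled_exp.append(acc_exp)
    return pooled_obs, pooled_exp


# Hypothetical frequencies; the first and last expected values are below 5.
observed = [2, 6, 14, 20, 9, 3]
expected = [3.1, 8.2, 15.4, 16.8, 7.9, 2.6]
pooled_obs, pooled_exp = pool_small_expected(observed, expected)

# Degrees of freedom are counted on the pooled classes (one constraint: totals match).
df = len(pooled_obs) - 1
print(pooled_obs, [round(e, 1) for e in pooled_exp], df)
```

Here six classes become four, so the degrees of freedom drop from 5 to 3.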

5. The given distribution should not be replaced by
relative frequencies or proportions; the data
should be given in original units.
6. Yates’ correction should be applied in special
circumstances when df = 1 (i.e. in 2 x 2 tables) and
when the cell entries are small.
7. The χ² test is mostly used as a non-directional test
(i.e. we make a two-tailed test).
• However, there may be cases when χ² tests can be
employed in making a one-tailed test.
• In a one-tailed test we look up the critical value under
double the significance level. For example, with df = 1,
the critical value of χ² at the .05 level is 2.706 (the value
written under the .10 level), and the critical value of χ²
at the .01 level is 5.412 (the value written under the
.02 level).
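
The two tabulated values can be checked numerically. With df = 1, χ² is the square of a standard normal variable, so its upper-tail probability has the closed form erfc(√(x/2)). The sketch below uses that identity (valid for df = 1 only) and then halves the tabulated probability to get the one-tailed level; the function name is illustrative.

```python
import math


def chi2_upper_tail_df1(x):
    """P(chi-square with 1 d.f. > x); closed form valid for df = 1 only."""
    return math.erfc(math.sqrt(x / 2.0))


for critical in (2.706, 5.412):
    tabulated = chi2_upper_tail_df1(critical)   # level the value is listed under
    one_tailed = tabulated / 2.0                # level for a directional test
    print(f"chi-square = {critical}: table level = {tabulated:.3f}, "
          f"one-tailed level = {one_tailed:.3f}")
```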

1. Testing the divergence of observed results
from expected results when our expectations
are based on the hypothesis of equal
probability.
2. Chi-square test when expectations are based
on normal distribution.
3. Chi-square test when our expectations are
based on predetermined results.
4. Correction for discontinuity or Yates’
correction in calculating χ².
5. Chi-square test of independence in
contingency tables.

• In many families where the parents could have produced
children of all four blood groups, the total number of
children with each blood group was:
– Blood group A: 26
– Blood group B: 31
– Blood group AB: 39
– Blood group O: 24
– TOTAL: 120
Those numbers are the observed results and have been put
into the O column in this table.

Blood group  Observed (O)  Expected (E)  (O−E)  (O−E)²  (O−E)²/E
A            26
B            31
AB           39
O            24

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

• Now we need to work out the expected results.
• This is done by using the 1:1:1:1 ratio that was
described before.
• We just divide the total number observed by 4.
• So we would expect to have 30 people of each
blood group:
120 / 4 = 30

Blood group  Observed (O)  Expected (E)  (O−E)  (O−E)²  (O−E)²/E
A            26            30
B            31            30
AB           39            30
O            24            30

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

• Then we need to write that in the E column.
• Next we need to do O − E.

Blood group  Observed (O)  Expected (E)  (O−E)          (O−E)²  (O−E)²/E
A            26            30            26 − 30 = −4
B            31            30            31 − 30 = 1
AB           39            30            39 − 30 = 9
O            24            30            24 − 30 = −6

• Then we square all the (O − E) results…
Blood group  Observed (O)  Expected (E)  (O−E)          (O−E)²       (O−E)²/E
A            26            30            26 − 30 = −4   (−4)² = 16
B            31            30            31 − 30 = 1    1² = 1
AB           39            30            39 − 30 = 9    9² = 81
O            24            30            24 − 30 = −6   (−6)² = 36

Blood group  Observed (O)  Expected (E)  (O−E)          (O−E)²       (O−E)²/E
A            26            30            26 − 30 = −4   (−4)² = 16   16/30 = 0.53
B            31            30            31 − 30 = 1    1² = 1       1/30 = 0.03
AB           39            30            39 − 30 = 9    9² = 81      81/30 = 2.7
O            24            30            24 − 30 = −6   (−6)² = 36   36/30 = 1.2

• Now we divide all the (O − E)² results by E.
• Finally, we add up the (O − E)² / E column…

Blood group  Observed (O)  Expected (E)  (O−E)          (O−E)²       (O−E)²/E
A            26            30            26 − 30 = −4   (−4)² = 16   16/30 = 0.53
B            31            30            31 − 30 = 1    1² = 1       1/30 = 0.03
AB           39            30            39 − 30 = 9    9² = 81      81/30 = 2.7
O            24            30            24 − 30 = −6   (−6)² = 36   36/30 = 1.2

$$\chi^2 = \sum \frac{(O - E)^2}{E} = 0.53 + 0.03 + 2.7 + 1.2 = 4.46$$

• So the result is 4.46 (4.47 if the terms are not rounded before adding).
• Then we need to look up this result on a probability table.
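
The whole blood-group calculation can be reproduced in a few lines, as a sketch. The counts and the 1:1:1:1 ratio are from the slides; the critical value 7.815 (df = 3, 0.05 level) is taken from a standard chi-square table and is not stated in the slides.

```python
observed = {"A": 26, "B": 31, "AB": 39, "O": 24}

total = sum(observed.values())
expected = total / len(observed)   # 1:1:1:1 ratio, so 120 / 4 = 30 per group

# (O - E)^2 / E for each blood group, then the sum over the groups.
contributions = {g: (o - expected) ** 2 / expected for g, o in observed.items()}
stat = sum(contributions.values())

df = len(observed) - 1             # 4 categories, 1 constraint (the total)
critical_05 = 7.815                # standard table value for df = 3, alpha = 0.05

for group, value in contributions.items():
    print(f"{group:>2}: {value:.2f}")
print(f"chi-square = {stat:.2f}, df = {df}, exceeds critical value: {stat > critical_05}")
```

Because the statistic is below the 0.05 critical value for 3 degrees of freedom, the observed counts do not differ significantly from the 1:1:1:1 expectation at that level.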

• In statistics, Yates's correction for
continuity (or Yates's chi-squared test) is used in
certain situations when testing for independence in
a contingency table. In some cases, Yates's correction
may adjust too far, and so its current use is limited.
• The following is Yates's corrected version of Pearson's
chi-squared statistic:

$$\chi^2_{\text{Yates}} = \sum \frac{(|O - E| - 0.5)^2}{E}$$
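
Yates's correction subtracts 0.5 from each absolute difference |O − E| before squaring. A minimal sketch for a 2 x 2 table follows; the function name and the cell counts are hypothetical, and the expected counts are computed from the row and column totals under independence.

```python
def chi_square_2x2(table, correction=0.5):
    """Chi-square for a 2 x 2 table; correction=0.5 gives Yates's version."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            stat += (abs(o - e) - correction) ** 2 / e
    return stat


# Hypothetical 2 x 2 table of counts.
table = [[12, 8], [5, 15]]
uncorrected = chi_square_2x2(table, correction=0.0)
corrected = chi_square_2x2(table)
print(f"Pearson chi-square = {uncorrected:.3f}")
print(f"Yates chi-square   = {corrected:.3f}")
```

The corrected value is smaller than the uncorrected one, which is why the correction makes the test more conservative.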

Pure mathematics is, in its
way, the poetry of logical
ideas.
- Albert Einstein
