Lecture of Respected Sir Dr. L.M. BEHERA from N.I.H. KOLKATA in a workshop at G.D.M.H.M.C. - Patna in the Year 2011.
SUBJECT : BIOSTATISTICS
TOPIC : 'INTRODUCTION TO BIOSTATISTICS'.
2. THE WORD
• Latin – status
• Italian – statista
• German – statistik
• French - statistique
3. EMINENT STATISTICIANS
Captain John Graunt of
London (1620-1674)
was known as the
Father of Vital statistics
R. A. Fisher gave it the
status of a full fledged
science
4. TO DEFINE BIO-STATISTICS
The procedure of collection, compilation,
analysis, and interpretation of biological
information / data.
5. INFORMATION / DATA
INFORMATION
• It is the first hand
knowledge regarding
anything
• Information by itself is not conclusive
• Ex- "there are many diabetics in India" is only informative
DATA
• It is sorted or arranged information
• Data help in drawing a conclusion
• Ex- "in India every seventh person is diabetic" is data
6. Information / data
• We are concerned about data and not
information
• To get the data we are required to arrange the
information
• So we are concerned about both
8. METHODS OF DATA COLLECTION
CENSUS
– In this type of data collection every individual unit in the population is recorded
– Time consuming
– Financially not viable
– Often practically impossible
Example- life of an electric bulb: every bulb would have to be burnt out to record its life
SAMPLING METHOD
– Only selected representatives of the population are recorded
– The whole population is not required to be recorded
– Easy to collect
– Practically possible
– A conclusion is drawn by analyzing the sample data
Example- a study of cases of hypertension
9. PREREQUISITES FOR COLLECTING DATA
• Objectives and scope of the inquiry
• Statistical units to be used
• Sources of data / information
• Method of data collection
• Degree of accuracy aimed in the final result
10. CLASSIFICATION OF DATA
QUALITATIVE OR ATTRIBUTE
• It can not be expressed by
numbers
• It is not measurable
• But can be classified
under different categories
• Ex- religion, blood group,
sex
QUANTITATIVE OR VARIABLES
– Continuous variables
– Discrete variables
11. QUANTITATIVE OR VARIABLES contd…
• Continuous variables
• It is expressed in
numbers
• It can be measured
• Ex- body temp., heart
rate, etc
• Discrete variables
• It is countable
• Ex- no. of patients in
hospital, opd, etc.
12. PRIMARY AND SECONDARY DATA
Primary
– Collected directly
– Ex- number of children having defective vision in a school, etc.
Secondary
– Previously collected
– Used by others
– Ex- number of HIV patients, number of still births
13. Sources of HEALTH DATA
1. Census
2. Registration of vital events
3. Sample registration system
4. Notification of disease
5. Hospital records
6. Record linkage
7. Disease registration
8. Epidemiological
surveillance
9. Other health services
records
10. Environmental health data
11. Population surveys
12. Other routine statistics
related to health
13. Health manpower
statistics
14. Non-quantifiable
information
14. SAMPLE – WHAT DO WE MEAN
A PORTION OR PART OF A TOTAL POPULATION
FOLLOWING CERTAIN GUIDELINE
15. SAMPLING TECHNIQUE contd…
• A statistic is a characteristic of a sample
• A parameter is a characteristic of a population
16. SAMPLING TECHNIQUES contd…
SELECTIVE SAMPLING
• Or non-random sampling
• Or purposive sampling
In this process,
sampling is done by choice and not by chance
17. RANDOM SAMPLING
• Sample is chosen according to a guideline
• Whether a given unit enters the sample is decided by the guideline, i.e. by chance and not by choice
• Every unit in the population has an equal opportunity of being selected
• This avoids bias in collecting the sample
• The sample mean is then expected to be close to the population mean
18. TYPES OF RANDOM SAMPLING
• SIMPLE RANDOM
• STRATIFIED RANDOM
• SYSTEMATIC RANDOM
• CLUSTER SAMPLING
• MULTI STAGE SAMPLING
• MULTI PHASE SAMPLING
19. SIMPLE RANDOM
• THE RULE IS THAT EACH UNIT IN THE POPULATION HAS AN EQUAL OPPORTUNITY OF BEING CHOSEN
• EACH UNIT IS NUMBERED, THEN THE REQUIRED NUMBER OF UNITS IS SELECTED AT RANDOM
Example- there are 100 students in a class, each with a roll number; to select ten students, simply draw any ten roll numbers at random
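The classroom example can be tried out in a few lines of Python. This is only a minimal sketch: the seed value is an assumption added so that the run is repeatable, and the variable names are illustrative.

```python
import random

# Roll numbers 1..100, as in the classroom example.
roll_numbers = list(range(1, 101))

# The seed is an assumption, used only to make the run repeatable.
rng = random.Random(2011)

# random.sample draws without replacement, so no student is picked twice
# and every roll number has the same chance of being chosen.
chosen = rng.sample(roll_numbers, k=10)
print(sorted(chosen))
```

Removing the seed gives a different set of ten students on every run.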
20. STRATIFIED RANDOM
• Here the units of a population are first categorised into small groups (strata)
• Then the sample is chosen from each small group
• Example- for the same 100 students, to select ten students the categories can be 10 small groups: roll numbers 1 to 10, 11 to 20, 21 to 30, and so on up to 91 to 100
• Then from each small group one student can be selected at random
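A minimal sketch of this stratified selection, assuming ten strata of ten consecutive roll numbers and one student drawn per stratum. The seed and the names are illustrative assumptions.

```python
import random

rng = random.Random(2011)  # seed is an assumption, for repeatability only

# Ten strata of ten consecutive roll numbers: 1-10, 11-20, ..., 91-100.
strata = [list(range(start, start + 10)) for start in range(1, 101, 10)]

# One student is drawn at random from every stratum.
stratified_sample = [rng.choice(stratum) for stratum in strata]
print(stratified_sample)
```

Unlike simple random sampling, every block of ten roll numbers is guaranteed to be represented.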
21. STRATIFIED RANDOM contd…
• Advantage: increases the precision of the estimate
• Each subgroup is studied as a separate population
22. SYSTEMATIC RANDOM
• This sampling is done following a rule
• Each unit is numbered
• The total population is divided by the number of samples required; the quotient is the sampling interval
• A starting number equal to or less than the sampling interval is selected at random
• The sampling interval is then added repeatedly to that number to get the remaining units of the sample
23. SYSTEMATIC RANDOM contd…
• Example-
• There are 300 persons
• 30 samples are required to be selected
• First number all the persons
• 300/30 = 10 (sampling interval)
• Pick any number equal to or less than ten
• Suppose the number is 7
• Then the sample is 7, 7+10=17, 17+10=27, 27+10=37, etc.
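The rule can be written as a short function. This is a sketch under the assumption that the population size is an exact multiple of the sample size, as in the example; `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(population_size, sample_size, start=None, rng=random):
    """Return the 1-based positions chosen by systematic sampling."""
    interval = population_size // sample_size      # sampling interval
    if start is None:
        start = rng.randint(1, interval)           # random start in 1..interval
    # Add the interval repeatedly to the starting number.
    return [start + i * interval for i in range(sample_size)]

# The slide example: 300 persons, 30 samples, starting number 7.
picked = systematic_sample(300, 30, start=7)
print(picked[:4], "...", picked[-1])   # [7, 17, 27, 37] ... 297
```

Leaving out `start` lets the function pick the starting number at random, which is how the method would normally be applied.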
24. SYSTEMATIC RANDOM contd…
• Merits
– Easy to follow
– Less tedious
– Gives a reasonably accurate result
25. CLUSTER SAMPLING
• If the population is very big in size
• Then divide it into smaller size population
• This should be non-overlapping
• These are called clusters
• Then randomly select some clusters
• Then apply the method of systematic random
sampling to each randomly selected cluster
26. Example of cluster sampling
• The 30 cluster random sampling technique is used by WHO for evaluation of immunization coverage of vaccines.
• It is a study design used to evaluate routine immunization coverage
27. HOW IT IS DONE
• Required sample size is 210 children aged 12 to 23 months
• They come from 30 clusters selected randomly
• 7 children from each cluster (30 x 7 = 210)
• List all the towns, cities etc. with their children aged 12 to 23 months
• Calculate the total number of such children
• Divide by 30 = sampling interval
28. • Select a random number less than or equal to the sampling interval and having the same number of digits as the sampling interval
• The 1st cluster is the first area in the list whose cumulative population equals or exceeds the random number
29. • Random number + sampling interval = 2nd cluster
• 2nd cluster + sampling interval = 3rd cluster
• Likewise, continue until 30 clusters are completed
30. • To select the 1st house in 1st cluster any
random sampling method can be used
31. • In each cluster a door to door study is carried out to find out whether any child aged 12 to 23 months lives there and, if so, whether the child is immunized or not
32. • Contiguous houses are visited one after another
• This study goes on till 7 children in that cluster are obtained
• Likewise all the 30 clusters are completed.
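The cluster-selection steps (sampling interval, random starting number, adding the interval repeatedly) can be sketched in Python. This is an illustration only: the population figures are invented, `select_clusters` is a hypothetical helper, and each target number is matched to the first area whose cumulative population reaches it.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def select_clusters(populations, n_clusters=30, rng=random):
    """Return the index of the area chosen for each cluster.

    populations: list of counts per area (hypothetical data).
    A large area can be chosen more than once.
    """
    cumulative = list(accumulate(populations))
    interval = cumulative[-1] // n_clusters        # sampling interval
    start = rng.randint(1, interval)               # random starting number
    targets = [start + k * interval for k in range(n_clusters)]
    # Each target falls in the first area whose cumulative total reaches it.
    return [bisect_left(cumulative, t) for t in targets]

rng = random.Random(1)                             # seed is an assumption
populations = [rng.randint(200, 5000) for _ in range(120)]  # invented counts
chosen_clusters = select_clusters(populations, 30, rng)
print(chosen_clusters)
```

Because selection runs along the cumulative totals, areas with more children are more likely to contain a cluster.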
33. MULTI STAGE SAMPLING
• In this technique sampling is done in several stages, using the simple random technique at each stage
• Example – the population is very large and the sample size is very small
• To select 1000 persons from India for a study
• Select 10 persons from each state randomly
• Select 5 persons from each capital randomly
34. MULTI PHASE SAMPLING
• In this method the whole sample is examined in the first phase
• In the second phase a part of the sample is examined
• Example- in a pathology lab a sample of blood is divided into several samples, which are then studied in different stages
35. SAMPLE SIZE
• Factors determining the sample size
– Nature of data
– Study type
– Sampling technique
– Intensity of the problem in data
– Level of confidence
– Accuracy or precision
– Error
– One side test or two side test
– Miscellaneous factor
36. PRESENTATION OF STATISTICAL DATA
• Mode of presentation is more valuable than
the gift.
• If data are not presented systematically then they are of no use
• They cannot be comprehended
• In bio-statistics raw data is of no use
37. HOW TO PRESENT DIFFERENT TYPE OF DATA
• There are certain rules guiding data
presentation
• It is not done according to one's own wish and fancy
38. • TABULAR FORM
• Pictogram
• PIE DIAGRAM
• Bar diagram
• HISTOGRAM
• FREQUENCY POLYGON
• Line diagram
• OGIVE
39. TABLES
TWO TYPES OF TABLES ARE GENERALLY USED
– FREQUENCY DISTRIBUTION TABLE FOR
• QUALITATIVE DATA or ATTRIBUTES
• QUANTITATIVE DATA or VARIABLES
40. TABLE FOR
QUALITATIVE DATA OR ATTRIBUTES
• Suppose we want to draw a table for
SCHOOL CHILDREN ON SEX BASIS
• SEX IS NOT MEASURABLE DATA;
IT IS A QUALITATIVE DATA
41. STUDENTS BY SEX IN A SCHOOL
CHARACTERISTICS POPULATION
Boys 73
Girls 71
42. TABLE FOR QUANTITATIVE DATA OR VARIABLES
• Quantitative data need to be categorized
• Otherwise they are not possible to understand
• Example- blood pressure of 5000 persons
43. BLOOD PRESSURE OF 100 PERSONS
AGE GROUP (YEARS)   SYSTOLIC (mmHg)   DIASTOLIC (mmHg)
10 to 20            124               68
20 to 30            120               70
30 to 40            138               72
40 to 50            130               74
44. TABLE FOR QUANTITATIVE DATA OR VARIABLES
• First split the data into small groups
• Then count the number of items under each group
• Groups should be in ascending or descending order
• Group interval minimum
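The grouping steps can be illustrated with a small frequency distribution. The readings below are invented for illustration, and the class width of 10 is an assumed choice, not a value from the lecture.

```python
from collections import Counter

# Hypothetical systolic readings in mmHg, invented for illustration only.
readings = [112, 118, 121, 124, 126, 129, 131, 133, 138, 142, 147, 155]
width = 10  # class interval, an assumed choice

# Map every reading to the lower limit of its group, e.g. 124 -> 120.
counts = Counter((r // width) * width for r in readings)

# Print the groups in ascending order with the number of items in each.
for lower in sorted(counts):
    print(f"{lower}-{lower + width - 1}: {counts[lower]}")
```

Each reading falls in exactly one group, so the group counts add up to the number of readings.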
45. BAR DIAGRAM
• USED TO PRESENT GRAPHICALLY THE
FREQUENCY OF DIFFERENT CATEGORIES OF
QUALITATIVE DATA
• IT CAN BE VERTICAL OR HORIZONTAL
• IN VERTICAL TYPE Y-AXIS – FREQUENCY
• IN HORIZONTAL TYPE X- AXIS - FREQUENCY
56. MEDIAN
• When data is arranged in ascending or
descending order the middle most value is the
median
• Example – 6,7,7,7,8,9,10
• Here there are seven observations arranged in ascending order, so the middle (4th) value, 7, is the median
60. MODE
• The observation which occurs most frequently in a series is called the mode
• Example – 5,6,7,7,7,8,9,10
• Here 7 is the mode
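Both examples, the median series and the mode series, can be checked with Python's standard `statistics` module. A minimal sketch; the variable names are illustrative.

```python
import statistics

median_series = [6, 7, 7, 7, 8, 9, 10]     # series from the median example
mode_series = [5, 6, 7, 7, 7, 8, 9, 10]    # series from the mode example

# median() sorts the data internally and returns the middle value.
median_value = statistics.median(median_series)

# mode() returns the most frequently occurring observation.
mode_value = statistics.mode(mode_series)

print(median_value, mode_value)   # 7 7
```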
61. MEASURES OF DISPERSION
• Consider the heights of a group of people
• They vary from person to person
• But how much they vary?
• Can this variation be measured?
• How?
62. MEASURES OF DISPERSION contd…
• Measures of dispersion help to find out how individual observations are dispersed or scattered around the mean of a large data set.
63. MEASURES OF DISPERSION contd…
• Deviation = Observation - Mean
• Different measures of DISPERSION are
– RANGE
– Mean deviation
– STANDARD DEVIATION
– Variance
– COEFFICIENT OF VARIATION
64. RANGE
• Simply the difference between the highest
and the lowest value
• Example – the hospital stay of 5 patients is 3,4,5,6,7 days
• Then the range is 7-3=4 days
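The same calculation in Python, as a minimal sketch using the five hospital stays from the example:

```python
# Days of hospital stay for the 5 patients in the example.
stay_days = [3, 4, 5, 6, 7]

# Range = highest value - lowest value.
data_range = max(stay_days) - min(stay_days)
print(data_range)   # 4
```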
65. MEAN DEVIATION
• It is the average deviation of observations
from the mean value
66. MEAN DEVIATION contd…
• Example- incubation period of measles in 7 children is 10,9,11,7,8,9,9 days
• Mean deviation:
– No of observations (n) = 7
– Total incubation period = 10+9+11+7+8+9+9 = 63
– Average incubation period = 63/7 = 9
– Next, find by how many days each patient varies from the average incubation period (9 days)
67. MEAN DEVIATION contd…
• 1st patient 10-9 = 1 day (+1)
• 2nd patient 9-9 = 0 day (0)
• 3rd patient 11-9 = 2 days (+2)
• 4th patient 7-9 = -2 days (-2)
• 5th patient 8-9 = -1 day (-1)
• 6th patient 9-9 = 0 day (0)
• 7th patient 9-9 = 0 day (0)
• These are the individual deviations
68. MEAN DEVIATION
• Do not consider the plus or minus sign
• Just add the absolute values of the individual deviations
• 1+0+2+2+1+0+0 = 6
• Mean deviation = 6/7 = 0.86 days
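The mean deviation of the seven incubation periods can be verified step by step. A minimal sketch; the variable names are illustrative.

```python
# Incubation period of measles (days) in the 7 children of the example.
incubation_days = [10, 9, 11, 7, 8, 9, 9]

n = len(incubation_days)
mean = sum(incubation_days) / n                       # 63 / 7 = 9.0

# Ignore the sign of each deviation by taking its absolute value.
abs_deviations = [abs(x - mean) for x in incubation_days]

# Mean deviation = sum of absolute deviations / number of observations.
mean_deviation = sum(abs_deviations) / n              # 6 / 7
print(mean, sum(abs_deviations), round(mean_deviation, 2))   # 9.0 6.0 0.86
```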
70. STANDARD DEVIATION (σ)
• Square root of the arithmetic mean of the
square of the deviations taken from arithmetic
mean
• Simply the root-mean-square deviation
72. • In the same example SD will be
• ∑(x - x¯)² = 1+0+4+4+1+0+0 = 10
• 10/7 = 1.43
• √1.43 = 1.20
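Working from the squared deviations of the seven incubation periods (10, 9, 11, 7, 8, 9, 9 days), the standard deviation can be recomputed in Python. This sketch follows the root-mean-square definition given above, dividing by n, and also shows the n-1 version as a point of comparison.

```python
import math
import statistics

incubation_days = [10, 9, 11, 7, 8, 9, 9]
n = len(incubation_days)
mean = sum(incubation_days) / n                        # 9.0

# Square each deviation from the mean and add them up.
sum_sq = sum((x - mean) ** 2 for x in incubation_days)  # 10.0

# Root-mean-square deviation: divide by n, then take the square root.
sd_population = math.sqrt(sum_sq / n)

# Dividing by n-1 instead gives the sample standard deviation.
sd_sample = statistics.stdev(incubation_days)

print(sum_sq, round(sd_population, 2), round(sd_sample, 2))   # 10.0 1.2 1.29
```

The two versions differ noticeably here because the sample is small; with many observations they become nearly equal.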
73. TEST OF SIGNIFICANCE
• We want to know the effectiveness of a drug
• Then we have to try the drug on some people
• To judge whether it is effective or not we also have to keep another group that is not given the drug
75. TEST OF SIGNIFICANCE contd…
• So we try the drug on a sample group, and then the result is generalised
• To generalise the result we have to make an assumption
• This assumption need not be true all the time
• Such an assumption, which may or may not be true, is called a statistical hypothesis
76. • A statistical hypothesis is a statement about a population parameter
• It can be tested by statistical methods
• These tests are known as tests of significance
77. • In interpreting the statistical data the observer
is interested to know variability of sample in
comparison to the population
• So the observer has to take a decision regarding:
• Whether the sampling was correct or not
• The transparency of the study
• Any error in interpreting the data
78. HYPOTHESIS
• Null Hypothesis (H0)
• ex- the height of urban person is more than rural
person
• This is statement ; may be true or false
• How to know?
• Take sample of urban people and rural people and
measure their height
• If it is true then you have to accept the hypothesis; if false then reject the hypothesis
79. • The hypothesis of no difference, which is the one put to the test, is called the null hypothesis - (H0)
• The hypothesis against which it is tested is called the alternative hypothesis - (H1)
80. LEVEL OF CONFIDENCE
• It is the degree of belief upon the hypothesis
• Usually it is either 95% or 99%
81. Z - test
• The formula
• Z = (x¯ - µ) / SE = (x¯ - µ) / (σ/√n)
• x¯ - sample mean
• SE – standard error
• µ - population mean
• σ – standard deviation of population
82. LET US DO A PROBLEM
• A cigarette company claimed that its cigarette contains less than 15 mg of nicotine per cigarette, claiming it to be a low risk cigarette
• An NGO tested a sample of 200 such cigarettes and found that the average nicotine content is 16.2 mg/cigarette
• And the standard deviation is 3.6 mg/cigarette
• Use the 0.01 level of significance to decide between the two hypotheses
83. • The null hypothesis (H0) = yes it contains less
than 15 mg of nicotine per cigarette; so low
risk cigarette
• i.e. null hypothesis - (H0): µ ≤ 15
84. • Alternative hypothesis - (H1) - No it contains
more than 15 mg nicotine per cigarette; so it is
a risky cigarette
• i.e. alternative hypothesis - (H1): µ > 15
85. SINGLE TAIL TEST
• We are testing the nicotine content; if high
then problem; if low no problem
• So we are doing a one tail test
• But in the case of blood pressure, low is a problem and high is also a problem
• In that case we have to use a two tail test
86. • When σ, the standard deviation of the population, is not available we have to use the standard deviation of the sample, denoted by small (s)
• Now standard error SE = s/√n = 3.6/√200 = 0.255
• Z = (16.2 - 15)/0.255 = 4.7
87. • At the 1% level of significance the critical value of Z for a one tail test is 2.33
• Here Z is 4.7, which is greater than 2.33
• So we reject the null hypothesis, i.e. the data indicate that the nicotine content of such cigarettes is more than 15 mg; hence it is a high risk cigarette
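The whole nicotine problem can be reproduced in a few lines. This is a sketch with two stated assumptions: the sample size n = 200 is taken from the √200 term in the calculation, and alpha = 0.01 matches the 1% level and the critical value of 2.33 quoted in the solution. `NormalDist` comes from the Python standard library (3.8 or later).

```python
import math
from statistics import NormalDist

sample_mean = 16.2      # mg per cigarette, found by the NGO
claimed_mean = 15.0     # mg per cigarette, boundary value under H0
s = 3.6                 # sample standard deviation, used in place of sigma
n = 200                 # sample size (assumed from the sqrt(200) term)
alpha = 0.01            # 1% level of significance, one tail

standard_error = s / math.sqrt(n)
z = (sample_mean - claimed_mean) / standard_error

# Critical value for a right-tailed test at the chosen level.
critical = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z > critical
print(round(standard_error, 3), round(z, 2), round(critical, 2), reject_h0)
# 0.255 4.71 2.33 True
```

Because the test is one-tailed, only a Z value above the critical value leads to rejection of H0.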
88. ‘t’ test
• Why ‘t’ and not any other letter?
• It was devised by W.S. GOSSET, who published a paper on the method under the pen name ‘STUDENT’
• And the last letter of that name became the popular name for the test
89. ‘t’ test contd…
• ‘z’ test is done where the sample size is large
• ‘t’ test is done where the sample size is small
• Or simply speaking, where the sample size is less than 30
90. • Why 30 and not any other number?
• It is observed that when the sample size is large the distribution of the sample mean can be approximated by the normal distribution
• But when the sample size is small (smaller than 30) this approximation is not reliable
• That requires another test, i.e. the ‘t’ test
91. Let us do an example
• A sample comprising 12 persons was chosen from a population
• Their weights in kg were found to be as follows
• 40,45,48,50,55,58,60,60,62,65,68 and 70
• Is the sample drawn from a population with a mean weight of 55 kg?
92. LET US WRITE IN THE LANGUAGE OF STATISTICS
• Null hypothesis – H0 the mean weight of the
population is 55 kg
• Alternative hypothesis H1 – the mean weight
of the population is not 55 kg
93. SYMBOLICALLY
• H0 : µ = 55 kg
• H1 : µ ≠ 55 kg
• Calculation of standard deviation
95. STANDARD DEVIATION
• s = √[ ∑(X - X¯)² / (n-1) ]
• s = √[ 964.25 / (12-1) ] = √(964.25/11) = √87.66 = 9.36
• t = (X¯ - µ) / (s/√n)
Let us put the digits
• X¯ = mean weight = 56.75
• µ = hypothesised mean = 55
• s = 9.36
• n = 12
96. • t = (56.75 - 55) / (9.36/√12) = 1.75 / (9.36/3.46) = 1.75/2.71 = 0.65
Degrees of freedom (D.F.) = n-1 = 12-1 = 11
Table value of t for 11 D.F. = 2.201 at 5% level of significance
Our value is 0.65, which is much lower than the table value of t at 5% level of significance
So we do not reject (i.e. we accept) the null hypothesis H0, i.e. the data are consistent with a mean population weight of 55 kg
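The ‘t’ calculation can be checked from the twelve weights directly. A minimal sketch: the table value 2.201 is simply copied from the text rather than computed, and the variable names are illustrative.

```python
import math
import statistics

weights = [40, 45, 48, 50, 55, 58, 60, 60, 62, 65, 68, 70]   # kg
mu0 = 55                      # hypothesised population mean (H0)
n = len(weights)

mean = statistics.mean(weights)        # sample mean
s = statistics.stdev(weights)          # sample SD with the n-1 denominator

# t = (sample mean - hypothesised mean) / standard error
t = (mean - mu0) / (s / math.sqrt(n))

t_table = 2.201               # table value for 11 D.F. at the 5% level
do_not_reject_h0 = abs(t) < t_table
print(round(mean, 2), round(s, 2), round(t, 2), do_not_reject_h0)
# 56.75 9.36 0.65 True
```

The absolute value of t is compared with the table value because the alternative hypothesis is two-sided (µ ≠ 55).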
97. WHAT DOES IT MEAN ?
• It means that the ‘t’ value, i.e. the difference between the mean of the sample and the hypothesised mean of the population measured in units of standard error, is 0.65, which is lower than the table value for ‘t’ at 5% level of significance
• A ‘t’ value of up to 2.201 is acceptable
98. TO CONCLUDE
• This sample can be regarded as a representative sample of its population