4. Learning objectives
1. Explain what prevalence and incidence are.
2. Explain what a summary measure of location is, and
show that you understand the meaning of, and the
difference between, the mode, the median and the
mean.
3. Be able to calculate the mode, median and mean for a
set of values.
4. Explain what a percentile is, and calculate any given
percentile value.
5. Explain what a summary measure of spread is, and show
that you understand the difference between, and can
calculate, the range, the interquartile range and the
standard deviation.
6. Numbers, percentages and proportions
• When you present the results of an investigation, you
will almost certainly need to give the number of
subjects involved, and perhaps also the corresponding
percentages.
• It is usually categorical data that are summarized with
a percentage or proportion.
7. Prevalence and the incidence rate
Where appropriate, we can also summarize data by giving a
value for the prevalence or the incidence rate of some
condition.
• Prevalence of a disease is the number of existing cases
in some population at a given point in time (point
prevalence). In practice, the period prevalence, which
counts cases over an interval such as a year, is more
often used.
• For example, the prevalence of breast cancer in women in
a given place in 2010 was 3.1%. The prevalence figure
includes all existing cases: those who contracted the
disease before 2010 and still had it, as well as those
first getting the disease in 2010.
9. Incidence or inception rate of a disease is the number
of new cases occurring per 1000, or per 10 000, of the
population during a given period, usually 12 months.
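The two definitions above can be sketched as a short calculation. The population size and case counts below are hypothetical, chosen only so that the prevalence comes out at the 3.1% quoted earlier. The function names are also illustrative.

```python
def prevalence_percent(existing_cases, population):
    """Existing cases (old and new) as a percentage of the population."""
    return 100 * existing_cases / population


def incidence_rate(new_cases, population, per=1000):
    """New cases in the period, expressed per `per` people."""
    return per * new_cases / population


# Hypothetical figures for one place over one 12-month period.
population = 50_000
existing_cases = 1_550   # everyone who had the disease during the year
new_cases = 120          # only those first diagnosed during the year

prev = prevalence_percent(existing_cases, population)
inc = incidence_rate(new_cases, population, per=1000)

print(f"Prevalence: {prev:.1f}%")
print(f"Incidence rate: {inc:.1f} per 1000 per year")
```

Note that the new cases are counted in both figures: they are part of the existing cases for prevalence, and they are the only cases counted for incidence.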
10. Summary measures of location
A summary measure of location is a value around which
most of the data values tend to congregate or center.
There are three measures of location:
• Mode
• Median
• Mean
11. Mode
• The mode is that category or value in the data that has
the highest frequency (i.e. occurs the most often). In this
sense, the mode is a measure of common-ness or
typical-ness.
• The mode is not particularly useful with metric
continuous data where no two values may be the same.
The other deficiency of this measure is that there may be
more than one mode in a set of data.
Patient   Number of inhaler uses in last 24 hours
A         5
B         12
C         10
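As a minimal sketch of finding the mode, the snippet below uses Python's standard `statistics` module. The list of inhaler-use counts is hypothetical (the table above shows only three patients), and the second list is made up to illustrate the point that a data set may have more than one mode.

```python
import statistics

# Hypothetical inhaler-use counts for seven patients.
inhaler_uses = [5, 12, 10, 5, 8, 12, 5]

# multimode returns every value that shares the highest frequency.
modes = statistics.multimode(inhaler_uses)
print(modes)

# A hypothetical data set with two modes: 5 and 12 each occur twice.
two_modes = statistics.multimode([5, 12, 10, 5, 12])
print(two_modes)
```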
12. Median
• If we arrange the data in ascending order of size, the
median is the middlemost number in the set. Thus, half
of the values will be equal to or less than the
median value, and half equal to or above it. The
median is thus a measure of central-ness.
• For example, age (in ascending order of years) for 5
individuals: 30 31 32 33 35. The middle value is 32, so
the median age for these 5 people is 32 years.
13. • Another way of determining the value of the median: if you
have n values arranged in ascending order, then the
median = ½(n + 1)th value.
• For example, if the ages of six people are 30 31 32 33 35 36,
then n = 6, therefore:
• ½(n + 1) = ½ × (6 + 1) = ½ × 7 = 3.5
• The median is then the 3.5th value. That is, it is the value
halfway between the 3rd value of 32 and the 4th value of 33,
or 32.5 years.
• An advantage of the median is that it is not much affected
by skewness in the distribution, or by the presence of
outliers. However, it discards a lot of information, because it
ignores most of the values, apart from those in the center
of the distribution.
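The two age examples above can be checked with a short sketch. It uses `statistics.median` from the Python standard library, and the helper `median_position` is an illustrative name for the ½(n + 1) rule.

```python
import statistics


def median_position(n):
    """Position of the median among n ordered values: the ½(n + 1)th value."""
    return (n + 1) / 2


five_ages = [30, 31, 32, 33, 35]
six_ages = [30, 31, 32, 33, 35, 36]

pos_five = median_position(len(five_ages))   # 3.0, i.e. the 3rd value
pos_six = median_position(len(six_ages))     # 3.5, halfway between 3rd and 4th

median_five = statistics.median(five_ages)
median_six = statistics.median(six_ages)

print(pos_five, median_five)
print(pos_six, median_six)
```

With an even number of values, `statistics.median` averages the two middle values, which matches the "halfway between the 3rd and 4th value" reasoning.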
14. Mean
• The mean, or the arithmetic mean to give it its full name,
is more commonly known as the average.
• One advantage of the mean over the median is that it
uses all of the information in the data set.
• However, it is affected by skewness in the distribution,
and by the presence of outliers in the data.
• This may, on occasion, produce a mean that is not very
representative of the general mass of the data.
• Moreover, it cannot be used with ordinal data.
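To illustrate how an outlier pulls the mean but barely moves the median, here is a small sketch. It reuses the five ages from the median example and then adds one hypothetical extreme value of 95 years. That value is an assumption made purely for illustration.

```python
import statistics

ages = [30, 31, 32, 33, 35]
ages_with_outlier = ages + [95]   # hypothetical outlier

mean_before = statistics.mean(ages)
median_before = statistics.median(ages)

mean_after = statistics.mean(ages_with_outlier)
median_after = statistics.median(ages_with_outlier)

print(f"Without outlier: mean = {mean_before:.1f}, median = {median_before}")
print(f"With outlier:    mean = {mean_after:.1f}, median = {median_after}")
```

The mean rises by more than ten years, while the median moves by only half a year.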
15. Percentiles
• A percentile (or a centile) is a measure used in statistics
indicating the value below which a given percentage of
observations in a group of observations fall. For example,
the 20th percentile is the value (or score) below which 20
percent of the observations may be found.
• Percentiles are the values which divide an ordered set of
data into 100 equal-sized groups.
• Notice that this makes the median the 50th percentile,
since it divides the data values into two equal halves, 50
per cent above the median and 50 per cent below.
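One way to calculate a percentile is sketched below. Several conventions are in use and they can give slightly different answers. This sketch assumes the rule that the p-th percentile is the p/100 × (n + 1)th ordered value, with linear interpolation between neighbours, which is the same rule that makes the median the ½(n + 1)th value. The function name is illustrative.

```python
def percentile(values, p):
    """p-th percentile using the p/100 * (n + 1) position rule (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    # 1-based position in the ordered data, kept within the range 1..n.
    rank = min(max(p / 100 * (n + 1), 1), n)
    lower = int(rank)          # whole part of the position
    fraction = rank - lower    # how far to move towards the next value
    if lower >= n:
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


ages = [30, 31, 32, 33, 35, 36]

p25 = percentile(ages, 25)
p50 = percentile(ages, 50)
p75 = percentile(ages, 75)

print(p25, p50, p75)
```

The 50th percentile here equals the median of the same six ages.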
16. Choosing the most appropriate measure
• How do you choose the most appropriate measure of
location for some given set of data?
• The main thing to remember is that the mean cannot be
used with ordinal data (because they are not real
numbers), and that the median can be used for both
ordinal and metric data (particularly when the latter is
skewed).
Type of variable    Summary measure of location
                    Mode    Median                                    Mean
Nominal             Yes     No                                        No
Ordinal             Yes     Yes                                       No
Metric discrete     Yes     Yes, if distribution is markedly skewed   Yes
Metric continuous   No      Yes, if distribution is markedly skewed   Yes
Choosing an appropriate measure of location
17. Summary measures of spread
As well as a summary measure of location, a summary
measure of spread or dispersion can also be very useful.
There are three main measures in common use:
• Range
• Interquartile range
• Standard Deviation
18. Range
• The range is the distance from the smallest value to the
largest. The range is not affected by skewness, but is
sensitive to the addition or removal of an outlier value.
For example, the range of the 30 birth weights is
(2.86 – 4.49 kg).
• The range is best written like this, rather than as the
single-valued difference, i.e. as 1.6 kg, in this example,
which is much less informative.
• The range can sometimes be misleading when there are
extremely high or low values.
19. The interquartile range (iqr)
• One solution to the problem of the sensitivity of the range to
extreme values (outliers) is to remove a quarter (25%) of the
values from each end of the distribution (which removes any
troublesome outliers), and then measure the range of the
remaining values. This distance is called the interquartile
range, or iqr.
• The interquartile range is not affected either by outliers or
skewness, but it does not use all of the information in the
data set since it ignores the bottom and top quarter of
values.
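The range and interquartile range can be sketched as follows. The seven birth weights are hypothetical: only the smallest and largest values (2.86 kg and 4.49 kg) come from the example above. The quartiles are taken from `statistics.quantiles`, whose default method is one of several quartile conventions, so other software may give slightly different values.

```python
import statistics

# Hypothetical birth weights in kg; only the two extremes match the text.
weights_kg = [2.86, 3.10, 3.25, 3.40, 3.55, 3.70, 4.49]

# Range, reported as (smallest - largest) rather than as a single difference.
smallest, largest = min(weights_kg), max(weights_kg)
print(f"Range: ({smallest} - {largest} kg)")

# Quartiles cut the ordered data into four equal-sized groups.
q1, q2, q3 = statistics.quantiles(weights_kg, n=4)
iqr = q3 - q1
print(f"Interquartile range: ({q1} - {q3} kg), width {iqr:.2f} kg")
```

The middle quartile `q2` is the median, and the iqr spans the middle half of the values.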
21. Standard Deviation
The Standard Deviation is a measure of how spread out
numbers are.
Its symbol is σ (the Greek letter sigma).
The formula is easy: it is the square root of the Variance.
So now you ask, "What is the Variance?"
Variance
The Variance is defined as:
The average of the squared differences from the Mean.
22. You and your friends have just measured the heights of
your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm,
170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard
Deviation.
Your first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
23. So the mean (average) height is 394 mm.
[Chart: the dogs' heights with the mean of 394 mm marked]
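24. To calculate the Variance, take each difference, square it,
and then average the result.
First we calculate each dog's difference from the Mean:
600 − 394 = 206, 470 − 394 = 76, 170 − 394 = −224,
430 − 394 = 36, 300 − 394 = −94
Then we square each difference and average the results:
Variance = (206² + 76² + (−224)² + 36² + (−94)²) / 5
= 108,520 / 5 = 21,704
So, the Variance is 21,704.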
25. And the Standard Deviation is just the square root of the
Variance, so:
Standard Deviation: σ = √21,704 = 147.32... ≈ 147
(to the nearest mm)
And the good thing about the Standard Deviation is that
it is useful. Now we can show which heights are within
one Standard Deviation (147 mm) of the Mean, that is,
between 247 mm and 541 mm: the dogs measuring 470 mm,
430 mm and 300 mm.
So, using the Standard Deviation we have a "standard"
way of knowing what is normal, and what is extra large
or extra small.
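The dog-height example can be reproduced with a short sketch using Python's standard `statistics` module. It uses `pvariance` and `pstdev`, which divide by the number of values, matching the definition of the Variance as the average of the squared differences. The variable names are illustrative.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)

# Population variance: the average of the squared differences from the mean.
variance = statistics.pvariance(heights_mm)

# Standard deviation: the square root of the variance.
sd = statistics.pstdev(heights_mm)

# Heights lying within one standard deviation of the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean) <= sd]

print(f"Mean = {mean} mm")
print(f"Variance = {variance}")
print(f"Standard deviation = {sd:.2f} mm")
print(f"Within one SD of the mean: {within_one_sd}")
```

If the five dogs were treated as a sample from a larger population, the usual choice would be `statistics.variance` and `statistics.stdev`, which divide by n − 1 and so give slightly larger values.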
26. • The smaller the average distance of the values from their
mean, the narrower the spread of values must be, and vice versa.
• This idea is the basis for what is known as the standard
deviation, or SD.
28. Type of variable   Summary measure of spread
                       Range   Interquartile range   Standard deviation
    Nominal            No      No                    No
    Ordinal            Yes     Yes                   No
    Metric             Yes     Yes, if skewed        Yes
Choosing an appropriate measure of spread