1. 2013/05/22
1
STATISTICS
X-Kit Textbook
Chapter 9
Precalculus Textbook
Appendix B: Concepts in Statistics
Par B.2
CONTENT
THE GOAL
Look at ways of summarising a large
amount of sample data in just one or two
key numbers.
Two important aspects of a set of data:
•The LOCATION
•The SPREAD
MEASURES OF CENTRAL TENDENCY
(LOCATION)
Arithmetic Mean (Average)
Mode (the highest point/frequency)
Median (the middle observation)
Number of fraudulent cheques received at a
bank each week for 30 weeks
Week      1   2   3   4   5   6   7   8   9  10
Cheques   5   3   8   3   3   1  10   4   6   8
Week     11  12  13  14  15  16  17  18  19  20
Cheques   3   5   4   7   6   6   9   3   4   5
Week     21  22  23  24  25  26  27  28  29  30
Cheques   7   9   4   5   8   6   4   4  10   4
ARITHMETIC MEAN
• $\bar{x} = \frac{164}{30} = 5.47$
• To calculate the MEAN, add all the data points
in our sample and divide by the number of
data points (sample size).
• The MEAN can be a value that doesn’t
actually match any observation.
• The MEAN gives us useful information about
the location of our frequency distribution.
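The calculation above can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names (`cheques`, `total`, `n`, `mean`) are chosen here and are not part of the textbook material.

```python
# Weekly counts of fraudulent cheques, weeks 1 to 30, copied from the table.
cheques = [5, 3, 8, 3, 3, 1, 10, 4, 6, 8,
           3, 5, 4, 7, 6, 6, 9, 3, 4, 5,
           7, 9, 4, 5, 8, 6, 4, 4, 10, 4]

total = sum(cheques)   # sum of all data points
n = len(cheques)       # sample size
mean = total / n       # arithmetic mean

print(total, n, round(mean, 2))  # 164 30 5.47
print(mean in cheques)           # False: the mean matches no observation
```

Note that the mean (about 5.47) is not one of the observed weekly counts, which illustrates the second bullet.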
GRAPH
[Histogram of the number of fraudulent cheques per week: values 1 to 10 on the horizontal axis, Frequency (0 to 8) on the vertical axis]
CALCULATE THE MEAN
Raw Data
• $\bar{x} = \frac{\sum x}{n}$
• $x$ is the data points
• $n$ is the number of observations
Frequency Table
• $\bar{x} = \frac{\sum x f}{n}$
• $x$ is the data points
• $n$ is the number of observations
• $f$ is the frequency
Frequency Table (Intervals)
• $\bar{x} = \frac{\sum x f}{n}$
• $x$ is the midpoints of the intervals
• $n$ is the number of observations
• $f$ is the frequency
CALCULATE THE MEAN - FREQUENCY TABLE:
NUMBER OF FRAUDULENT CHEQUES PER WEEK
Distinct Values   Tally Marks   Frequency
 1                /             1
 2                              0
 3                /////         5
 4                ///// //      7
 5                ////          4
 6                ////          4
 7                //            2
 8                ///           3
 9                //            2
10                //            2
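As a rough sketch of the frequency-table formula (sum of x times f, divided by n), the table can be held in a Python dictionary. The name `freq` and the dictionary layout are assumptions made for this illustration.

```python
# Frequency table: distinct value -> frequency (from the table above).
freq = {1: 1, 2: 0, 3: 5, 4: 7, 5: 4, 6: 4, 7: 2, 8: 3, 9: 2, 10: 2}

n = sum(freq.values())                        # number of observations
sum_xf = sum(x * f for x, f in freq.items())  # sum of x * f
mean = sum_xf / n

print(n, sum_xf, round(mean, 2))  # 30 164 5.47
```

The result agrees with the mean calculated from the raw data, as it should for an ungrouped frequency table.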
Truck Data: weights (in tonnes) of 20 fully loaded trucks
Truck      1     2     3     4     5     6     7     8     9    10
Weight  4.54  3.81  4.29  5.16  2.51  4.63  4.75  3.98  5.04  2.80
Truck     11    12    13    14    15    16    17    18    19    20
Weight  2.52  5.88  2.95  3.59  3.87  4.17  3.30  5.48  4.26  3.53
CALCULATE THE MEAN - GROUPED FREQUENCY TABLE:
Truck Data: weights (in tonnes) of 20 fully loaded trucks
Class Intervals    Frequency   Midpoint
2.5 ≤ x ≤ 3.0      4           (2.5 + 3.0) ÷ 2 = 2.75
3.0 < x ≤ 3.5      1           3.25
3.5 < x ≤ 4.0      5           3.75
4.0 < x ≤ 4.5      3           4.25
4.5 < x ≤ 5.0      3           4.75
5.0 < x ≤ 5.5      3           5.25
5.5 < x ≤ 6.0      1           5.75
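The grouped-data mean can be sketched in Python by replacing every observation in a class with the class midpoint. The tuple layout `(lower, upper, frequency)` and the variable names are assumptions made for this illustration; the raw weights are included only to show how close the grouped estimate is.

```python
# Class intervals as (lower limit, upper limit, frequency), from the table.
truck_classes = [(2.5, 3.0, 4), (3.0, 3.5, 1), (3.5, 4.0, 5), (4.0, 4.5, 3),
                 (4.5, 5.0, 3), (5.0, 5.5, 3), (5.5, 6.0, 1)]

n = sum(f for _, _, f in truck_classes)
# Each class contributes midpoint * frequency to the sum.
sum_xf = sum((lower + upper) / 2 * f for lower, upper, f in truck_classes)
grouped_mean = sum_xf / n

# Raw weights of the 20 trucks, for comparison.
weights = [4.54, 3.81, 4.29, 5.16, 2.51, 4.63, 4.75, 3.98, 5.04, 2.80,
           2.52, 5.88, 2.95, 3.59, 3.87, 4.17, 3.30, 5.48, 4.26, 3.53]
raw_mean = sum(weights) / len(weights)

print(n, sum_xf, round(grouped_mean, 3))  # 20 81.5 4.075
print(round(raw_mean, 3))                 # 4.053
```

The grouped mean is only an approximation, because the midpoints stand in for the actual weights inside each class.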
MODE
•The mode is the value (or, for grouped data,
the class interval) with the HIGHEST FREQUENCY.
•There can be two or more modes in a set
of data – then the mode would not be a
good measure of central tendency.
•MULTI-MODAL data have more than
one mode.
•UNI-MODAL data have only one
mode.
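A short Python sketch of finding the mode or modes, using `statistics.multimode` from the standard library (Python 3.8 or later). The second data set, `bimodal_example`, is a made-up illustration and does not come from the textbook.

```python
from statistics import multimode

# Fraudulent cheques per week (weeks 1 to 30).
cheques = [5, 3, 8, 3, 3, 1, 10, 4, 6, 8,
           3, 5, 4, 7, 6, 6, 9, 3, 4, 5,
           7, 9, 4, 5, 8, 6, 4, 4, 10, 4]

modes = multimode(cheques)  # list of all values with the highest frequency
print(modes)                # [4]  -> uni-modal

# Hypothetical data, invented only to show a multi-modal result.
bimodal_example = [1, 1, 2, 3, 3]
print(multimode(bimodal_example))  # [1, 3]  -> two modes
```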
DON’T FALL INTO THE COMMON TRAP
• The median is NOT the middle of the range of
observations. For example:
1, 1, 1, 1, 1, 3, 9
The median is 1 (the middle observation).
The middle of the range is (1 + 9) ÷ 2 = 5. Big
difference!
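The difference between the two quantities can be checked with a few lines of Python. The names `middle_observation` and `mid_range` are chosen here for illustration.

```python
from statistics import median

data = [1, 1, 1, 1, 1, 3, 9]

middle_observation = median(data)        # the 4th of 7 sorted observations
mid_range = (min(data) + max(data)) / 2  # halfway between smallest and largest

print(middle_observation, mid_range)  # 1 5.0
```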
MEDIAN
Odd Number of Observations, for example 7
• Median Position: $\frac{n+1}{2}$
Even Number of Observations, for example 30
• Median Position: half-way between $\frac{n}{2}$ and $\left(\frac{n}{2} + 1\right)$
FIND THE MEDIAN - FREQUENCY TABLE:
NUMBER OF FRAUDULENT CHEQUES PER WEEK
Distinct Values   Frequency   Cumulative Frequency
 1                1            1
 2                0            1
 3                5            6
 4                7           13
 5                4           17
 6                4           21
 7                2           23
 8                3           26
 9                2           28
10                2           30
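One way to read the median off a cumulative frequency table is sketched below in Python. The helper `value_at` is a hypothetical name introduced for this example; it returns the observation at a given position in the sorted data by walking down the cumulative frequencies.

```python
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
freqs = [1, 0, 5, 7, 4, 4, 2, 3, 2, 2]
cum = list(accumulate(freqs))  # [1, 1, 6, 13, 17, 21, 23, 26, 28, 30]

def value_at(position):
    """Return the observation at a 1-based position in the sorted data."""
    for value, c in zip(values, cum):
        if position <= c:  # first row whose cumulative frequency reaches it
            return value
    raise ValueError("position exceeds the number of observations")

n = cum[-1]
if n % 2 == 1:
    median = value_at((n + 1) // 2)  # odd n: the middle observation
else:
    # even n: half-way between observations n/2 and n/2 + 1
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

print(n, value_at(15), value_at(16), median)  # 30 5 5 5.0
```

With n = 30 the median lies half-way between the 15th and 16th observations; both fall in the row with cumulative frequency 17, so the median is 5.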
FIND THE MEDIAN - GROUPED FREQUENCY TABLE:
Truck Data: weights (in tonnes) of 20 fully loaded trucks
Class Intervals    Frequency   Midpoint
2.5 ≤ x ≤ 3.0      4           (2.5 + 3.0) ÷ 2 = 2.75
3.0 < x ≤ 3.5      1           3.25
3.5 < x ≤ 4.0      5           3.75
4.0 < x ≤ 4.5      3           4.25
4.5 < x ≤ 5.0      3           4.75
5.0 < x ≤ 5.5      3           5.25
5.5 < x ≤ 6.0      1           5.75
FIND THE MEDIAN FROM A GROUPED
FREQUENCY TABLE
•Which observation is the median (the middle
observation)?
•Find the class interval in which that
observation lies.
CALCULATIONS
• Raw Data: Mean, Mode, Median
• Frequency Table (Ungrouped Data): Mean, Mode, Median
• Frequency Table (Grouped Data): Mean, Mode, Median
HOW TO CHOOSE THE BEST MEASURE OF
LOCATION?
• When choosing the best measure of location, we
need to look at the SHAPE of the distribution.
• For nearly symmetric data, the mean is the best
choice.
• For very skewed (asymmetric) data, the mode or
median is better.
• The mean moves further along the tail than the
median because it is more sensitive to values far
from the centre.
SYMMETRIC histogram:
Mean = Median = Mode
A POSITIVELY SKEWED (skewed to the right)
histogram has a longer tail on the right side:
Mode < Median < Mean
A NEGATIVELY SKEWED (skewed to the left)
histogram has a longer tail on the left side:
Mean < Median < Mode
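As an illustration, the three measures can be compared for the fraudulent cheques data using Python's standard `statistics` module. This is a sketch only, and the ordering rule is a rule of thumb for the shape of a distribution rather than a strict test.

```python
from statistics import mean, median, mode

# Fraudulent cheques per week (weeks 1 to 30).
cheques = [5, 3, 8, 3, 3, 1, 10, 4, 6, 8,
           3, 5, 4, 7, 6, 6, 9, 3, 4, 5,
           7, 9, 4, 5, 8, 6, 4, 4, 10, 4]

m_mode, m_median, m_mean = mode(cheques), median(cheques), mean(cheques)

print(m_mode, m_median, round(m_mean, 2))  # 4 5.0 5.47
print(m_mode < m_median < m_mean)          # True
```

Here Mode < Median < Mean, the ordering described for a positively skewed histogram, which is consistent with the longer right-hand tail of the cheques data (a few weeks with 9 or 10 cheques).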
PROBLEM
•We can find two very different data sets (one
distribution very spread out and another very
concentrated) with EQUAL measures of central
tendency.
•To get a true idea of our sample, we have to
MEASURE THE SPREAD OF A DISTRIBUTION,
also called its dispersion.
MEASURES OF SPREAD (DISPERSION)
Interquartile Range
Variance
Standard Deviation
MEASURING SPREAD
•Think of a distribution in terms of
percentages, a horizontal axis equally divided
into 100 percentiles.
•The 10th percentile marks the point below
which 10% of the observations fall, and
above which 90% of observations fall.
•The 50th percentile, below which 50% of the
observations lie, is the median.
WORKING WITH A PERCENTILE
• $p\%$ of the observations fall below the $p$th percentile.
$\text{Position} = \frac{p}{100}(n + 1)$
• Working with the example on fraudulent cheques (sorted):
1, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6,
7, 7, 8, 8, 8, 9, 9, 10, 10
$\text{Position of } P_{50} = \frac{50}{100}(30 + 1) = 15.5$
• 15.5 tells us where to find our 50th percentile.
• 15 tells us which observation to go to, and 0.5 tells us how far to
move along the space between that observation and the next
highest one.
FORMULA
• $P_{50} = x_{15} + 0.5\,(x_{16} - x_{15})$
• $P_p = x_k + d\,(x_{k+1} - x_k)$
• $P$ means percentile
• $p$ tells us which percentile
• $k$ is the whole number calculated from the
position
• $d$ is the decimal fraction calculated from the
position
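The position rule and the interpolation formula can be sketched as a small Python function. The function name `percentile` is chosen here, and the handling of positions that fall before the first or after the last observation is an assumption of this sketch, since the slides do not cover that case.

```python
def percentile(sorted_data, p):
    """p-th percentile using Position = (p / 100) * (n + 1)."""
    n = len(sorted_data)
    position = p / 100 * (n + 1)
    k = int(position)   # whole-number part: which observation to go to
    d = position - k    # decimal part: how far towards the next one
    # Assumption: clamp to the smallest or largest observation when the
    # position falls outside the data.
    if k < 1:
        return sorted_data[0]
    if k >= n:
        return sorted_data[-1]
    x_k, x_next = sorted_data[k - 1], sorted_data[k]  # x_k and x_(k+1)
    return x_k + d * (x_next - x_k)

# Sorted fraudulent cheques data.
cheques_sorted = [1, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5,
                  6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10]

print(percentile(cheques_sorted, 50))  # 5.0  (x_15 = x_16 = 5)
print(percentile(cheques_sorted, 25))  # 4.0  (position 7.75)
print(percentile(cheques_sorted, 75))  # 7.25 (position 23.25)
```

For the 75th percentile the position is 23.25, so k = 23 and d = 0.25: the result is x_23 + 0.25 (x_24 − x_23) = 7 + 0.25 (8 − 7) = 7.25.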
WORKING WITH PERCENTILES FROM UNGROUPED FREQUENCY DATA:
NUMBER OF FRAUDULENT CHEQUES PER WEEK
Distinct Values   Frequency   Cumulative Frequency
 1                1           1
 2                0           1 + 0 = 1
 3                5           1 + 5 = 6
 4                7           6 + 7 = 13
 5                4           13 + 4 = 17
 6                4           17 + 4 = 21
 7                2           21 + 2 = 23
 8                3           23 + 3 = 26
 9                2           26 + 2 = 28
10                2           28 + 2 = 30
WORKING WITH PERCENTILES (AND
MEDIAN) FROM GROUPED DATA
• To identify the class interval $L < x \le U$ containing the
$p$th percentile:
$\text{Position} = \frac{p}{100}(n + 1)$
• The decimal fraction for grouped data is:
$d = \frac{\text{Position} - \text{sum of class frequencies up to } L}{\text{frequency of class } L < x \le U}$
• Calculate the $p$th percentile:
$P_p \approx L + d\,(U - L)$
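The three steps above can be sketched as one Python function. The name `grouped_percentile`, the `(lower, upper, frequency)` tuple layout, and the fallback for a position beyond the last observation are assumptions of this sketch, not part of the textbook method.

```python
def grouped_percentile(classes, p):
    """Estimate the p-th percentile from (lower, upper, frequency) classes."""
    n = sum(f for _, _, f in classes)
    position = p / 100 * (n + 1)
    cumulative = 0  # sum of class frequencies up to the current lower limit L
    for lower, upper, f in classes:
        if f > 0 and position <= cumulative + f:
            d = (position - cumulative) / f    # decimal fraction
            return lower + d * (upper - lower)  # L + d (U - L)
        cumulative += f
    # Assumption: a position beyond the last observation returns the
    # upper limit of the last class.
    return classes[-1][1]

# Truck weights (in tonnes) grouped into class intervals.
truck_classes = [(2.5, 3.0, 4), (3.0, 3.5, 1), (3.5, 4.0, 5), (4.0, 4.5, 3),
                 (4.5, 5.0, 3), (5.0, 5.5, 3), (5.5, 6.0, 1)]

print(round(grouped_percentile(truck_classes, 50), 5))  # 4.08333
```

The same function can be called with p = 25 or p = 75 to estimate the quartiles of the truck data.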
FIND THE MEDIAN - GROUPED FREQUENCY TABLE:
Truck Data: weights (in tonnes) of 20 fully loaded trucks
Class Intervals    Frequency   Cumulative Frequency
2.5 ≤ x ≤ 3.0      4            4
3.0 < x ≤ 3.5      1            5
3.5 < x ≤ 4.0      5           10
4.0 < x ≤ 4.5      3           13
4.5 < x ≤ 5.0      3           16
5.0 < x ≤ 5.5      3           19
5.5 < x ≤ 6.0      1           20
FIND THE MEDIAN - GROUPED FREQUENCY TABLE:
Truck Data: weights (in tonnes) of 20 fully loaded trucks
• To identify the class interval $4.0 < x \le 4.5$ containing
the 50th percentile:
$\text{Position} = \frac{50}{100}(20 + 1) = 10.5$
• The decimal fraction for grouped data is:
$d = \frac{10.5 - 10}{3} = \frac{1}{6}$
• Calculate the 50th percentile:
$P_{50} \approx 4.0 + d\,(4.5 - 4.0) = 4.08333$
MEASURING SPREAD
• If we measure the DIFFERENCE in value between
one percentile and another, this would give us an
idea of how widely our data is spread out.
• INTERQUARTILE RANGE (IQR) = 75th – 25th Percentiles
• The bigger the IQR, the more spread out the data.
• The 75th percentile ≥ 25th percentile, therefore the
IQR ≥ 0.
• We tend to use the MEDIAN (as measure of
central tendency) together with the IQR.
FIND THE IQR - GROUPED FREQUENCY TABLE:
Truck Data: weights (in tonnes) of 20 fully loaded trucks
Class Intervals    Frequency   Cumulative Frequency
2.5 ≤ x ≤ 3.0      4            4
3.0 < x ≤ 3.5      1            5
3.5 < x ≤ 4.0      5           10
4.0 < x ≤ 4.5      3           13
4.5 < x ≤ 5.0      3           16
5.0 < x ≤ 5.5      3           19
5.5 < x ≤ 6.0      1           20
FIND THE IQR - GROUPED FREQUENCY TABLE (75th PERCENTILE):
Truck Data: weights (in tonnes) of 20 fully loaded trucks
• To identify the class interval $4.5 < x \le 5.0$ containing
the 75th percentile:
$\text{Position} = \frac{75}{100}(20 + 1) = 15.75$
• The decimal fraction for grouped data is:
$d = \frac{15.75 - 13}{3} = 0.917$
• Calculate the 75th percentile:
$P_{75} \approx 4.5 + d\,(5.0 - 4.5) = 4.958$
FIND THE IQR - GROUPED FREQUENCY TABLE (25th PERCENTILE):
Truck Data: weights (in tonnes) of 20 fully loaded trucks
• To identify the class interval $3.5 < x \le 4.0$ containing
the 25th percentile:
$\text{Position} = \frac{25}{100}(20 + 1) = 5.25$
• The decimal fraction for grouped data is:
$d = \frac{5.25 - 5}{5} = 0.05$
• Calculate the 25th percentile:
$P_{25} \approx 3.5 + d\,(4.0 - 3.5) = 3.525$
• IQR = $P_{75} - P_{25}$ = 4.958 – 3.525 = 1.433
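As a cross-check on the grouped estimate, the quartiles can also be computed from the 20 raw truck weights. The sketch below uses `statistics.quantiles` from Python's standard library (Python 3.8 or later); its `"exclusive"` method is understood to use the same (n + 1) position rule as these slides. The variable names are chosen for illustration.

```python
from statistics import quantiles

# Raw weights (in tonnes) of the 20 fully loaded trucks.
weights = [4.54, 3.81, 4.29, 5.16, 2.51, 4.63, 4.75, 3.98, 5.04, 2.80,
           2.52, 5.88, 2.95, 3.59, 3.87, 4.17, 3.30, 5.48, 4.26, 3.53]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(weights, n=4, method="exclusive")
iqr_raw = q3 - q1

print(round(q1, 4), round(q2, 4), round(q3, 4), round(iqr_raw, 4))
```

From the raw data the quartiles come out at about 3.36 and 4.72, giving an IQR of about 1.36. The grouped value of 1.433 differs slightly because it assumes the observations are spread evenly within each class interval.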
MEASURING SPREAD
• When we use the MEAN as our measure of central
tendency, we usually choose A MEASURE OF HOW FAR
THE DATA IS SPREAD OUT AROUND THE MEAN.
• Two measures of spread that are based on the mean are
the VARIANCE and the STANDARD DEVIATION.
• An advantage of standard deviation is that it is measured
in the same units as the original observations.
• The variance and standard deviation are closely related.
• The variance ($s^2$ or $\sigma^2$) is the square of the standard
deviation ($s$ or $\sigma$).
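The relationship between the two measures can be illustrated in Python. The slides do not give the formulas at this point, so this sketch assumes the usual sample versions (divisor n − 1), which is what `statistics.variance` and `statistics.stdev` compute.

```python
from statistics import stdev, variance

# Fraudulent cheques per week (weeks 1 to 30).
cheques = [5, 3, 8, 3, 3, 1, 10, 4, 6, 8,
           3, 5, 4, 7, 6, 6, 9, 3, 4, 5,
           7, 9, 4, 5, 8, 6, 4, 4, 10, 4]

s2 = variance(cheques)  # sample variance (assumed divisor n - 1)
s = stdev(cheques)      # sample standard deviation, in cheques per week

print(round(s2, 2), round(s, 2))  # 5.43 2.33
print(abs(s ** 2 - s2) < 1e-9)    # True: the variance is the square of s
```

The standard deviation (about 2.33 cheques per week) is in the same units as the observations, whereas the variance is in squared units.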