2. Chapter 3:
Describing, Exploring, and Comparing Data
3.1 Measures of Center
3.2 Measures of Variation
3.3 Measures of Relative Standing and Boxplots
Objectives:
1. Summarize data, using measures of central tendency, such as the mean, median, mode,
and midrange.
2. Describe data, using measures of variation, such as the range, variance, and standard
deviation.
3. Identify the position of a data value in a data set, using various measures of position,
such as percentiles, deciles, and quartiles.
4. Use the techniques of exploratory data analysis, including boxplots and five-number
summaries, to discover various aspects of data
3. Recall: 2.1 Frequency Distributions for Organizing and Summarizing Data
Data collected in original form is called raw data.
Frequency Distribution (or Frequency Table)
A frequency distribution is the organization of raw data in table form, using
classes and frequencies. It shows how data are partitioned among several
categories (or classes) by listing the categories along with the number
(frequency) of data values in each of them.
Nominal- or ordinal-level data that can be placed in categories is organized in
categorical frequency distributions.
Lower class limits: The smallest numbers that can belong to each of the
different classes
Upper class limits: The largest numbers that can belong to each of the
different classes
Class boundaries: The numbers used to separate the classes, but without the
gaps created by class limits
Class midpoints: The values in the middle of the classes. Each class midpoint
can be found by adding the lower class limit to the upper class limit and
dividing the sum by 2.
Class width: The difference between two consecutive lower class limits in a
frequency distribution
Procedure for Constructing a Frequency Distribution
1. Select the number of classes, usually between 5 and 20.
2. Calculate the class width, $W = \dfrac{\text{Max} - \text{Min}}{\text{number of classes}}$, and round up accordingly.
3. Choose the value for the first lower class limit by using either the minimum value or a convenient value below the minimum.
4. Using the first lower class limit and class width, list the other lower class limits.
5. List the lower class limits in a vertical column and then determine and enter the upper class limits.
6. Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to get the frequency.
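The six steps can be sketched in Python. This is only an illustration under stated assumptions: the data are whole numbers, the first lower class limit is the minimum value, and the function name `frequency_distribution` is hypothetical, not part of the slides.

```python
import math

def frequency_distribution(data, num_classes):
    """Tally whole-number data into classes, following the six-step procedure.

    Assumes integer data and uses the minimum as the first lower class limit.
    """
    lo, hi = min(data), max(data)
    # Step 2: class width = (max - min) / number of classes, rounded up.
    width = math.ceil((hi - lo) / num_classes)
    # If the division is exact, the maximum would fall just past the last
    # class, so widen every class by one unit in that case.
    if lo + width * num_classes <= hi:
        width += 1
    table = []
    for k in range(num_classes):
        lower = lo + k * width      # Steps 3-4: lower class limits
        upper = lower + width - 1   # Step 5: upper class limit (integer data)
        freq = sum(lower <= x <= upper for x in data)  # Step 6: tally
        table.append((lower, upper, freq))
    return table

days_off = [20, 26, 40, 36, 23, 42, 35, 24, 30]
for lower, upper, freq in frequency_distribution(days_off, 5):
    print(f"{lower} - {upper}: {freq}")
```

With five classes the width is (42 - 20) / 5 = 4.4, rounded up to 5, so the classes run 20-24, 25-29, and so on.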
4. Normal Distribution:
Because this histogram is
roughly bell-shaped, we say that
the data have a normal
distribution.
Skewness
A distribution of data is skewed if it is not symmetric and extends more to
one side than to the other.
Data skewed to the right (positively skewed) have a longer right tail.
Data skewed to the left (negatively skewed) have a longer left tail.
Recall: 2.2 Histograms
Histogram: A graph consisting of bars of equal width drawn
adjacent to each other (unless there are gaps in the data)
The horizontal scale represents classes of quantitative data
values, and the vertical scale represents frequencies. The
heights of the bars correspond to frequency values.
Important Uses of a Histogram
Visually displays the shape of the distribution of the data
Shows the location of the center of the data
Shows the spread of the data & Identifies outliers
5. Dotplots and Their Features: A graph of quantitative data in which each data value is plotted as a point (or dot) above a
horizontal scale of values. Dots representing equal values are stacked.
Displays the shape of the distribution of data. It is usually possible to recreate the original list of data values.
Stemplots (or stem-and-leaf plot)
Represents quantitative data by separating each value into two parts: the stem (such as the leftmost digit) and the leaf (such as the rightmost
digit).
Time-Series Graph
A graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.
Bar Graphs
A graph of bars of equal width to show frequencies of categories of categorical (or qualitative) data. The bars may or
may not be separated by small gaps.
Pareto Charts
A Pareto chart is a bar graph for categorical data, with the added condition that the bars are arranged in descending
order according to frequencies, so the bars decrease in height from left to right.
Pie Charts
A very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to
the frequency count for the category
Feature of a Pie Chart
Shows the distribution of categorical data in a commonly used format.
Frequency Polygon
A graph using line segments connected to points located directly above class midpoint values
A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.
An Ogive is a line graph that depicts cumulative frequencies
Recall: 2.3 Graphs that Enlighten and Graphs that Deceive
Graphs that Deceive
Nonzero Vertical Axis
A common deceptive graph involves
using a vertical scale that starts at some
value greater than zero to exaggerate
differences between groups.
Pictographs
Drawings of objects, called
pictographs, are often misleading.
Data that are one-dimensional in nature
(such as budget amounts) are often
depicted with two-dimensional objects
(such as dollar bills) or three-dimensional
objects (such as stacks of
coins, homes, or barrels).
By using pictographs, artists can create
false impressions that grossly distort
differences by using these simple
principles of basic geometry:
When you double each side of a square,
its area doesn’t merely double; it
increases by a factor of four.
When you double each side of a cube,
its volume doesn’t merely double; it
increases by a factor of eight.
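The geometry behind this distortion can be checked with a one-line calculation. The helper name `scale_factor` below is hypothetical and serves only to illustrate the two principles above.

```python
def scale_factor(linear_factor, dimensions):
    """Factor by which area (dimensions=2) or volume (dimensions=3) grows
    when every side of the object is multiplied by linear_factor."""
    return linear_factor ** dimensions

print(scale_factor(2, 2))  # square with doubled sides: area factor
print(scale_factor(2, 3))  # cube with doubled sides: volume factor
```

So a pictograph that doubles the height and width of a dollar bill to show a doubled budget makes the picture look four times as large.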
6. Recall: 2.4 Scatterplots, Correlation, and Regression
Linear Correlation Coefficient r
The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between
two variables.
The computed value of the linear correlation coefficient, r, is always between −1 and 1.
If r is close to −1 or close to 1, there appears to be a linear correlation.
If r is close to 0, there does not appear to be a linear correlation.
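The slides do not give a formula for r. As a sketch, the standard Pearson definition (an assumption brought in here, not taken from the slides) can be computed as follows; the function name `linear_correlation` and the sample values are hypothetical.

```python
import math

def linear_correlation(xs, ys):
    """Pearson linear correlation coefficient r for paired (x, y) data.

    Assumes both lists have the same length and neither is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data lying exactly on a rising, then a falling, line.
print(linear_correlation([1, 2, 3, 4], [2, 4, 6, 8]))
print(linear_correlation([1, 2, 3, 4], [8, 6, 4, 2]))
```

Points that fall exactly on a rising line give r = 1, and points on a falling line give r = −1, which matches the bounds stated above.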
Scatterplot and Correlation
Correlation
A correlation exists between two variables when the values of one variable are somehow associated with the
values of the other variable.
Linear Correlation
A linear correlation exists between two variables when there is a correlation and the plotted points of paired
data result in a pattern that can be approximated by a straight line.
Scatterplot (or Scatter Diagram)
A scatterplot (or scatter diagram) is a plot of paired (x, y) quantitative data with a horizontal x-axis and a
vertical y-axis. The horizontal axis is used for the first variable (x), and the vertical axis is used for the second
variable (y).
7. Key Concept
Obtain a value that measures the center of a data set.
Present and interpret measures of center, including mean and median.
3.1 Measures of Center
Measure of Center (Central Tendency)
A measure of center is a value at the center or middle of a data set.
1. Mean
2. Median
3. Mode
4. Midrange
5. Weighted Mean
8. A measure of center is a value at the center or middle of a data set.
The Mean (or Arithmetic Mean) of a set of data is the measure of center found by
adding all of the data values and dividing the total by the number of data values.
Important Properties of the Mean
• Sample means drawn from the same population tend to vary less than other measures
of center.
• The mean of a data set uses every data value.
• A disadvantage of the mean is that just one extreme value (outlier) can change the
value of the mean substantially. (Using the following definition, we say that the mean
is not resistant.)
Resistant
A statistic is resistant if the presence of extreme values (outliers) does not cause it
to change very much.
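A small sketch shows why the mean is not resistant. The data values are hypothetical, chosen only to illustrate the effect of a single outlier.

```python
def mean(values):
    """Arithmetic mean: add all data values, divide by how many there are."""
    return sum(values) / len(values)

typical = [10, 12, 11, 13, 12]        # hypothetical data
with_outlier = [10, 12, 11, 13, 100]  # same data, one value replaced by an outlier

print(mean(typical))       # 58 / 5
print(mean(with_outlier))  # 146 / 5
```

Changing one value out of five moves the mean from 11.6 to 29.2, far from where most of the data lie.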
9. Notation
∑ denotes the sum of a set of data values.
x is the variable usually used to represent the individual data values.
n represents the number of data values in a sample.
N represents the number of data values in a population.
µ is pronounced “mu” and is the mean of all values in a population:
$$\mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$$
The symbol $\bar{x}$ (x-bar) is used for the sample mean:
$$\bar{x} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
10. Example 1
The data represent the number of days off per year for a sample of
individuals selected from nine different countries. Find the mean.
20, 26, 40, 36, 23, 42, 35, 24, 30
$$\bar{X} = \frac{\sum X}{n} = \frac{20 + 26 + 40 + 36 + 23 + 42 + 35 + 24 + 30}{9} = \frac{276}{9} \approx 30.7 \text{ days}$$
11. Median
Median (MD): $\tilde{x}$ (x-tilde)
The median of a data set is the measure of center that is the middle value when the
original data values are arranged in order of increasing (or decreasing) magnitude. In
other words, the median is the midpoint of the data array.
Properties:
The median does not change by large amounts when we include just a few extreme
values, so the median is a resistant measure of center.
The median does not directly use every data value. (For example, if the largest value
is changed to a much larger value, the median does not change.)
To find the median:
First sort the data in ascending order:
1. If the number of data values is odd, the median is the number located in the exact middle of the sorted list.
2. If the number of data values is even, the median is found by computing the mean of the two middle numbers
in the sorted list.
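The two cases can be sketched in a few lines of Python. The function name `median` is chosen here for illustration, and the data are the speeds (in Mbps) used in this section's examples.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values
    when the number of data values is even. Assumes a non-empty list."""
    ordered = sorted(values)          # first sort in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: exact middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of two middle values

print(median([38.5, 55.6, 22.4, 14.1, 23.1]))        # five values (odd)
print(median([38.5, 55.6, 22.4, 14.1, 23.1, 24.5]))  # six values (even)
```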
12. Example 2
Given the data speeds: 38.5, 55.6, 22.4, 14.1, and 23.1 (all in megabits per
second, or Mbps), find
a. The mean b. The median
a. $\bar{X} = \dfrac{\sum X}{n} = \dfrac{38.5 + 55.6 + 22.4 + 14.1 + 23.1}{5} = \dfrac{153.7}{5} = 30.74$ Mbps
b. Sort: 14.1, 22.4, 23.1, 38.5, 55.6
The median is 23.1 Mbps.
13. Mode
The mode (sometimes called the most typical case) of a data set is the value(s) that occur(s) with the greatest
frequency.
The mode can be found with qualitative data. When no data value is repeated, we say that there is no mode. There
may be no mode, one mode (unimodal), two modes (bimodal), or many modes (multimodal).
Midrange
The midrange of a data set is the measure of center that is the value midway between the maximum and minimum
values in the original data set: $MR = \dfrac{\text{Min} + \text{Max}}{2}$
Because the midrange uses only the maximum and minimum values, it is very sensitive to those extremes so the
midrange is not resistant.
In practice, the midrange is rarely used, but it has three features:
1. The midrange is very easy to compute.
2. The midrange helps reinforce the very important point that there are several different ways to define the
center of a data set.
3. The value of the midrange is sometimes used incorrectly for the median, so confusion can be reduced by
clearly defining the midrange along with the median.
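Both definitions translate directly into code. The sketch below uses hypothetical function names `modes` and `midrange`, returns an empty list when no value is repeated, and assumes the data list is not empty.

```python
from collections import Counter

def modes(values):
    """Value(s) with the greatest frequency; empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no data value is repeated, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

def midrange(values):
    """Value midway between the minimum and the maximum."""
    return (min(values) + max(values)) / 2

print(modes([0.3, 0.3, 0.6, 4.0, 4.0]))                 # bimodal data
print(modes([0.3, 1.1, 2.4, 4.0, 5.0]))                 # no repeated value
print(midrange([38.5, 55.6, 22.4, 14.1, 23.1, 24.5]))   # (14.1 + 55.6) / 2
```

Because `modes` only counts occurrences, it also works with qualitative data such as a list of category names.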
14. Example 3
Given the data speeds: 38.5, 55.6, 22.4, 14.1, 23.1, 24.5 (all in megabits per
second, or Mbps), find
a. The mean b. The median c. The mode d. The midrange
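a. $\bar{X} = \dfrac{\sum X}{n} = \dfrac{38.5 + 55.6 + 22.4 + 14.1 + 23.1 + 24.5}{6} = \dfrac{178.2}{6} = 29.7$ Mbps
b. Sort: 14.1, 22.4, 23.1, 24.5, 38.5, 55.6
$\tilde{x} = \dfrac{23.1 + 24.5}{2} = 23.80$ Mbps
c. There is no mode. Why? No data value is repeated.
d. $MR = \dfrac{\text{Min} + \text{Max}}{2} = \dfrac{14.1 + 55.6}{2} = 34.85$ Mbps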
15. Example 4
a. Find the mode of these data speeds (in Mbps):
0.2, 0.3, 0.3, 0.3, 0.6, 0.6, 0.7, 1.2
The mode is 0.3 Mbps, because it is the data speed occurring most often
(three times).
b. Mode? 0.3, 0.3, 0.6, 4.0, 4.0
Two modes: 0.3 Mbps and 4.0 Mbps.
c. Mode? 0.3, 1.1, 2.4, 4.0, 5.0
No mode because no value is repeated.
16. Rounding Rule:
The mean, median, and midrange should be rounded to one more
decimal place than occurs in the raw data.
The mean, in most cases, is not an actual data value.
For the mode, leave the value as is without rounding (because values of
the mode are the same as some of the original data values).
Caution
Never use the term average when referring to a measure of center. The word
average is often used for the mean, but it is sometimes used for other
measures of center.
The term average is not used by statisticians, the statistics community, or
professional journals.
17. Example 5
Calculating the Mean from a Frequency Distribution
Find the mean.
Mean from a Frequency Distribution:
Multiply each frequency and class midpoint,
add the products, and divide by the sample size:
$$\bar{X} = \frac{\sum f \cdot X_m}{n}$$
The calculation is made by pretending that all sample values
in each class are equal to the class midpoint.
Class Boundaries   Frequency f   Midpoint Xm   f · Xm
5.5 - 10.5         1             8             8
10.5 - 15.5        2             13            26
15.5 - 20.5        3             18            54
20.5 - 25.5        5             23            115
25.5 - 30.5        4             28            112
30.5 - 35.5        3             33            99
35.5 - 40.5        2             38            76
Totals:            n = ∑f = 20                 ∑(f · Xm) = 490
$$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{490}{20} = 24.5$$
18. Example 6
Calculating the Mean from a Frequency Distribution
Use the frequency distribution to find the mean.
Time (seconds)   Frequency f   Class Midpoint x   f · x
75 – 124         11            99.5               1094.5
125 – 174        24            149.5              3588.0
175 – 224        10            199.5              1995.0
225 – 274        3             249.5              748.5
275 – 324        2             299.5              599.0
Totals:          ∑f = 50                          ∑(f · x) = 8025.0
$$\bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{8025}{50} = 160.5 \text{ seconds}$$
The result of $\bar{x}$ = 160.5 is an
approximation because it is based
on the use of class midpoint values
instead of the original data.
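The calculation for a frequency distribution can be sketched as follows. The function name `grouped_mean` and the tuple layout (lower, upper, frequency) are illustrative choices, not notation from the slides; the numbers are the class limits and frequencies from the time data above and the class boundaries from Example 5.

```python
def grouped_mean(classes):
    """Approximate mean from a frequency distribution.

    classes: list of (lower, upper, frequency) tuples. Every value in a
    class is treated as if it were equal to the class midpoint.
    """
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total_fx / total_f

times = [(75, 124, 11), (125, 174, 24), (175, 224, 10),
         (225, 274, 3), (275, 324, 2)]
print(grouped_mean(times))  # 8025 / 50
```

The midpoint is the same whether class limits or class boundaries are supplied, because both pairs are centered on the same value.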
19. Example 7
Find the modal class for the frequency distribution.
Class Boundaries   Frequency f   Midpoint Xm
5.5 - 10.5         1             8
10.5 - 15.5        2             13
15.5 - 20.5        3             18
20.5 - 25.5        5             23
25.5 - 30.5        4             28
30.5 - 35.5        3             33
35.5 - 40.5        2             38
Totals:            n = ∑f = 20
The modal class is
20.5 – 25.5.
The mode, the midpoint
of the modal class:
(20.5 + 25.5) / 2 = 23
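Finding the modal class amounts to picking the class with the largest frequency. In this sketch the function name `modal_class` and the tuple layout are hypothetical, and if two classes tie for the largest frequency only the first is returned.

```python
def modal_class(classes):
    """Return the class with the highest frequency and its midpoint.

    classes: list of (lower_boundary, upper_boundary, frequency) tuples.
    On a tie, the first class with the highest frequency is returned.
    """
    lo, hi, _ = max(classes, key=lambda c: c[2])
    return (lo, hi), (lo + hi) / 2

distribution = [(5.5, 10.5, 1), (10.5, 15.5, 2), (15.5, 20.5, 3),
                (20.5, 25.5, 5), (25.5, 30.5, 4), (30.5, 35.5, 3),
                (35.5, 40.5, 2)]
print(modal_class(distribution))
```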
20. When different x data values are assigned different weights w, we can compute a
weighted mean.
To find the Weighted Mean of a variable:
Multiply each value by its corresponding weight and divide the sum of the
products by the sum of the weights.
$$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}$$
Example 8: Grade-Point Average
Find a GPA for a college student who took 5 courses and received:
A (3 credits), A (4 credits), B (3 credits), C (3 credits), and F (1 credit).
The grading system assigns quality points to letter grades as follows: A = 4; B = 3; C =
2; D = 1; F = 0.
Use the numbers of credits as weights: w = 3, 4, 3, 3, 1.
Replace the letter grades of A, A, B, C, and F with the
corresponding quality points: x = 4, 4, 3, 2, 0.
$$\bar{X} = \frac{3(4) + 4(4) + 3(3) + 3(2) + 1(0)}{3 + 4 + 3 + 3 + 1} = \frac{43}{14} \approx 3.07$$
21. Example 9: Grade-Point Average
A student received the following grades. Find the corresponding GPA.
Course                       Credits, w   Grade, X
English Composition          3            A (4 points)
Introduction to Psychology   3            C (2 points)
Biology                      4            B (3 points)
Physical Education           2            D (1 point)
$$\bar{X} = \frac{\sum wX}{\sum w} = \frac{3 \cdot 4 + 3 \cdot 2 + 4 \cdot 3 + 2 \cdot 1}{3 + 3 + 4 + 2} = \frac{32}{12} \approx 2.67$$
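The GPA calculations can be reproduced with a short weighted-mean sketch. The function name `weighted_mean` is hypothetical; the credits and quality points are those of the two grade-point examples.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Five courses: grades A, A, B, C, F with 3, 4, 3, 3, 1 credits.
print(round(weighted_mean([4, 4, 3, 2, 0], [3, 4, 3, 3, 1]), 2))  # 43 / 14
# Four courses: grades A, C, B, D with 3, 3, 4, 2 credits.
print(round(weighted_mean([4, 2, 3, 1], [3, 3, 4, 2]), 2))        # 32 / 12
```

When every weight is equal, this reduces to the ordinary mean.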
22. Properties of the Mean
Uses all data values.
Varies less than the median or mode when samples are drawn from the same population.
Used in computing other statistics, such as the variance.
Unique, and usually not one of the data values.
Cannot be used with open-ended classes.
Affected by extremely high or low values, called outliers.
Properties of the Midrange
Easy to compute.
Gives the midpoint.
Affected by extremely high or low values in a data set.
23. Properties of the Median
Gives the midpoint.
Used when it is necessary to find out whether the data values fall into the upper half or lower half of the distribution.
Can be used for an open-ended distribution.
Affected less than the mean by extremely high or extremely low values.
Properties of the Mode
Used when the most typical case is desired.
Easiest measure of center to compute.
Can be used with nominal data.
Not always unique, and may not exist.