Intro to Data Analysis and Descriptive Statistics FD 502 Presentation.pdf

CHAPTER 2
Introduction to Data Analysis in
MS Excel and SPSS
FD 502 - Basic Statistics
Christian G. Abalos
Presenter

• Table and Graphs
• Measures of Central Tendency
• Measures of Relative Position
• Measures of Variability
• Skewness and Kurtosis
• Introduction to SPSS
• Introduction to Statistical Tools in MS Excel and SPSS
• Frequency Distribution

Targets
1. Construct a frequency distribution table
using SPSS.
2. Present MS Excel and SPSS results in
graphs or tables using the APA format.
3. Use MS Excel and SPSS to compute
descriptive statistics.
4. Interpret MS Excel and SPSS results.
Basic Statistics FD 502

Data Analysis with Excel is a comprehensive tutorial
that provides a good insight into the latest and
advanced features available in Microsoft Excel. It
explains in detail how to perform various data
analysis functions using the features available in MS Excel.
What is the Data Analysis ToolPak in Excel?

Introduction to Data Analysis FD 502
Data Analysis ToolPak in MS Excel
These instructions apply only to Excel 2010 and later versions.
• Click the File tab, click Options, and then click the Add-ins category.
• In the Manage box, select Excel Add-ins and then click Go.
• In the Add-ins available box, select the Analysis ToolPak check box
and then click OK.
Tip:
If Analysis ToolPak is not listed in the Add-ins available box, click
Browse to locate it.
If you are prompted that the Analysis ToolPak is not currently
installed on your computer, click Yes to install it.

Features of Data Analysis ToolPak in MS Excel

Sample Data for Exploring the Data Analysis ToolPak

IBM Statistical Package for Social Sciences (SPSS)

Features of IBM Statistical Package for Social
Sciences (SPSS)
• Standard Package
• Data Access and
Management
• Data Preparation
• Graphs
• Output
• Data Editor Enhancement
• Extended Programmability
• Statistics
• Multi Threaded Algorithm
• Bootstrapping
• Regression
• Advanced Statistics

Data Creation in IBM SPSS

Sample Data in Exploring IBM SPSS

Frequency Distribution
A frequency distribution is the organization of
raw data in table form, using classes and
frequencies.
There are three basic types of frequency
distributions. The three types are categorical,
ungrouped and grouped frequency distributions.

Frequency Distribution FD 502
Categorical Frequency Distribution
The categorical frequency distribution is used
for data that can be placed in specific
categories, such as nominal- or ordinal-level
data.
For example, data such as political affiliation,
religious affiliation, or major field of study.

Example
Twenty-five army inductees were given a blood
test to determine their blood type. The data set is as
follows:
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A

The categorical frequency distribution is

Blood Type    Frequency    Percent
A             5            20
B             7            28
O             9            36
AB            4            16
Total         N = 25       100
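
The tally above can be reproduced with a few lines of Python. This is only a minimal sketch outside of Excel or SPSS; the variable names (`blood_types`, `counts`, `table`) are illustrative and not part of the slides.

```python
from collections import Counter

# Blood types of the 25 inductees, copied row by row from the example.
blood_types = (
    "A B B AB O "
    "O O B AB B "
    "B B O A O "
    "A O O O AB "
    "AB A O B A"
).split()

counts = Counter(blood_types)   # frequency of each category
n = len(blood_types)            # total number of observations, N

# Percent = frequency / N * 100 for each category.
table = {bt: (counts[bt], counts[bt] / n * 100) for bt in ("A", "B", "O", "AB")}

for blood_type, (freq, pct) in table.items():
    print(f"{blood_type:<3} {freq:>2} {pct:>5.0f}")
print(f"N = {n}")
```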

Ungrouped Frequency Distribution
An ungrouped frequency
distribution is used for numerical data
when the range (the difference between
the highest and the lowest values) is small.

Example

The ungrouped frequency distribution is

Class Limits (in miles)    Frequency    Percentage
12                         6            20
13                         1            3
14                         3            10
15                         6            20
16                         8            27
17                         2            7
18                         3            10
19                         1            3
Total                      N = 30       100

Grouped Frequency Distribution
When the range of the data is large, the data
must be grouped into classes that are more than one
unit in width.
To construct a frequency distribution, follow these
rules:
1. There should be between 5 and 20 classes.
2. The class width should be an odd number. This
ensures that the midpoint of each class has the
same place value as the data.

3. The classes must be mutually
exclusive. Mutually exclusive classes have
nonoverlapping class limits so that data cannot be
placed into two classes.
4. The classes must be continuous. There
should be no gaps in a frequency distribution.
5. The classes must be exhaustive. There
should be enough classes to accommodate all the data.
6. The classes must be equal in width.
This avoids a distorted view of the data.
Grouped Frequency Distribution

CHAPTER 2
Presentation of Data
Christian G. Abalos
Presenter

Constructing Grouped Frequency Distribution
1. Find the range.
range = highest value – lowest value
2. Decide on the number of class intervals or
classes, denoted by k.
• Sturges' formula: $k = 1 + \log_2 N$
• another formula:
• 5 – 20 classes
3. Determine the class size or class width of
the interval, denoted by c
(rounded to the nearest odd whole number).
4. Determine the lower limit LL and the upper
limit UL of the lowest class interval. The lowest class
interval should contain the lowest value in the data set. The value of
the UL is determined using the equation
UL = LL + (c – 1)

5. Determine the upper class intervals by
consecutively adding the class size c to the
values of LL and UL of the lowest class
interval until we get the class interval with the
highest value in the data set.
6. Tally the data, find the frequencies.
Note: Other statistical information may be reflected in the table such
as class boundaries, class marks or class midpoints, less than
cumulative frequency (<cf), greater than cumulative frequency (>cf),
and the relative frequency (rf)
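
The six steps can be sketched in Python as follows. Several details are assumptions, because the slides do not state them: k is rounded to the nearest whole number, the class width is taken as c = range / k before rounding to the nearest odd whole number, and the lowest class starts at the lowest value. The data are the forty Mathematics scores listed in the percentile example of these slides, and the function names are illustrative.

```python
import math

# Forty Mathematics scores, as listed in the percentile example of these slides.
scores = [
    16, 26, 31, 32, 34, 37, 39, 43,
    19, 29, 31, 33, 34, 37, 39, 44,
    22, 30, 31, 33, 35, 37, 41, 45,
    25, 30, 32, 33, 35, 38, 41, 47,
    26, 31, 32, 34, 36, 38, 42, 47,
]


def nearest_odd(x):
    """Round x to the nearest odd whole number (never below 1)."""
    return max(1, 2 * round((x - 1) / 2) + 1)


def grouped_frequency(data):
    """Build class limits and frequencies for whole-number data."""
    data_range = max(data) - min(data)        # step 1: range
    k = round(1 + math.log2(len(data)))       # step 2: Sturges (rounding assumed)
    c = nearest_odd(data_range / k)           # step 3: c = range / k (assumed)
    lower = min(data)                         # step 4: assumed starting point
    classes = []
    while True:                               # step 5: keep adding c
        upper = lower + (c - 1)               # UL = LL + (c - 1)
        classes.append((lower, upper))
        if upper >= max(data):
            break
        lower += c
    # step 6: tally the observations falling inside each class
    freqs = [sum(lo <= x <= hi for x in data) for lo, hi in classes]
    return classes, freqs


classes, freqs = grouped_frequency(scores)
for (lo, hi), f in zip(classes, freqs):
    print(f"{lo:>2} - {hi:<2}  {f}")
```

With these assumptions the loop may produce one class more or fewer than k, since c is rounded; the 5 to 20 classes guideline is not enforced here.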

• The class boundaries are used to
separate the classes so that there are
no gaps in the frequency distribution.
Other Features of Grouped Frequency Distribution

• The class midpoint is found by adding the upper and
lower boundaries (or limits) and dividing by 2.
• The cumulative frequencies are used to determine the
number of cases falling below (for <cf) or above (for
>cf) a particular value in a distribution.
• The relative frequency (rf) of a class interval is the
proportion of observations falling within the class and
may be presented as a percent. Thus,
$rf = \frac{f}{n} \times 100$
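
As a rough illustration, the relative and cumulative frequencies can be computed from a list of frequencies. The sketch below reuses the frequencies of the ungrouped (miles) table shown earlier; the variable names are illustrative, and the convention assumed here is that <cf accumulates from the lowest value upward and >cf from the highest value downward.

```python
from itertools import accumulate

# Values and frequencies from the ungrouped frequency distribution (in miles).
values = [12, 13, 14, 15, 16, 17, 18, 19]
freqs = [6, 1, 3, 6, 8, 2, 3, 1]
n = sum(freqs)

# Relative frequency in percent: rf = f / n * 100
rf = [f / n * 100 for f in freqs]

# Less-than cumulative frequency: running total from the lowest value up.
less_cf = list(accumulate(freqs))

# Greater-than cumulative frequency: running total from the highest value down.
greater_cf = list(accumulate(reversed(freqs)))[::-1]

for v, f, r, lcf, gcf in zip(values, freqs, rf, less_cf, greater_cf):
    print(f"{v:>2} {f:>2} {r:>5.1f} {lcf:>3} {gcf:>3}")
```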

Distribution of scores of forty students in a
Mathematics class.
Example

Why do we construct frequency distribution?

Graphical Presentations of Data
The three most common statistical graphs are
the bar graph (histogram), the frequency
polygon, and the cumulative frequency graph, or
ogive.
The purpose of graphs in statistics is to convey
the data to the viewer in pictorial form.
Graphs are useful in getting the audience’s
attention in a publication or a presentation.

Graphs FD 502
Histogram
The histogram is a graph that displays the data
by using vertical bars of various heights to
represent the frequencies.

Frequency Polygon
The frequency polygon is a graph that displays
the data by using lines that connect points plotted
for the frequencies at the midpoints of the classes.

Ogive
The ogive is the graph that represents
the cumulative frequencies for the classes
in a frequency distribution.

Other types of Graphs
Pareto Chart
A Pareto chart is used to represent a frequency
distribution for a categorical variable. The
frequencies are displayed by the heights of
vertical bars, which are arranged in order from
highest to lowest.

Pie Chart
A pie chart is a circle that is divided into
sections according to the percentage of
frequencies in each category of the
distribution.

A stem-and-leaf plot is a data plot that uses part of
a data value as a stem and part of the data value as
the leaf to form groups or classes.
It has the advantage over grouped frequency
distribution of retaining the actual data while
showing them in graphic form.
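
A text-only stem-and-leaf plot is easy to sketch in Python. The example below assumes non-negative whole numbers, with the tens (and higher) digits as the stem and the units digit as the leaf; the function name and the sample values are hypothetical and are not taken from the slides.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (tens and above) and leaf (units)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 24 -> stem 2, leaf 4
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical values, used only to show the layout.
sample = [12, 15, 21, 24, 24, 30]

for stem, leaves in stem_and_leaf(sample).items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Every original value can be read back from the plot, which is the advantage over a grouped frequency distribution noted above.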

Short History on Stem-Leaf Diagram

CHAPTER 3
Descriptive Statistics
Christian G. Abalos
Presenter

Summary of Measures
• Central Tendency: Mean, Median, Mode
• Other Locations: Percentiles, Quartiles, Deciles
• Variation: Range, Variance, Standard Deviation, Coefficient of Variation

Measures of Central Tendency FD 502
A measure of central tendency or measure of
central location describes the “center” of a
given set of data. This is a value about which
observations tend to cluster.
• A single value used to represent the “center” of the data or the typical
value.
• An index of the central location of a distribution.
• Precise but simple.
• The most representative value of the data.

Common measures of central
tendency are the MEAN, MEDIAN, and
MODE.
Arithmetic Mean
The arithmetic mean or simply the
mean is the average of a given set of
data. It is obtained by dividing the sum
of all the observations by the total
number of observations.

Arithmetic Mean
• Population mean for a finite
population with N elements,
denoted by the Greek letter μ:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Sample mean for a finite sample
with n elements, denoted by $\bar{x}$:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
The population mean is a parameter while the sample mean is a statistic.

Example
A random sample of 5 BSED students about to
take their final examination were asked how
many hours they slept the night before the test.
The data given are 5, 7, 3, 4, and 6. The mean
number of hours of sleep is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{5 + 7 + 3 + 4 + 6}{5} = 5$ hours

Example
Using the data in the previous example, if one
student had reported 14 hours of sleep instead
of 3 hours, then the new mean is
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{5 + 7 + 14 + 4 + 6}{5} = 7.2$ hours
Remark: The mean takes into account all
observations in the data set. Thus, it is
affected by extreme values.
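
Both sleep-hours examples can be checked quickly with Python's standard `statistics` module. This is only a sketch; the variable names are illustrative.

```python
from statistics import mean

hours = [5, 7, 3, 4, 6]                 # original sample
hours_with_extreme = [5, 7, 14, 4, 6]   # 3 hours replaced by 14 hours

original_mean = mean(hours)
shifted_mean = mean(hours_with_extreme)

print(original_mean)   # mean of the original sample
print(shifted_mean)    # mean after one extreme value is introduced
```

Changing a single observation moves the mean from 5 to 7.2 hours, which illustrates the remark about extreme values.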

Mean for Grouped Data
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
where
$f_i$ = frequency of the class interval
$x_i$ = class mark of the class interval

Example
Given the frequency distribution table
below, find its mean.

Class Interval    Frequency
19 – 21           3
16 – 18           10
13 – 15           4
10 – 12           12
7 – 9             6

We first solve for the class mark and the
product of the class mark and the
frequency.

Class Interval    Frequency (f)    Class Mark (x)    fx
19 – 21           3                20                60
16 – 18           10               17                170
13 – 15           4                14                56
10 – 12           12               11                132
7 – 9             6                8                 48
                  Σf = 35                            Σfx = 466

$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{466}{35} \approx 13.3$
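
The same computation can be sketched in Python, taking the class mark as the midpoint of the class limits. The variable names are illustrative.

```python
# Class limits and frequencies from the example table.
class_intervals = [(19, 21), (16, 18), (13, 15), (10, 12), (7, 9)]
frequencies = [3, 10, 4, 12, 6]

# Class mark = midpoint of the lower and upper limits.
class_marks = [(lo + hi) / 2 for lo, hi in class_intervals]

sum_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
grouped_mean = sum_fx / sum_f

print(sum_f, sum_fx, round(grouped_mean, 1))
```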

Weighted Mean
• Used when individual values have varying
importance.
• Weights are assigned to each observed value before
the mean is computed.

Example
Find the GPA of a student with the corresponding grades below:

Subjects    Grades    Units    Grades × Units
A           1.5       3        4.5
B           1.1       2        2.2
C           1.8       3        5.4
D           2.0       4        8.0
E           1.5       3        4.5
Total                 15       24.6

$\bar{x} = \frac{24.6}{15} = 1.64$
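
A weighted mean such as this GPA is the sum of each value times its weight, divided by the sum of the weights. A minimal Python sketch, with illustrative variable names:

```python
grades = [1.5, 1.1, 1.8, 2.0, 1.5]   # subjects A to E
units = [3, 2, 3, 4, 3]              # weights

total_units = sum(units)
weighted_sum = sum(g * u for g, u in zip(grades, units))
gpa = weighted_sum / total_units

print(total_units, round(weighted_sum, 1), round(gpa, 2))
```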

Properties of the Mean
• The most common and widely understood measure of central
tendency, which uses all observed values in its calculation.
• The mean can be computed for both grouped and ungrouped data; for
grouped data, it may not be based on the actual observed values.
• The mean is affected by extreme values.
• The mean always exists and is unique.
• The mean is used when the distribution is symmetrical,
when all observed values are given equal importance, and as a
basis for other statistics.

Median
• Divides an ordered set of observations into two equal parts; it is the
positional middle of the array.
• Half of the observations are below its value and the
other half are above its value.

Steps in finding the Median
1. Arrange the set of scores in ascending order
(from lowest to highest)
2. If n is odd, there will be a middle score. This
middle score is the median.
If n is even, there will be two middle scores. The
median is taken as the arithmetic average of the
two middle scores.

Example
Below are the scores of 6 students in
their Mathematics test. Find the
median.
35 20 12 30 25 50
Arranging the scores in increasing
order, we have
12 20 25 30 35 50

Since n = 6, the median is the average
of the
$\left(\frac{n}{2}\right) = \left(\frac{6}{2}\right) = 3$rd
and
$\left(\frac{n}{2} + 1\right) = \left(\frac{6}{2} + 1\right) = 4$th
observations. That is,
$Md = \frac{x_3 + x_4}{2} = \frac{25 + 30}{2} = 27.5$

Example
Below are the scores of 7 students in
their Mathematics test. Find the
median.
35 20 12 30 25 50 26
Arranging the scores in increasing
order, we have
12 20 25 26 30 35 50

Since n = 7, the median is the
$\left(\frac{n + 1}{2}\right) = \left(\frac{7 + 1}{2}\right) = 4$th
observation. That is,
$Md = x_4 = 26$
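
Both cases (even and odd n) can be verified with `statistics.median`, which sorts the data and averages the two middle values when n is even. A short sketch with illustrative names:

```python
from statistics import median

scores_even = [35, 20, 12, 30, 25, 50]       # n = 6
scores_odd = [35, 20, 12, 30, 25, 50, 26]    # n = 7

median_even = median(scores_even)   # average of the 3rd and 4th ordered scores
median_odd = median(scores_odd)     # the 4th ordered score

print(median_even, median_odd)
```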

Characteristics of the Median
• The median is a positional measure.
• Extreme values affect the median less than the mean.
• The median is used when there are extreme observed values.
• The median is also used when grouped data or a frequency
distribution do not have a true zero point or have open-ended class
intervals.

Mode
The mode is the observed value that occurs most often, or with the greatest
frequency, in a data set.
The mode can be identified by counting the frequency of each observed value and
locating the observed value with the highest frequency.
The mode is a less popular measure of central tendency than the mean
and the median, but it is the easiest to find and can be considered a quick
estimate of central tendency.

Example
Find the mode of 7; 5; 5; 3; 1; 1; 3; 5
Since 5 appears more frequently than the rest of the observed values, the
mode is 5.
Find the mode of 7; 5; 5; 3; 1; 1; 3; 5; 1
Since 5 and 1 appear more frequently than the rest of the observed values,
the modes are 1 and 5.
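
Because a data set can have more than one mode, `statistics.multimode` (available in Python 3.8 and later) is a convenient way to check both examples. The variable names are illustrative.

```python
from statistics import multimode

first = [7, 5, 5, 3, 1, 1, 3, 5]
second = [7, 5, 5, 3, 1, 1, 3, 5, 1]

# multimode returns every value that ties for the highest frequency.
modes_first = sorted(multimode(first))
modes_second = sorted(multimode(second))

print(modes_first)
print(modes_second)
```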

CRUDE MODE
$Mo = 3 \times \text{Median} - 2 \times \text{Mean}$

Characteristics of Mode
The mode is the most typical value of a set of observations.
A few low or high values do not easily affect the mode.
The mode is sometimes not unique and sometimes does not exist.
There may be several modes for one data set.
The mode can be obtained for both quantitative and qualitative types of data.

Central Tendency in Relation to Levels of
Measurement
• Variables measured categorically are either nominal or ordinal data
and are best represented using frequency counts. Statistical comparison
of such data rests on differences in proportions.
Specifically, nominal data can be summarized using the mode, and
ordinal data using the median or mode. As a consequence, statistical
tests that compare means are not permissible for categorical
data, since the mean does not exist as a measure of centrality for
such data.
• Both interval and ratio data accommodate the mean as a measure of central
tendency. Interval data are less powerful than ratio data because their
zero point is arbitrary.

Measures of Relative Position FD 502

Measures of Relative Position
Percentile
• Per-centum
• Divides the ordered observations into 100 equal parts.
• There are 99 percentiles, denoted as P1, P2, P3, …, P99, with around 1% of the
observations in each group.
We interpret percentiles as follows:
P1, first percentile, is the value below which 1% of the ordered
values fall.
P2, second percentile, is the value below which 2% of the ordered
values fall.
P99, ninety-ninth percentile, is the value below which 99% of the
ordered values fall.

Computing the Percentile
• Step 1: Arrange the data in ascending order.
• Step 2: Assume that there are no missing data and that all values exist.
• Step 3: Let X1, X2, X3, …, Xn be the observations arranged from lowest to
highest.
• Step 4: Denote the percentile of interest by k.
• Step 5: Get the percentile using the formula
(i)
(ii)

Another Formula for
Computing the Percentile

Example
Below are the scores of 40 students in
their Mathematics test. Find P85.
16 26 31 32 34 37 39 43
19 29 31 33 34 37 39 44
22 30 31 33 35 37 41 45
25 30 32 33 35 38 41 47
26 31 32 34 36 38 42 47

We seek the value below which
$\frac{85}{100} \times 40 = 34$ observations fall.
As seen from the table, P85 could be
any value between 41 and 42. To have
a unique value, we define
$P_{85} = \frac{41 + 42}{2} = 41.5$
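
The reasoning in this example can be sketched as a small Python function. When k × n / 100 is a whole number i, the function averages the ith and (i + 1)th ordered values, as done above. The rule for the other case (taking the next ordered value) is an assumption, since the formulas on the percentile slide are not shown in this text. The function name is hypothetical.

```python
def percentile(data, k):
    """Return the kth percentile (1 <= k <= 99) of a list of numbers."""
    if not 1 <= k <= 99:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    i, remainder = divmod(k * n, 100)   # i = whole part of k * n / 100
    if remainder == 0:
        # k% of n is a whole number: average the ith and (i + 1)th values
        return (ordered[i - 1] + ordered[i]) / 2
    # Assumed rule for the non-integer case: take the next ordered value.
    return ordered[i]


scores = [
    16, 26, 31, 32, 34, 37, 39, 43,
    19, 29, 31, 33, 34, 37, 39, 44,
    22, 30, 31, 33, 35, 37, 41, 45,
    25, 30, 32, 33, 35, 38, 41, 47,
    26, 31, 32, 34, 36, 38, 42, 47,
]

p85 = percentile(scores, 85)
print(p85)
```

Excel and SPSS may use different interpolation rules, so their percentile output will not always match this hand computation.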

Decile
Deciles are values that divide a set of ordered
observations into 10 equal parts. These values,
denoted by D1, D2, …, D9, are such that 10% of the data
falls below D1, 20% falls below D2, …, and 90% falls
below D9.
• Refers to the nine (9) values that divide an ordered data set into 10 equal parts.
• The ith decile, Di, is a value below which 10 × i % of the data lie.
D1, first decile, is the value below which 10% of the ordered
values fall.

Quartile
• Refers to the three (3) values that divide an ordered
data set into 4 equal parts.
• Splits the ordered data into 4 quarters.
The ith quartile, Qi, is a value below which 25 × i % of the data lie.

The lower quartile,
denoted by Q1, divides the
bottom 25% of the
ordered observations from
the top 75%.
The middle quartile,
denoted by Q2, is the median. It divides the
bottom 50% of the
ordered observations
from the top 50%.
The upper quartile,
denoted by Q3, divides the bottom
75% of the ordered
observations from the
top 25%.
Qk, k = 1, 2, 3, is a value in an ordered distribution such that 25k% of the ordered data in
the distribution are < Qk.
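
Quartiles can be obtained in Python with `statistics.quantiles` (Python 3.8 and later). This is only a sketch: the function interpolates between ordered values using its default "exclusive" method, which is one of several conventions, so Q1 and Q3 may differ slightly from values computed by hand, in Excel, or in SPSS. The data are the six Mathematics scores from the median example.

```python
from statistics import median, quantiles

scores = [35, 20, 12, 30, 25, 50]

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(scores, n=4)

print(q1, q2, q3)
print(q2 == median(scores))   # the middle quartile is the median
```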

Box and Whiskers Plot
In a box and whiskers plot:
• The ends of the box are the upper and lower quartiles, so
the box spans the interquartile range.
• The median is marked by a vertical line inside the box.
• The whiskers are the two lines outside the box that
extend to the highest and lowest observations.

Box and Whiskers Plot

Measures of Variability FD 502
Measures of Variability
Absolute Dispersion: Necessary to compare two or
more data sets with similar means and the same unit of
measurement.
Relative Dispersion: Necessary to compare two or
more data sets with different means or different units of
measurement.

Consider the two sets of data below.
Set A
25, 28, 28, 30, 30, 33, 35, 40, 41, 45
Set B
10, 15, 23, 28, 28, 30, 39, 45, 52, 65
They have the same mean (33.5) but
Set A is more homogeneous than Set B.

Range
The range of a set of data is the
difference between the largest and
smallest number in the set.
Example:
In Set A, the range is 45 – 25 = 20.
In Set B, the range is 65 – 10 = 55.

Mean Absolute Deviation (MAD)
$MAD = \frac{\sum |x_i - \bar{x}|}{N}$
where
$x_i$ = score
$\bar{x}$ = mean of the scores
N = total number of scores
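
As an illustration, the MAD of Set A from the earlier comparison of two data sets can be computed directly from the formula. The function name is illustrative.

```python
from statistics import mean


def mean_absolute_deviation(scores):
    """Average absolute distance of each score from the mean."""
    center = mean(scores)
    return sum(abs(x - center) for x in scores) / len(scores)


set_a = [25, 28, 28, 30, 30, 33, 35, 40, 41, 45]
mad_a = mean_absolute_deviation(set_a)

print(round(mad_a, 2))
```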

Mean Absolute Deviation (MAD) Grouped Data
$MAD = \frac{\sum f |X - \bar{X}|}{N}$
where
X = class mark
$\bar{X}$ = mean
f = frequency
N = total number of cases

Population Variance
Ungrouped Data
Given the finite population x1, x2, …,
xN, the population variance is
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Sample Variance
Ungrouped Data
Given a random sample x1, x2, …,
xn, the sample variance is
Unbiased estimator:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Biased estimator:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$

Sample Standard Deviation
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

Computational Formula for the Sample
Variance (unbiased)
$s^2 = \frac{n \sum x^2 - \left(\sum x\right)^2}{n(n - 1)}$

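Example:
Set A
25, 28, 28, 30, 30, 33, 35, 40, 41, 45
We have
$s^2 = \frac{\sum_{i=1}^{10} (x_i - 33.5)^2}{10 - 1} = \frac{(25 - 33.5)^2 + \dots + (45 - 33.5)^2}{9} = \frac{390.5}{9} \approx 43.39$
$s \approx 6.59$
Rounded to whole numbers, $s^2 \approx 43$ and $s \approx 7$.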

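Example:
Set B
10, 15, 23, 28, 28, 30, 39, 45, 52, 65
We have
$s^2 = \frac{\sum_{i=1}^{10} (x_i - 33.5)^2}{10 - 1} = \frac{(10 - 33.5)^2 + \dots + (65 - 33.5)^2}{9} = \frac{2574.5}{9} \approx 286.06$
$s \approx 16.91$
Rounded to whole numbers, $s^2 \approx 286$ and $s \approx 17$.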
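
The variance and standard deviation of both sets can be checked with the `statistics` module, whose `variance` and `stdev` functions use the n − 1 (unbiased) divisor. The sketch below also evaluates the computational formula to show that it gives the same result; the function name `computational_variance` is illustrative.

```python
from statistics import stdev, variance

set_a = [25, 28, 28, 30, 30, 33, 35, 40, 41, 45]
set_b = [10, 15, 23, 28, 28, 30, 39, 45, 52, 65]


def computational_variance(data):
    """Sample variance via (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(data)
    return (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))


for name, data in (("Set A", set_a), ("Set B", set_b)):
    data_range = max(data) - min(data)
    print(name, data_range, round(variance(data), 2), round(stdev(data), 2))
```

Set A has the smaller range, variance, and standard deviation, which agrees with the earlier statement that it is more homogeneous than Set B.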

Sample Variance
Grouped Data
$s^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \bar{x})^2}{n}$
where
$f_i$ = frequency
$x_i$ = class mark
$\bar{x}$ = mean
n = total number of observations
k = number of class intervals

Measure of Relative Variation
To compare the variability of data
sets measured in different units, we use
the measure of relative variation called
the coefficient of variation. This index
expresses the standard deviation as a
percentage relative to the mean. Its
value is given by
$C.V. = \frac{s}{\bar{x}} \times 100\%$

Example:
Determine which data set is more
spread out.

We first compute the means and
standard deviations of the sets of data.
Data Set 1: $\bar{x}$ = 24 years and $s$ = 3.742 years
Data Set 2: $\bar{x}$ = P 8875 and $s$ = P 2267.984

So, we have
Data Set 1:
$C.V. = \frac{s}{\bar{x}} \times 100\% = \frac{3.742 \text{ years}}{24 \text{ years}} \times 100\% = 15.59\%$
Data Set 2:
$C.V. = \frac{s}{\bar{x}} \times 100\% = \frac{\text{P } 2267.984}{\text{P } 8875} \times 100\% = 25.55\%$
Therefore, net take-home pay is more
scattered with respect to the mean than
the teachers' years of teaching experience.
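
Because the coefficient of variation is unit-free, the comparison reduces to a single division per data set. A minimal sketch using the summary values given in the example; the function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean_value * 100


cv_experience = coefficient_of_variation(3.742, 24)     # years of teaching
cv_pay = coefficient_of_variation(2267.984, 8875)       # net take-home pay

print(round(cv_experience, 2), round(cv_pay, 2))
print("more spread out:", "pay" if cv_pay > cv_experience else "experience")
```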

Measures of Distribution FD 502
Skewness and Kurtosis
Skewness measures the degree and direction of asymmetry, that is, the
extent to which a distribution of values deviates from symmetry around
the mean. A symmetric distribution such as a normal distribution has a
skewness of 0, and a distribution that is skewed to the left, e.g. when
the mean is less than the median, has a negative skewness. A
positive skewness indicates a greater number of smaller values, and a negative
value indicates a greater number of larger values. Values for
acceptability for psychometric purposes (+/-1 to +/-2) are the same as
with kurtosis.

Skewness and Kurtosis
Kurtosis measures the "peakedness" or "flatness" of a
distribution. A kurtosis value near zero indicates a shape close to
normal. A positive value indicates a distribution which is more
peaked than normal, and a negative kurtosis indicates a shape flatter
than normal. An extreme positive kurtosis indicates a distribution
where more of the values are located in the tails of the distribution
rather than around the mean. A kurtosis value of +/-1 is considered
very good for most psychometric uses, but +/-2 is also usually
acceptable.

Interpretation of Skewness (Bulmer, 1979)
• If skewness is less than -1 or greater than 1, the
distribution is highly skewed.
• If skewness is between -1 and -0.5 or between 0.5
and 1, the distribution is moderately skewed.
• If skewness is between -0.5 and 0.5, the
distribution is approximately symmetric.

Interpretation of Kurtosis in SPSS
• A kurtosis value near zero indicates a shape close
to normal.
• A positive value indicates a distribution which is
more peaked than normal.
• A negative value indicates a shape flatter than
normal.
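
For readers who want to see where these numbers come from, the sketch below computes skewness and kurtosis from central moments and applies the skewness cut-offs listed above. It assumes the simple moment-based definitions (with 3 subtracted so that a normal distribution has kurtosis 0); statistical packages such as SPSS typically apply small-sample adjustments, so their output will differ somewhat. The function names are illustrative, and treating a value of exactly 0.5 or 1 as "moderately skewed" is an assumption.

```python
def central_moment(data, order):
    """Average of (x - mean) raised to the given power."""
    center = sum(data) / len(data)
    return sum((x - center) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal curve scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


def interpret_skewness(value):
    """Apply the cut-offs of 0.5 and 1 to the size of the skewness."""
    size = abs(value)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:   # boundary handling is an assumption
        return "moderately skewed"
    return "approximately symmetric"


hours = [5, 7, 14, 4, 6]   # sleep-hours example with one extreme value
print(round(skewness(hours), 2), interpret_skewness(skewness(hours)))
print(round(excess_kurtosis(hours), 2))
```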

Thank You…
