This document provides an overview of descriptive statistics and statistical concepts. It discusses topics such as data collection, organization, analysis, interpretation and presentation. It also covers frequency distributions, measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), and hypothesis testing. Hypothesis testing involves forming a null hypothesis and alternative hypothesis, and using statistical tests to either reject or fail to reject the null hypothesis based on sample data. Common statistical tests include ones for comparing means, variances or proportions.
4. Statistics is a branch of mathematics that deals
with data collection, organization, analysis,
interpretation and presentation.
Data collection is defined as the procedure of
collecting, measuring and analyzing accurate
insights for research using standard validated
techniques.
Data organization refers to the method of
classifying and organizing data sets to make
them more useful. It can be applied to physical
records or digital records.
5. Data analysis is a process of inspecting, cleansing,
transforming, and modeling data with the goal of
discovering useful information, informing
conclusions, and supporting decision-making.
Interpretation of data is the process of assigning
meaning to the collected information and
determining the conclusions, significance, and
implications of the findings.
Presentation of data refers to the organization of
data into tables, graphs or charts, so that logical
and statistical conclusions can be derived from the
collected measurements.
6. Descriptive statistics helps describe the
characteristics of a specific data set by giving
short summaries of the sample and measures of
the data.
Basic Statistical Concepts
A population consists of the totality of the
observations of interest, and a sample is a part
of the population. A variable is any characteristic,
number, or quantity that can be measured or
counted.
7. Two kinds of variables:
1. Qualitative variables, also called categorical
variables, are variables that are not numerical.
They describe data that fit into categories.
2. Quantitative variables are numerical. They can be
ranked and have order.
8. Quantitative variables can be classified further into
discrete variables and continuous variables.
A discrete variable is a variable whose value
is obtained by counting.
Continuous variables can assume an infinite
number of values between any two specific
values. They are obtained by measuring. They
often include fractions and decimals.
9. Examples
Discrete
number of students present
number of red marbles in a jar
number of heads when flipping three coins
students’ grade level
Continuous
height of students in class
weight of students in class
time it takes to get to school
distance traveled between classes
10. Types of Statistical Data
1.Numerical data. These data have meaning as a
measurement such as a person’s height, weight, IQ,
or blood pressure or shares of stocks a person owns.
2.Categorical data: Categorical data represent
characteristics such as a person’s gender, marital
status, hometown, or the types of movies they like.
Categorical data can take on numerical values (such
as “1” indicating male and “2” indicating female) but
those numbers don’t have mathematical meaning.
11. Four Levels of Measurement
1. Nominal – the lowest of the four ways to characterize data. It deals with
names, categories, or labels. (eg. colors of eyes, yes or no responses to a
survey, favorite breakfast cereal, and number on the back of a football
jersey).
2. Ordinal – the data at this level can be ordered, but differences between the
data are not meaningful. (eg. ten cities are ranked from one to ten, but differences
between the ranks don't make much sense; letter grades, where we can order things
so that A is higher than B but without any other information).
3. Interval – deals with data that can be ordered, and in which differences
between the data do make sense. But data at this level have no true zero as a
starting point. (eg. Fahrenheit and Celsius scales of temperature).
4. Ratio – the highest level of measurement. Data possess all of the features of
the interval level, in addition to an absolute zero. Due to the presence of a zero, it
now makes sense to compare the ratios of measurements.
13. Methods of Collecting Data
1. In-Person Interviews
Pros: In-depth and a high degree of confidence on the data
Cons: Time consuming, expensive and can be dismissed as anecdotal
2. Mail Surveys
Pros: Can reach anyone and everyone – no barrier
Cons: Expensive, data collection errors, lag time
3. Phone Surveys
Pros: High degree of confidence on the data collected, reach almost
anyone
Cons: Expensive, cannot self-administer, need to hire an agency
4. Web/Online Surveys
Pros: Cheap, can self-administer, very low probability of data errors
Cons: Not all your customers might have an email address/be on the
internet, customers may be wary of divulging information online
14. Three Ways of Presenting Data
1.Textual – this method comprises data
presentation with the help of a paragraph or a
number of paragraphs.
2.Tabular – the method of presenting data using
the statistical table. A systematic organization of
data in columns and rows.
3.Graphical – a chart representing the quantitative
variations or changes of variables in pictorial or
diagrammatic form.
16. Frequency is the number of times
something occurs.
Example 1
Jack joins football practice every Wednesday morning and
every Sunday morning and afternoon.
The frequency of Jack’s football practice every week is 3 (2 on
Sunday and 1 on Wednesday).
By counting frequencies we can make a frequency
distribution table.
17. Example 2
Jack’s team has scored the following numbers of goals in their games,
3, 1, 2, 1, 3, 2, 4, 2, 3, 2, 5, 4, 3, 2.
Jack put the numbers in order, then added up:
how often 1 occurs (2 times),
how often 2 occurs (5 times),
how often 3 occurs (4 times),
how often 4 occurs (2 times),
how often 5 occurs (1 time).
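The tally in Example 2 can be reproduced with a short script. This is a minimal sketch using Python's standard `collections.Counter`; the variable names are illustrative and not part of the original example.

```python
from collections import Counter

# Goals scored by Jack's team, as listed in Example 2
goals = [3, 1, 2, 1, 3, 2, 4, 2, 3, 2, 5, 4, 3, 2]

# Count how often each number of goals occurs
frequency = Counter(goals)

# Print the frequency distribution table in increasing order of goals
print("Goals  Frequency")
for score in sorted(frequency):
    print(f"{score:>5}  {frequency[score]:>9}")
```

The frequencies add up to 14, the number of games listed.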
18. Graphical Representation of Frequency Distribution
A. Bar Graph is a pictorial representation of statistical data in such a way
that length of the rectangles in the graph represents the proportional value
of the variable. Bar graphs are generally used to compare the values of
several variables at a time to analyze data. The length of the bars
(horizontal or vertical) represents the frequency of the variable and is
applicable to discrete categories only.
19. B. Line graph or Line chart is a graphical display of information that
changes continuously over time. Within a line graph, there are points
connecting the data to show a continuous change. The lines in a line graph
can descend and ascend based on the data. We can also compare different
events, situations, and information.
20. C. Pie Chart is a type of graph that displays data in a circular graph. The
pieces of the graph are proportional to the fraction of the whole in each
category. Each slice of the pie is relative to the size of that category in the
group as a whole. The entire “pie” represents 100 percent of a whole, while
the pie “slices” represent portions of the whole.
22. A. Mean
It is the most common measure of central location. It is obtained
by dividing the sum of all values of the observations by the number
of observations. To compute the mean, we use
\[ \bar{x} = \frac{\sum x}{n} \]
where x is the value of each observation in the sample
n is the total number of observations in the sample
It is worth noting that the mean has the following characteristics:
1. The mean is affected by the presence of extreme values.
2. The sum of the deviations of the observations from the mean is zero.
3. The sum of the squared deviations of the observations from the
mean is minimum.
4. It is a good measure for interval and ratio type of data.
23. B. Median
It is the middle value of a set of observations arranged in
increasing or decreasing order. This measure divides the
data into two groups with an equal number of observations.
The median has the following characteristics:
1. It is not affected by the presence of extreme observations.
2. The sum of absolute deviations of the observation from
the median is minimum.
3. It is an appropriate measure for an ordinal type of data.
24. C. Mode
It is the value that occurs most often. Note that a data
set can have two modes. In such a case, the distribution
of the data set is bimodal (with two modes).
When a data set has more than two modes, the
distribution is called multimodal.
The mode has the following characteristics:
1. Mode is determined by frequency.
2. It is an appropriate measure for nominal data.
25. Example 1 (for ungrouped data)
The following are the 3rd year math grades of an applied math student:
1.6 1.2 1.9 1.5 1.5 1.5 1.0 1.3 1.0
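Mean:
\[ \bar{X} = \frac{X_1 + X_2 + \cdots + X_9}{9} = \frac{1.6 + 1.2 + 1.9 + 1.5 + 1.5 + 1.5 + 1.0 + 1.3 + 1.0}{9} = \frac{12.5}{9} \approx 1.39 \]
Median (the grades arranged in increasing order; the middle value is the 5th):
1.0 1.0 1.2 1.3 1.5 1.5 1.5 1.6 1.9
Median: 1.5
Mode: 1.5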
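The three measures in Example 1 can be checked with Python's standard `statistics` module. This is only a sketch for verification; the variable names are assumptions made for illustration.

```python
import statistics

# 3rd year math grades from Example 1
grades = [1.6, 1.2, 1.9, 1.5, 1.5, 1.5, 1.0, 1.3, 1.0]

mean = statistics.mean(grades)      # sum of the grades divided by 9
median = statistics.median(grades)  # middle value of the sorted grades
mode = statistics.mode(grades)      # most frequent grade

print(f"Mean:   {mean:.2f}")
print(f"Median: {median}")
print(f"Mode:   {mode}")
```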
26. Example 2 (for grouped data)
The mean for grouped data is given by
\[ \bar{x} = \frac{\sum f_i x_i}{n} \]
where $f_i$ is the frequency of the ith class interval
$x_i$ is the class mark of the ith interval
$n$ is the total number of observations ($n = \sum f_i$)
Solving for the mean:
Class limit   f    x      fx      <cf   Class boundaries
60 – 67       2    63.5   127     30    59.5 – 67.5
52 – 59       2    55.5   111     28    51.5 – 59.5
44 – 51       6    47.5   285     26    43.5 – 51.5
36 – 43       10   39.5   395     20    35.5 – 43.5
28 – 35       7    31.5   220.5   10    27.5 – 35.5
20 – 27       3    23.5   70.5    3     19.5 – 27.5
(<cf is the less-than cumulative frequency, accumulated from the lowest class upward.)
\[ \bar{x} = \frac{127 + 111 + 285 + 395 + 220.5 + 70.5}{30} = \frac{1209}{30} = 40.3 \]
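The grouped mean above can be reproduced from the class limits and frequencies alone. The sketch below assumes the class mark is the midpoint of the class limits, which matches the x column of the table; the names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class in the table
classes = [(60, 67, 2), (52, 59, 2), (44, 51, 6),
           (36, 43, 10), (28, 35, 7), (20, 27, 3)]

# Total number of observations
n = sum(f for _, _, f in classes)

# Sum of f * x, where x is the class mark (midpoint of the class limits)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)

grouped_mean = sum_fx / n
print(f"n = {n}, sum of fx = {sum_fx}, mean = {grouped_mean:.1f}")
```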
27. The median for grouped data is given by
\[ Md = LCB + \left( \frac{\frac{n}{2} - cf_p}{f_m} \right) i \]
where LCB is the lower boundary of the median class
$i$ is the size of the class interval
$cf_p$ is the cumulative frequency of the interval preceding the median class
$f_m$ is the frequency of the median class
Median Class – the class containing the cumulative frequency equal to n/2 or the
next higher value.
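28. Solving for median:
\[ \frac{n}{2} = \frac{30}{2} = 15 \]
Median class: 36 – 43 (its cumulative frequency, 20, is the first to reach 15)
Lower class boundary of the median class
LCB = 35.5
Cumulative frequency before the median class
$cf_p$ = 10
Frequency of the median class
$f_m$ = 10
Class size (i) = 8
\[ \text{Median} = LCB + \left( \frac{\frac{n}{2} - cf_p}{f_m} \right) i = 35.5 + \left( \frac{15 - 10}{10} \right) 8 = 39.5 \]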
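The same steps can be sketched in code: accumulate frequencies from the lowest class upward, stop at the first class whose cumulative frequency reaches n/2, and apply the formula. The sketch assumes whole-number class limits, so each lower class boundary is the lower limit minus 0.5, as in the table; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency), ordered from the lowest class upward
classes = [(20, 27, 3), (28, 35, 7), (36, 43, 10),
           (44, 51, 6), (52, 59, 2), (60, 67, 2)]

n = sum(f for _, _, f in classes)
half = n / 2

cf_before = 0  # cumulative frequency of the classes below the current one
for lo, hi, f in classes:
    if cf_before + f >= half:
        # This is the median class
        lcb = lo - 0.5        # lower class boundary
        width = hi - lo + 1   # class size i
        grouped_median = lcb + (half - cf_before) / f * width
        break
    cf_before += f

print(f"Median class: {lo}-{hi}, LCB = {lcb}, cf_p = {cf_before}, "
      f"i = {width}, Md = {grouped_median}")
```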
29. The mode for grouped data is given by
\[ Mo = LCB + \left( \frac{f_m - f_1}{2 f_m - f_1 - f_2} \right) i \]
where LCB is the lower boundary of the modal class
$i$ is the size of the class interval
$f_m$ is the frequency of the modal class
$f_1$ is the frequency of the class preceding the modal class
$f_2$ is the frequency of the class following the modal class
Modal Class – the class with the highest frequency.
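As an illustration only, the mode formula can be applied to the frequency table from Example 2 (for grouped data); the text itself does not work this case. The sketch assumes that the "preceding" class is the next lower class and the "following" class is the next higher one, and that a missing neighbour counts as frequency 0.

```python
# (lower limit, upper limit, frequency), ordered from the lowest class upward
classes = [(20, 27, 3), (28, 35, 7), (36, 43, 10),
           (44, 51, 6), (52, 59, 2), (60, 67, 2)]

# Index of the modal class (the class with the highest frequency)
m = max(range(len(classes)), key=lambda k: classes[k][2])
lo, hi, f_m = classes[m]

# Frequencies of the neighbouring classes (assumed 0 if there is no neighbour)
f_1 = classes[m - 1][2] if m > 0 else 0
f_2 = classes[m + 1][2] if m < len(classes) - 1 else 0

lcb = lo - 0.5       # lower class boundary of the modal class
width = hi - lo + 1  # class size i

grouped_mode = lcb + (f_m - f_1) / (2 * f_m - f_1 - f_2) * width
print(f"Modal class: {lo}-{hi}, f_m = {f_m}, f_1 = {f_1}, f_2 = {f_2}")
print(f"Mode = {grouped_mode:.2f}")
```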
32. Variability for Ungrouped Data
• Range - The range (R) is defined as the difference between the
highest value (HV) and the lowest value (LV) in the data. That is,
\[ R = HV - LV \]
• Variance
It is defined as the average of the squared deviations from the mean.
It is the measure that considers the position of each observation
relative to the mean.
\[ s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \]
or
\[ s^2 = \frac{n \sum x^2 - \left( \sum x \right)^2}{n(n - 1)} \]
33. • Standard Deviation (the most widely encountered) - It is
the measure of the spread or dispersion of scores from the
mean of distribution. It is the square root of the variance.
\[ s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}} \]
or
\[ s = \sqrt{\frac{n \sum x^2 - \left( \sum x \right)^2}{n(n - 1)}} \]
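To see that the two forms of the variance agree, the sketch below applies both to the grades from Example 1 (for ungrouped data). Reusing that data set here is an assumption made for illustration; the text does not compute its variability.

```python
import math

# Grades from Example 1 (for ungrouped data)
grades = [1.6, 1.2, 1.9, 1.5, 1.5, 1.5, 1.0, 1.3, 1.0]
n = len(grades)
mean = sum(grades) / n

# Range: highest value minus lowest value
data_range = max(grades) - min(grades)

# Definitional form: sum of squared deviations divided by n - 1
var_def = sum((x - mean) ** 2 for x in grades) / (n - 1)

# Computational form: uses only the sum of x and the sum of x squared
sum_x = sum(grades)
sum_x2 = sum(x * x for x in grades)
var_comp = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

# Standard deviation: square root of the variance
sd = math.sqrt(var_def)

print(f"Range = {data_range:.1f}")
print(f"Variance = {var_def:.4f} (definitional), {var_comp:.4f} (computational)")
print(f"Standard deviation = {sd:.4f}")
```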
Variability for Grouped Data
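Range:
\[ R = \text{Highest class mark} - \text{Lowest class mark} \]
Variance:
\[ s^2 = \frac{n \sum f x^2 - \left( \sum f x \right)^2}{n(n - 1)} \]
Standard Deviation:
\[ s = \sqrt{\frac{n \sum f x^2 - \left( \sum f x \right)^2}{n(n - 1)}} \]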
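As a sketch, the grouped formulas can be applied to the class marks and frequencies from Example 2 (for grouped data). The text does not work this case, so the printed values are an illustration under that assumption, not results quoted from the source.

```python
import math

# (class mark x, frequency f) from the grouped-data table
table = [(63.5, 2), (55.5, 2), (47.5, 6), (39.5, 10), (31.5, 7), (23.5, 3)]

n = sum(f for _, f in table)
sum_fx = sum(f * x for x, f in table)
sum_fx2 = sum(f * x * x for x, f in table)

# Range: highest class mark minus lowest class mark
grouped_range = max(x for x, _ in table) - min(x for x, _ in table)

# Variance and standard deviation from the computational formula
variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
sd = math.sqrt(variance)

print(f"Range = {grouped_range}")
print(f"Variance = {variance:.2f}")
print(f"Standard deviation = {sd:.2f}")
```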
35. Hypothesis testing is the most significant area of statistical
inference. It is a step-by-step process in making inferences
(conclusions) about a population.
The truth of a statistical hypothesis can only be assessed by
taking a portion (a sample) of the population of interest and using
the information obtained from this portion to decide whether the
statistical hypothesis is likely to be true or false. We either “reject”
the statistical hypothesis when the sample is inconsistent with it,
or “do not reject” it otherwise. Note that rejecting a statistical
hypothesis means that the sample gives strong evidence against it,
while failing to reject it does not necessarily mean it is true.
Failing to reject the stated hypothesis implies only that there is
not enough evidence to reject it.
36. Types of Statistical Hypothesis
We use the term null hypothesis for the hypothesis we
want to test, that is, to either reject or fail to reject, denoted
by H0. If the null hypothesis is rejected, the alternative
hypothesis, denoted by H1, is then accepted. The null hypothesis
H0 is stated such that it specifies an exact value, while the
alternative hypothesis H1 is stated such that it allows for a
range of values. For example, if the null hypothesis H0 is
𝜇 = 8, the alternative hypothesis H1 might be 𝜇 < 8, 𝜇 > 8,
or 𝜇 ≠ 8.
37. Types of Statistical Tests
If the alternative hypothesis of a statistical test is one-sided,
for example, H1: 𝜇 < 8 or H1: 𝜇 > 8, the test is said to be a
one-tailed test. On the other hand, if the alternative
hypothesis is two-sided, for example, H1: 𝜇 ≠ 8, the test is
said to be two-tailed.
Types of Error
However, the decision to reject or not reject a statistical
hypothesis about a population parameter is based on sample data,
so it might lead to a wrong conclusion. For instance, a researcher
could reject H0 when, in fact, it is true. This is called a type I
error. Also, one might fail to reject H0 even when it is false. In
this case, a type II error has occurred.
38. Constructing the Null and Alternative Hypothesis
A. Testing for Means
In hypothesis testing, means, variances, or proportions may
be compared so as to justify the decision to reject or not reject
the null hypothesis. In many cases, the sample means of an
experimental group and a control group are compared.
39. Example 1
1. A researcher wants to know if the average test score of the students taking a
particular examination is 80.
H0: 𝜇 = 80 (the average test score of the students taking a
particular examination is 80)
H1: 𝜇 ≠ 80 (the average test score of the students taking a
particular examination is not 80)
2. A small group of researchers is conducting a study to show if the average
number of hours a student spends on social media sites per day is greater than
10.
H0: 𝜇 = 10 (average number of hours a student spends on social
media sites per day is 10)
H1: 𝜇 > 10 (average number of hours a student spends on social
media sites per day is greater than 10)
40. 3. A teacher wants to know if there is a difference in the performance of his
two classes based on their average grades.
H0: 𝜇1 = 𝜇2 (there is no difference in the performance of his two
classes based on their average grades)
H1: 𝜇1 ≠ 𝜇2 (there is a difference in the performance of his two
classes based on their average grades)
4. A researcher wants to study if the customer satisfaction level of cable
television company A is greater than that of cable television company B.
H0: 𝜇1 = 𝜇2 (the customer satisfaction levels of the two competing
cable television companies are the same)
H1: 𝜇1 > 𝜇2 (the customer satisfaction level of cable television
company A is greater than that of cable television company B)
41. 5. A clinical trial is conducted to compare three different weight
loss programs based on the average weight measured among three
groups at the end of the program.
H0: 𝜇1 = 𝜇2 = 𝜇3 (there is no difference among the three weight
loss programs)
H1: at least two of the means are not equal
(there is a difference among the three weight loss
programs)
42. B. Testing for Independence
The chi-square ($\chi^2$) test is used to test the independence of two
variables. In other words, this test is used to determine whether the
two variables are related or not, based on sample data collected on
both variables.
Example 2
1. A survey is conducted to test if the grades of the students are associated with the number of
hours they spend on social media sites.
H0: The grades of the students are not associated with the number of hours they spend
on social media sites.
H1: The grades of the students are associated with the number of hours they spend on
social media sites.
2. A study tests whether daily consumption depends on the age level of a person.
H0: Daily consumption does not depend on the age level of a person.
H1: Daily consumption depends on the age level of a person.
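The text does not give the chi-square formula or any data for these examples. As a rough sketch only, the code below computes the usual Pearson statistic, the sum of (observed − expected)² / expected over all cells, for a small table of hypothetical counts. The counts, the category labels, and the variable names are all assumptions made for illustration.

```python
# Hypothetical counts (illustration only, not data from the text):
# rows    = hours on social media per day (few, many)
# columns = grade level (high, low)
observed = [[20, 10],
            [10, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under H0 (independence), expected count = row total * column total / total
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected) ** 2 / expected

# Degrees of freedom = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(f"chi-square = {chi_square:.3f}, degrees of freedom = {df}")
```

The statistic would then be compared with a critical value from a chi-square table at the chosen significance level to decide whether to reject H0.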
43. C. Correlation
To determine whether two variables (usually x and y) are
linearly related, correlation is the statistical method to be used.
In this method, the data collected on two numerical variables
are tested to determine the strength of their relationship
estimated by the sample correlation coefficient r given by
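\[ r = \frac{n \left( \sum xy \right) - \left( \sum x \right) \left( \sum y \right)}{\sqrt{\left[ n \left( \sum x^2 \right) - \left( \sum x \right)^2 \right] \left[ n \left( \sum y^2 \right) - \left( \sum y \right)^2 \right]}} \]
where −1 ≤ r ≤ 1 and
n = number of data pairs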
44. If the value of 𝑟 is close to positive 1, then there is a strong positive linear
relationship between the two variables. If 𝑟 is close to negative 1, there is a
strong negative linear relationship between them. However, if the two
variables have a weak or no linear relationship, 𝑟 is close to 0.
Example 3
1. A study is conducted to show how strong the relationship is between the sleeping habits of
employees and their level of performance at work.
H0: The sleeping habits of employees are not related to their level of performance at work.
H1: The sleeping habits of employees are related to their level of performance at work.
2. A student wants to know if his grade in Mathematics is associated with his grade in English.
H0: His grade in Mathematics is not associated with his grade in English.
H1: His grade in Mathematics is associated with his grade in English.
45. 3. A researcher wishes to see whether there is a relationship
between number of hours of study and test scores on an exam.
The following data were obtained.
Student   Hours of Study   Grade
A         7                83
B         3                63
C         2                60
D         6                88
E         3                68
F         4                75
46. Solution:
To solve for the correlation coefficient r, we must first find the
values of $xy$, $x^2$, and $y^2$.
Student   Hours of Study (x)   Grade (y)   xy     x^2   y^2
A         7                    83          581    49    6889
B         3                    63          189    9     3969
C         2                    60          120    4     3600
D         6                    88          528    36    7744
E         3                    68          204    9     4624
F         4                    75          300    16    5625
Σx = 25   Σy = 437   Σxy = 1922   Σx^2 = 123   Σy^2 = 32451
47. Substituting the values into the formula,
\[ r = \frac{(6)(1922) - (25)(437)}{\sqrt{\left[ 6(123) - (25)^2 \right] \left[ 6(32451) - (437)^2 \right]}} = \frac{607}{\sqrt{(113)(3737)}} \]
r = 0.934
Since the correlation coefficient is close to +1, it indicates
a strong positive linear relationship between the number of
hours of study and test scores on an exam of students.
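The substitution can be checked with a few lines of code. This is a minimal sketch that computes the sums from the six data pairs and applies the formula for r directly; the variable names are illustrative.

```python
import math

# Data from the example: hours of study (x) and grade (y) for students A to F
hours = [7, 3, 2, 6, 3, 4]
grades = [83, 63, 60, 88, 68, 75]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)

# Sample correlation coefficient r
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"sums: x={sum_x}, y={sum_y}, xy={sum_xy}, x2={sum_x2}, y2={sum_y2}")
print(f"r = {r:.3f}")
```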
48. D. Regression
Computing the correlation coefficient means determining the
strength of the relationship between two numerical variables. When
the resulting correlation coefficient is significant, then regression
analysis can be done. Regression is used to understand the movement
or trend of the given data so predictions can be made.
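The regression equation is given by $y' = a + bx$, where a is the
y-intercept and b is the slope of the line:
\[ a = \frac{\left( \sum y \right) \left( \sum x^2 \right) - \left( \sum x \right) \left( \sum xy \right)}{n \left( \sum x^2 \right) - \left( \sum x \right)^2} \]
\[ b = \frac{n \left( \sum xy \right) - \left( \sum x \right) \left( \sum y \right)}{n \left( \sum x^2 \right) - \left( \sum x \right)^2} \]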
49. Example 4
Let us take the example in the correlation section, since a strong linear relationship exists
between the number of hours of study and test scores on an exam of students.
Solution:
Since the sums of $x$, $y$, $xy$, and $x^2$ are necessary to solve for a and b, we must find them first.
Student   Hours of Study (x)   Grade (y)   xy     x^2   y^2
A         7                    83          581    49    6889
B         3                    63          189    9     3969
C         2                    60          120    4     3600
D         6                    88          528    36    7744
E         3                    68          204    9     4624
F         4                    75          300    16    5625
Σx = 25   Σy = 437   Σxy = 1922   Σx^2 = 123   Σy^2 = 32451
50. Then we have,
\[ a = \frac{(437)(123) - (25)(1922)}{6(123) - (25)^2} = \frac{5701}{113} = 50.451 \]
\[ b = \frac{(6)(1922) - (25)(437)}{6(123) - (25)^2} = \frac{607}{113} = 5.372 \]
Hence, the equation of the regression line is
\[ y' = 50.451 + 5.372x \]
Suppose we want to predict the grade ($y'$) of a student who studies for x
hours. For example, let x = 9. Then,
\[ y' = 50.451 + 5.372(9) = 98.80 \]
Let x = 5. Then,
\[ y' = 50.451 + 5.372(5) = 77.31 \]
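The regression coefficients and both predictions can be reproduced from the raw data. In this sketch, `predict` is a hypothetical helper name introduced for illustration. It uses the unrounded values of a and b, which agree with the hand computation to two decimal places.

```python
# Data from the example: hours of study (x) and grade (y) for students A to F
hours = [7, 3, 2, 6, 3, 4]
grades = [83, 63, 60, 88, 68, 75]

n = len(hours)
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)

# Intercept a and slope b of the regression line y' = a + bx
denominator = n * sum_x2 - sum_x ** 2
a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
b = (n * sum_xy - sum_x * sum_y) / denominator


def predict(x):
    """Predicted grade y' for x hours of study (hypothetical helper)."""
    return a + b * x


print(f"a = {a:.3f}, b = {b:.3f}")
print(f"y'(9) = {predict(9):.2f}")
print(f"y'(5) = {predict(5):.2f}")
```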