General Principles of Intellectual Property: Concepts of Intellectual Proper...
An Overview of Basic Statistics
1. AN OVERVIEW OF BASIC STATISTICS
Statistics being a branch of mathematics is often associated with anxiety or unease
particularly among students who fear mathematics for one reason or another. Well, if you are
one of those students then sit, relax, and read this chapter first as it will take you through a
journey in which you will discover a world of fun, excitement, vision, and creativity. With
minimum mathematical details, this chapter introduces key concepts and universal terms that
are used among statisticians and briefly discusses common statistical tools, their underlying
principles and their practical merits.
1
2. Why should you learn statistics?
In the general sense, statistics is the science of dealing with variability, uncertainty, and subjectivity
to produce objective and quantitative information that can assist in making reliable decisions about
numerous situations in life. Globally, statistics is a key tool in governments and organizations
activities.
The reason we need statistics is that we are living in a world of numbers, or more precisely a world of data.
• Numbers are called data, only when they reflect information
• Data, on the other hand, are called statistics when they reflect specific or descriptive measures
of the phenomenon or the event under study
2
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
…………………………………………………………………………………………………………………………………………………………………………………………….
3. What is statistics?
3
Definition Source
"The mathematics of the collection, organization, and
interpretation of numerical data, especially the analysis of
population characteristics by inference from sampling."
American Heritage Dictionary®
"A branch of mathematics dealing with the collection, analysis,
interpretation, and presentation of masses of numerical data."
The Merriam-Webster’s Collegiate
Dictionary®
“The scientific application of mathematical principles to the
collection, analysis, and presentation of numerical data.”
The American Statistical
Association (ASA,
http://www.amstat.org/)
Definitions of statistics
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
4. What is statistics?
“The science and art of reading, describing, and manipulating data, which represents variables so that
practical observations about a population can be made from a sample drawn from the population, and
guidelines can be established to allow making precise and accurate conclusions about a certain
process or system”
Statistics: between science and art
• Science stems from the use of mathematical concepts
• Art stems from extrapolation, interpretation, and judgment
It is well-known, statistics does not provide causes and effects;
it only yields analysis outcome based on the data used.
It is then your job to provide causes and effects.
This is where the art of reading data and interpreting the
results of the statistical analysis comes into play.
4
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
5. 5
What is statistics?
“The science and art of reading, describing, and manipulating data, which represents variables so that
practical observations about a population can be made from a sample drawn from the population, and
guidelines can be established to allow making precise and accurate conclusions about a certain
process or system”
Numbers, data and statistics
100, 140, 213, 230, 180, 211, 120, 160, 200, 110, 260, 235, 280, 180, 300
NUMBERS
Human Weight (pound):
100, 140, 213, 230, 180, 211, 120, 160, 200, 110, 260, 235, 280, 180, 300
DATA
Statistics
Mean Value = 194.6 pound
Minimum = 100 pound
Maximum = 300 pound
Range = 200 pounds
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
6. 6
What is statistics?
“The science and art of reading, describing, and manipulating data, which represents variables so that
practical observations about a population can be made from a sample drawn from the population, and
guidelines can be established to allow making precise and accurate conclusions about a certain
process or system”
A Constant is a parameter that does not change:
- Your social security number will not change over time
- Your birth date will not change over time
A Constant may also refer to a controlled variable:
*** If you can maintain your grade at an “A” from course to course, it becomes
a controlled variable or a constant
A Variable is a parameter that is likely to change
- Grades of different students
- Your income from one year to another
- The heights of different students in your class
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
7. 7
What is statistics?
“The science and art of reading, describing, and manipulating data, which represents variables so that
practical observations about a population can be made from a sample drawn from the population, and
guidelines can be established to allow making precise and accurate conclusions about a certain
process or system”
Population Sample
A population implies a totality or a complete collection of things
(all students in a college, all people in a town, all machines in a
factory, and so on)
A sample is a sub-collection of units or components selected from
a population (few students from a college, few people from a town,
and few machines from a factory, and so on)
A census is a collection of data from every member of the
population (all students’ grades, all people ages, and all machines’
efficiencies, and so on).
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
9. 9
Working Problem 1.1:
Major television networks constantly monitor the popularity of their programs via asking
some specialized organizations such as Nielsen company to sample the preferences of
TV viewers.
(a) Suppose 1000 TV prime-time viewers selected randomly were asked if they watched a new talk-show, and 450
indicated they watched the show. Identify the population and the sample. Describe the data set.
(b) In another survey, suppose 1100 TV prime-time viewers selected randomly were asked if they watched the 2009
Super Ball on TV, and 999 indicated they watched the game. Identify the population and the sample. Describe the
data set.
What is statistics?
“The science and art of reading, describing, and manipulating data, which represents variables so that
practical observations about a population can be made from a sample drawn from the population, and
guidelines can be established to allow making precise and accurate conclusions about a certain
process or system”
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
19. 19
Simple Numerical descriptive statistics
Grades 82 76 85 88 95 86 84 87 96 78
Using statistics, we can provide two types of description:
76 78 82 84 85 86 87 88 9596
(1) Data Center (e.g. Mean Value)
(2) Data Spread (Variability)
(e.g. Range)
Example:
R = 96-76 = 20
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
20. 20
Working Problem 1.4:
Calculate the mean and the range for the following data set of people
income ($):
20,000, 22,000 ,28,000, 30,000, 27,000, 45,000, 50,000, 60,000
Simple Numerical descriptive statistics
Working Problem 1.5:
Calculate the mean and the range for the following data set of people weight (pound)
125, 145, 160, 155, 110, 95, 175, 158
Working Problem 1.6:
Calculate the mean and the range for the following data set of people temperature (Fo)
95.5, 98.0, 99.0, 96.5, 95.0, 97.0, 96.0, 98.0
Working Problem 1.7:
Calculate the mean and range for the following data set of property taxes ($)
8100, 3500, 7000, 4200, 3000, 5000, 5100, 4000, 7500, 4800
24. 24
Key Points about Descriptive Statistics:
• Any statistical analysis should begin by performing descriptive statistics
• Descriptive statistics represent a powerful tool of data description particularly
when a large amount of data must be analyzed
•The key elements of descriptive statistics are the measures of central tendency
and the measures of variability
• Using descriptive statistics, data abnormality or data errors can be detected
even for a large amount of data
• Frequency distributions and histograms represent graphical approaches of data
description
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
25. 25
What is probability?
Most people use terms such as chance, likelihood, or probability to reflect the level of uncertainty
about some issues or events. Some of the common examples of using these terms are as follows:
• As you watch the news every day, you hear forecasters saying that there is a 70% chance of rain
tomorrow.
• As you plan to enter a new business, an expert in the field tells you that the probability of making a
profit in this business is only 0.4, or there is a 40% chance that you will make a profit.
• As you take a new course, you may be wondering about the likelihood of passing or failing the
course.
• Your friend is undergoing a surgery and the physician is telling him that his chance of surviving the
surgery is 95%.
• You hear it on health news all the time that a smoker has a greater chance of getting lung cancer
than a nonsmoker.
These are all expressions of probability that we often hear or read about and they can affect
our planning or intention to do or not to do things in life.
26. 26
What is a probability Value?
The general definition of probability is that it is a value between zero and one, which reveals
the relative possibility an event will occur.
• A probability of zero or close to zero implies that an event is very improbable to occur
• A probability of one or close to one gives us higher assurance that an event will occur.
• Between these two extremes, different values of probability will be expressed as a decimal
such as 0.33, 0.7, or 0.50, as a fraction such as 1/3, 7/10, or 1/2, or as a percent such as
33.33%, 70%, or 50%.
Classically, probability is defined by the ratio of the number of particular target outcomes of an
event to the number of all possible equally likely and mutually exclusive outcomes. For example,
we know that in tossing a coin, the chance of head being the outcome is 50% or, in the context
of probability, we say that the probability of head occurrence is 0.5. This value is a direct
calculation from the classic definition of probability where the total number of possible
outcomes in tossing a coin is 2 (head or tail), and the chance of head being the outcome is 1/2.
H T
27. 27
Sampling and sampling techniques
Sampling:
•Selecting a number of representative samples (say, k samples, each of size n) from a
population using a technique suitable for the population under consideration and the
purpose of testing
•Testing the samples and listing sample data
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Why Sampling?
• Population can be too immense to test
• Test can be destructive
28. 28
What are the different types of sampling?
Random
Sampling
Population Sample
Stratified
Sampling
I II
III IV
I
II
IV
III
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
29. 29
What are the different sources of variability?
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
30. 30
Working Problem 1.9:
Identify the following sources of variability as inherent or induced:
(a) If one picks any boll of cotton from the field, one will not find two cotton
fibers in this boll that are similar in length, diameter, or maturity.
Inherent ( ) Induced ( )
b. You open a box of water bottles and you find that some bottles are completely filled and some are
half-empty
Inherent ( ) Induced ( )
c. At the workplace, you find some people performing better than other people of the same
experience and background
Inherent ( ) Induced ( )
d. In a sample of natural soil aggregates taken from a certain area you find no two soil particles that
are alike in size or texture.
Inherent ( ) Induced ( )
31. 31
What are the different types of variables?
Classification by data nature or measurement
levels:
•Continuous variables
•Discrete variables
•Special variables
Classification by Information Nature:
Quantitative Variables
Qualitative Variables
Two Ways to Classify variables:
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
32. 32
Working Problem 1.10:
Decide whether the following variables are discrete or continuous. Explain your reasoning.
1. The weights of football players in a team:
Discrete ( ) Continuous ( )
2. The number of defects in a shipment of cellular phones:
Discrete ( ) Continuous ( )
3. The number of passing students in a course:
Discrete ( ) Continuous ( )
4. Student grades in a course:
Discrete ( ) Continuous ( )
5. The speed at which different cars run on a highway:
Discrete ( ) Continuous ( )
33. 33
What are the levels of measurements?
Nominal Measures
The Pyramid of Data Levels of Measurements
Qualitative, no magnitude, no order
(e.g. colors, genders, country name, person name)
Ordinal Measures
Zero has no numerical meaning, and meaningful order
(e.g. ranking products as 1= superior, 2= very good, 3=
good, 4= fair, 5= poor)
Interval
Measures
Zero has no numerical meaning, meaningful
order but constant incremental difference (e.g.
temperature)
Ratio
Measures
Quantitative, meaningful magnitude, zero
has numerical meaning, ratio between
numbers is meaningful, and meaningful
order (e.g. weight, and length)
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
34. 34
In order to avoid any confusion about what level of measurement a variable belongs to, we
should begin by addressing the following key questions (Michael Sullivan III, 2010, Statistics-
Informed Decisions Using Data, Third Edition, Prentice Hall, Pearson Education, Inc.):
• Does the variable simply categorize each individual? If so, the variable is nominal (e.g.
gender)
• Does the variable categorize and allow ranking of each value of the variable? If so, the
variable is ordinal (letter grade in your calculus class)
• Do differences in values of the variable have meaning, but a value of zero does not mean
the absence of the quantity? If so, the variable is interval (e.g. temperature).
• Do ratios of values of the variable have meaning and there is a natural zero starting point?
If so, the variable is ratio (e.g. human weight and number of hours you study every week)
Key tips to figure out the level of measurement:
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
35. 35
Example: What is the level of measurement for each of the following variables?
a. Your score in the first Math quiz
b. People income in a given business
c. Classification of students by gender
d. Ranking of education by K-12, community college, and four-year University
e. The number of hours you spend watching TV every week
What are the levels of measurements?
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
36. 36
Working Problem 1.11:
What is the level of measurement reflected by the following data?
(a) The ages of a sample of students entering first year of college:
18, 21, 19, 17, 16, 22, 21, 22, 23, 18, 18, 17, 19
(b) In a survey of 500 luxury-house owners (above $2million price), 200 were from California, 150 from New York, and 150 from
Florida.
Working Problem 1.12:
What is the level of measurement for each of the following variables?
a. Student scores in the first stat test
b. Waiting time (minutes) waiting for a school bus
c. Classification of employee by gender
d. A ranking of students as freshman, sophomore, junior, and senior
e. The weight of football players
37. 37
What is a Normal Distribution?
The normal distribution is the key to numerous Statistical Concepts
A normal data set of a variable will typically have
the following basic characteristics:
• Relatively few components or elements will exhibit low values of the variable
• Relatively few components or elements will exhibit high values of the variable
•The majority of values will be in the middle
Frequency
Characteristic value
10 20 30 40 50 60 70 80
Examples:
• People Income
• Student Grades
• People Weight
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
38. 38
Center Values
High Values
Low Values
Frequency
Characteristic value
Measures of Central Tendency
Measures of Data Dispersion
Mean
Mode
Median
General Features of a Normal Distribution
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
39. 39
Comparison of the Distributions of Area of Ceramic Tiles of Two Different Types
Type A
Type B
Frequency
AreaMean
Mode
Median
High variability
Low variability
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Example: The Areas of Two Different Types of Ceramic Tiles
40. 40
Characteristic value
Frequency
Distributions associated with Different Categories of Variables
Larger-the-Better Variables
- Students grades
- Machine efficiency
- People income
Nominal-the-Best Variables
- Thickness of wood board
- Area of ceramic tiles
- Shoe size
Smaller-the-Better Variables
- Number of defects
- Number of failing students
- Number of car accidents
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Classifying Variables by the Nature of Values:
41. 41
Working Problem 1.13:
The frequency distributions below illustrate the grades of three quizzes for students taking a
biology class. The final frequency distribution combines all the grades of the three quizzes. Describe
the students’ performances in each quiz and the overall performance of the class.
0
5
10
15
20
25
30
35
40
45
50
Percent
Quiz 1
0
5
10
15
20
25
30
35
40
Percent
Quiz 2
0
5
10
15
20
25
30
35
40
45
50
Percent
Quiz 3
0
5
10
15
20
25
Percent
All Quizzes
42. 42
40 45 50 55 60 65 70 75 80 85 90 95
50
45
40
35
30
25
20
15
10
5.0
0.0
Quiz 1
Quiz 2
Quiz 3
All Quizzes
Grades
PercentFrequency
Working Problem 1.14:
The frequency distributions shown below are the same as those in Working Problem 1.13 but
superimposed. Describe the progressive students’ performances in each quiz and the overall
performance of the class.
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
43. 43
Working Problem 1.15:
The frequency distribution shown below represents the average grades of the three quizzes in
Working Problems 1.13 or 1.14. Describe this distribution.
0
5
10
15
20
25
30
35
Percent
Grade Averages
Frequency Distribution of Average Grades of the Three Quizzes
44. 44
Inferential statistics: how do you estimate population parameters from sample statistics?
Population
Sample
Population Parameters
Population mean (m)
Population Mode
Population Range
Population Standard Deviation (s)
Inferential Statistics
Notes:
……………………………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………………………
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Descriptive Statistics: The analysis of determining sample statistics (e.g. mean and range)
Inferential Statistics: The analysis of estimating population parameters from sample
statistics
Descriptive Statistics
Sample Statistics
Sample mean (X-bar)
Sample Mode
Sample Range
Sample Standard Deviation (s)
Hinweis der Redaktion
Solution: See Working Problem slides
Solution: See Working Problem slides
In describing a measure, we may have one of the following four scenarios:
Inaccurate and imprecise measure
Accurate but imprecise measure
Precise but inaccurate measure
Precise and accurate measure
Solution: See Working Problem slides
Process and system
In general, the term ‘process’ can be used to imply one or more of five basic elements: machine, material, methodology, people, and environment.
The term ‘system’ is typically used when we are prepared to specify inputs and outputs.
Process and system
Process Example: We may describe education as a process in which material implies course material (papers, books, etc.), machine implies media projector or computers, methodology implies the method of teaching (face-to-face, or distance learning), people implies teachers and students, and environment implies a class room or a laboratory.
System Example: a weaving machine (a loom) is a system in which the input is the yarns, and the output is the fabric
How to describe data
The starting point in any statistical analysis is to read data using the language of statistics. As in any language where a sentence will not be understood or communicated without an organized sentence structure, the language of statistics uses the so called ‘descriptive statistics’ to establish an organized and meaningful display of data with the goal being to reduce the flood of data presented down to few statistics that can fully describe the data. Another important reason for performing descriptive statistics is to detect any abnormality in the data that may result from typographical errors, data acquisition errors, or mixing of data sets. For this reason, it is always a good practice to begin any statistical analysis by performing an analysis of descriptive statistics. The following example illustrates these points.
Give the student time to describe each data set before moving to the next slide
Solution:
Quiz 1: all grades are identical; each student made 90 in the first quiz. Is this possible?
Quiz 2: most grades range from 77 to 88, but two grades seem exceptional, 30 and 99.
Quiz 3: grades seem to be split into two groups: a group in the high 60s and low to mid 70s and another group in the high 80s or above
Quiz 4: grades seem to reflect a typical quiz grades with few students making relatively low grades, 76 and 78; few students making high grades, 95 and 96; and the majority are in the intermediate range (82, 84, 85, 86, 87, 88).
This example is an over simplification of a frequency diagram and should be considered only for illustration of the concept of graphing frequency. In the next slide, the example will illustrate a typical Histogram.
Discuss this with students without detailed description of how you construct a histogram as this will be covered in Chapter 3.
The frequency distribution here simply illustrates two key points:
The bulk of the data concentrating from 130 to 170
Outliers or extreme values
We will learn more about Probability in Chapters 4, 5, and 6
The most common sampling techniques are:
Random sampling
Stratified sampling
Cluster sampling
Random sampling- In random sampling, every element in the population will have the same opportunity of being drawn in the sample. For example, in a class room of 40 students, we can draw a representative sample of 10 students by writing each student name or identification number on a post card, shuffling the 40 cards, placing them in a box, and blindly withdraw 10 cards representing the sample. This is the physical approach to perform random sampling. In Chapter 7 we will discuss how to perform random sampling using Excel® data analysis in a matter of seconds.
Stratified sampling- In some situations, complete random sampling may not be possible due to the difficulty of accessing some portions of the population. In this case, we may use the so-called ‘stratified sampling’. In this approach, we simply divide the population into pre-specified categories, or strata, and draw samples from each category in random fashion and with size proportional to the category size in the population. This approach ensures that each category of the population is represented in the sample. For example, suppose we have a small city population of 100,000 people; 80,000 white, 15,000 African American, and 5,000 Hispanic. If these ethnic groups are intermingled in the city, we should then use stratified sampling to ensure that each group is represented in the sample and in proportional-weight allocated fashion. Accordingly, if the sample size is 100 people, 80 people should be selected randomly from the white people group, 15 from the African American group, and 5 people from the Hispanic community group. Figure 1.6 illustrates the differences between random sampling and stratified sampling.
Cluster Sampling- This is used when the population itself consists of natural or pre-designed clusters, with each cluster having its own unique features. For example, a company may have three or four factories in different areas producing the same product. In this case, a cluster sample should be a sample consisting of components from each factory or cluster.
In general, there are two main sources of variability in a process: (i) inherent variability, and (ii) induced variability. Inherent variability normally results from natural sources or from some form of limitation in any process element (material, machine, people, etc). For example, humans are inherently different by virtue of their creation. This is a positive variability as no one can conceive a world with all human being alike in everything. Even twins are inherently different in so many aspects. Some statistics suggest that there are roughly over 120 million living human twins in the world today, and nearly 10 million monozygotic twins (twins derived from a single fertilized egg zygote, or human identical twins). Similarly, plants, animals, and all types of God-created things are inherently different.
Induced variability is defined by the differences between things, which might otherwise be thought to be alike because they were produced under conditions, which were as nearly alike as it is possible to make them. This type of variability can be observed in products that are made to be identical (coffee makers, refrigerators, cars, etc.), yet they exhibit some differences resulting from human or manufacturing errors. For example, a manufacturer producing ceramic tiles set the machines and the manufacturing operations to produce the same dimensions of ceramic tiles every day, yet these tiles may encounter some variability in thickness, area, and even color in the absence of good control of the manufacturing process.
A variable represents a phenomenon or a characteristic of a process or system under investigation in which different elements are associated with different values. There are three main types of variables:
Continuous variables
Discrete variables
Special variables
A continuous variable can take any value in a specified interval falling within its plausible range. For example, the diameter of a fine metal rod may take a value of 40, 40.25, 40.75 or 41 millimeter. These data imply continuity in the variable values. Similarly, variables such as student grades, people income, ages, material strength, thickness, length, temperature, concentration, and weight are all of continuous type. In one sentence, continuous variables are variables whose sample spaces (or all possible outcomes) are intervals.
Discrete variables are variables whose sample spaces are sets of discrete points. They are associated with situations in which a count is made such as the count of pass or fail, stop or go, and good or bad. Educators use discrete variables to measure performance parameters such as the number of failing students, the number of drop-out students, and the number of absences. Engineers and manufacturers use discrete values to describe many parameters such as the number of defects, the number of rejects, and the number of machine failures. A discrete variable can take only integer values (e.g. 0, 1, 2, ...). Specifying the variable as being continuous or discrete is critical for performing a reliable statistical analysis. In Chapters 5 and 6, we will discuss probability distributions for discrete and continuous variables, respectively.
Special variables are those that are not easily described in quantitative manner. This type of variables is typically initiated from nonnumeric sources. For example, when we describe the comfort resulting from wearing some clothing, we may use terms such as smooth, flexible, harsh, rough, soft and stiff. In order to use statistics to describe the extent of comfort, we should first express these characterizations in some numeric format (e.g. establishing a numeric index). Depending on the structure of this format, the special variable can then be treated as a discrete or a continuous variable.
Another way to classify variables is by dividing them into two main categories: ‘quantitative’ and ‘qualitative’ variables. Quantitative variables are those that can be measured numerically. These include income, weight, temperature, grades, etc. Qualitative variables are those that cannot be described numerically but may be converted into some kind of numerical codes depending on their natures. Categorical variables are just another name for qualitative variables. For example, when we categorize students as freshman, sophomore, junior, and senior, we are in essence using qualitative variable. If we have to include these categories in a data set, it would be useful to associate them with some codes such as ‘1’ for freshman, ‘2’ for sophomore, ‘3’ for junior and ‘4’ for senior. This is a key step in dealing with this type of variables in a statistical analysis. Other qualitative variables include: color, taste, style, pain level, gender, race, and job position.
Data can talk! Or data often try to tell us something!! But, to what extent data can truly tell us what we need to know? The key to answering this question stems from the level of measurement of the data under consideration. In order to describe a variable we have to use associated data and in order to use the data in a meaningful analysis, we must realize the data level of measurement. For example, if we have data representing male and female and we decided to give male, say a code number 1, and female a code number 2, then taking the average of these two numbers, which will be 1.5, would mean nothing and will serve no purpose in our analysis. The reason, this average exhibits no meaning is that the data used is at its lowest level of measurement or at what is generally called the nominal level of measurements; it is only a code with the numbers 1 and 2 being used to distinguish males from females in the data set in case we need to analyze each category separately or compare between the two in terms of other variables such as income, weight, ages, etc.
In general, there are four basic levels of measurements as illustrated by the pyramid of Figure 1.7, where the data resolution and meaningful aspects progressively increase as we move from the base of the pyramid to the top. The lowest level is the nominal measurement data. At this level, variables are typically qualitative, mutually exclusive and exhaustive. They typically have no meaningful magnitude or logical order and we can do very little with this type of data in terms of statistical analysis except for data organization and information management. Examples of nominal measurements include: eye or skin color, gender, people names, and religion.
Ordinal measurement level data are those used for sequencing or ranking. For example, you may wish to rate your professor after completing a statistics course according to a particular trait, say as excellent, very good, good, fair, and poor. Numbers assigned for these ratings may be: 0 (poor), 1(fair), 2(good), 3(very good), and 4(excellent). These numbers, though at equal distances, do not imply a rating magnitude; meaning, a rating of 4 is not twice as much a rating of 2. However, the order is meaningful as the higher the number the better the rating. Again, the data here are mutually exclusive and exhaustive.
Interval measurement level data exhibit all the characteristics of ordinal measurement data but in addition, the difference between values is constant. In other words, they exhibit constant increment that has some meaning. Temperature is the classic example of this type of measurements. Twenty degrees is certainly the same anywhere on the Fahrenheit temperature scale, but 60 degrees is not twice as hot as 30 degrees. Interval data measurement has no value-related zero. For example, on a temperature scale, zero Fahrenheit does not mean an absence of temperature or heat; it only implies cold and if we change the units to Celsius, this zero will roughly become -18 degree. Again, interval measurement data are mutually exclusive, exhaustive, and equal differences in the variable are represented by equal differences in the measurements. Other examples of interval measurements include shoe sizes and IQ scores.
Ratio measurement level data represent the most commonly used data in statistical analysis. Virtually all quantitative data are essentially ratio data. They have meaningful magnitude, order, and ratios between numbers. The zero value of ratio measurements means an absence of the characteristic magnitude. These include: weight, length, strength, income, profit, area, etc.
Solution:
a. Interval b. Ratio c. Nominal d. Ordinal e. Ratio
A normal data set will yield a distribution called ‘the Gaussian’ or the ‘normal distribution’. In its ideal shape, the normal distribution is characterized by a symmetrical bell-shaped distribution as shown here. It has equal values of mean, mode, and median and they are all located at the center of the distribution. These are the key parameters that describe the center of the data, and they are collectively called ‘measures of central tendency’. The spread of the data is described by the so called ‘measures of data dispersion’. These include the range, and the standard deviation. In Chapter 2, we will discuss how these measures are calculated.
The normal distribution is fully described mathematically by two parameters: the mean of the data, and the standard deviation, which is a measure of the deviation of all values from the center. What students need to know at this point is the shape of the normal curve and how this shape relates to the distribution of values of a variable.
It will also be useful that students begin to describe data by sketching a rough normal curve illustrating the variable values and their distribution. This point is illustrated in the example in the next slide.
Example Figure 1.9 shows the distributions of the area of ceramic tiles of two types of tiles produced by two different suppliers. The tiles are advertized at the same price and with the same nominal specifications by the two suppliers. Describe the normal distributions of area for these two types of tiles.
This example is provided without data, only the shapes of the distributions of the ceramic tile area, to assist students in generally describing a normal distribution. As you can see from the two distributions, both type A and type B tiles have the same mean value of area. They also have the same mode (the most frequent value) as indicated by the peak of the distributions. In addition, they both have the same median (This is the value right in the middle of all values). These are all measures of central tendency. The difference between the distributions is in the width of the distribution; this reflects the variability or the dispersion in the data with the wider distribution indicating higher variability or larger range of values. Since it is important to buy tiles that are virtually identical in area (to avoid mismatch of tiles), type A should be the tiles of choice as they exhibit much smaller variability in area than type B.
The variable in the previous example (ceramic tile area) represents the case of ‘nominal-the-best-variable’; that is a variable, which should exhibit an exact value, not too low and not too high.
Other variable types are the ‘larger-the-better’ type, and the ‘smaller-the-better-type’. The Figure here shows expected frequency distributions of these types of variables. Larger the better type variables include: student’s grade, people income, and machine efficiency. Smaller-the better type variables include: number of defects, car accidents, and number of failing students.
In practice, we often deal with a very large population of a process (machine, people, material, etc.). Typically, our goal is to determine the performance of this population through examining one or more population attributes. For example, a company producing millions of electronic items, say DVD players wishes to examine important attributes such as picture resolution and audio performance. Since testing these attributes for the entire population of DVD players is time consuming and can be very costly, the company draws a random sample of few DVD players from the population, tests the two attributes in the sample, produces sample data, and analyzes the data to determine key statistics such as the mean and the range. These systematic steps produce sample descriptive statistics. In doing so, the company’s interest is not the sample per se but rather the whole population of electronic items from which the sample was drawn. In order to estimate population parameters from sample statistics, we use ‘inferential statistics’ as shown here.