4. six sigma descriptive statistics
1. QUALITY TOOLS &
TECHNIQUES
SIX SIGMA: STATISTICS
By: -
Hakeem–Ur–Rehman
MS-TQM, M.I.O.M(Operations Research)
Certified Six Sigma Black Belt (Singapore)
Lead Auditor ISO 9001 (UK)
IQTM–PU
3. KEY TERMS
Population (Universe)
All Items of Interest
Sample
Portion of Population
Parameter
Summary Measure about Population
Statistic
Summary Measure about Sample
• P in Population & Parameter
• S in Sample & Statistic
4. TYPES OF DATA
Attribute Data (Qualitative)
Is always binary, there are only two possible values (0, 1)
1. Yes, No
2. Success, Failure
3. Go, No Go
4. Pass, Fail
Variable Data (Quantitative)
Discrete (Count) Data:
Can be categorized in a classification and is based on counts.
1. Number of defects
2. Number of defective units
3. Number of Customer Returns
Continuous Data:
Can be measured on a scale, it has decimal subdivisions that are
meaningful
1. Time, Pressure
2. Money
3. Material feed rate
5. DISCRETE & CONTINUOUS VARIABLES
DISCRETE VARIABLE | POSSIBLE VALUES FOR THE VARIABLE
The number of defective needles in boxes of 100 diabetic syringes | 0, 1, 2, 3, …, 100
The number of individuals in groups of 30 with a Type–A Personality | 0, 1, 2, 3, …, 30
The number of surveys returned out of 300 mailed in a customer satisfaction study | 0, 1, 2, 3, …, 300
CONTINUOUS VARIABLE | POSSIBLE VALUES FOR THE VARIABLE
The length of prison time served for individuals convicted | All the real numbers between ‘a’ and ‘b’, where ‘a’ is the smallest amount of time served and ‘b’ is the largest
The household income for households with incomes less than or equal to $30,000 | All the real numbers between ‘a’ and $30,000, where ‘a’ is the smallest household income in the population
6. DEFINITIONS OF SCALED DATA
Understanding the nature of data and how to represent it can affect the types of statistical tests
possible.
1. NOMINAL SCALE:
“Numbers representing nominal data can be used only to classify or categorize”;
Data consists of Names, Labels, or categories.
A player with number 30 is not more of anything than a player with number 15,
and is certainly not twice whatever number 15 is.
Few examples of Nominal Data are:
Sex, Religion, Geographic Location, Place of Birth, employee ID Numbers
etc.
2. ORDINAL SCALE:
“Ordinal Level data measurement is higher than the nominal level. In addition to
the nominal level capabilities, Ordinal level measurement can be used to rank or
order objects”.
Ordinal data categorize people or objects, or rank items. Nominal and Ordinal data are non–metric data and are sometimes referred to as qualitative data.
EXAMPLES:
“AUTOMOBILE SIZES” Subcompact, compact, intermediate, full size, luxury
“PRODUCT RATING” Poor, Good, Excellent
“CUSTOMER SATISFACTION” Very poor, Poor, Neither good nor bad, Good, Excellent.
7. DEFINITIONS OF SCALED DATA (Cont…)
3. INTERVAL SCALE:
“The distances between consecutive numbers have meaning and the data are always
numerical”.
FOR EXAMPLE, when measuring temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable.
EXAMPLE:
IQ Scores of students in Black Belt Training:
100 … (the difference between scores is measurable and has meaning, but a difference of 20 points between 100 and 120 does not indicate that one student is 1.2 times more intelligent)
4. RATIO SCALE:
“Data that can be ranked and for which all arithmetic operations including division can
be performed. (Division by Zero is of course excluded) Ratio level data has an
absolute zero and a value of zero indicates a complete absence of the characteristic of
interest”.
FOR EXAMPLE,
Grams of fat consumed per adult in Pakistan
0 … (if person A consumes 25 grams of fat and person B consumes 50 grams, we can say that person B consumes twice as much fat as person A. If person C consumes ZERO grams of fat per day, we can say there is a complete absence of fat consumed on that day. Note that a ratio is interpretable and an absolute zero exists.)
OTHER EXAMPLE:
Production Cycle time, Work measurement time, Number of trucks sold, Number
of employees etc.
8. DEFINITIONS OF SCALED DATA (Cont…)
TYPE OF DATA | OPERATOR | DESCRIPTION | EXAMPLES
Nominal | =, ≠ | Categories | Types of defects, types of colors
Ordinal | <, > | Rankings | Severity of defects: critical, major, minor
Interval | +, - | Differences but no absolute zero | Temperature of a ship
Ratio | / | Absolute zero | Pressure, speed
12. DESCRIPTIVE ANALYSIS OF QUALITATIVE DATA
QUALITATIVE DATA
TABLES: One–Way Table, Two–Way Table, …, N–Way Table
GRAPHS: Bar Chart, Pie Chart, Multiple Bar Chart, Component Bar Chart
NUMBERS: Percentages
13. DESCRIPTIVE ANALYSIS OF QUANTITATIVE DATA
QUANTITATIVE DATA
TABLES & GRAPHS: Frequency Distribution, Stem and Leaf Plot, Histogram, Box and Whisker’s Plot
NUMBERS:
Center: Mean, Median, Mode, Geometric Mean, Harmonic Mean, Trimmed Mean
Important Points: Median, Quartiles, Deciles, Percentiles
Variation: Range, Inter-Quartile Range, Variance, Standard Deviation
Distribution: Skewness, Kurtosis
14. MINITAB: AN INTRODUCTION
BEGINNING AND ENDING A MINITAB SESSION:
To start a Minitab session from the menu, select
Start > All Programs > MINITAB 15 English > MINITAB 15 English
To exit Minitab, select
File > Exit
When you first enter
Minitab, the screen will
appear as in the figure:
The session window contains
comments, tables, descriptive
summaries, and inferential
statistics.
The data window consists of
all the data and variable names.
Graph windows contain high
resolution graphs.
[Figure: Minitab start-up screen with the SESSION WINDOW and the DATA WINDOW labeled]
15. DESCRIPTIVE ANALYSIS USING
MINITAB
In the Minitab Data
folder, open the
worksheet Pulse.mtw
Conduct Descriptive
Analysis on the pulse1
data.
16. MEASURES OF LOCATION
Mean is:
Mean is the average of a group of numbers
Applicable for interval and ratio data
Not applicable for nominal or ordinal data
Affected by each value in the data set, including extreme values
Computed by summing all values in the data set and dividing the sum by the number of values in the data set
Stat > Basic Statistics > Display Descriptive Statistics…
Select; Statistics (and choose appropriate measures)
Select; Graphs > Histogram of data, with normal curve
SAMPLE: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
POPULATION: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
[Minitab output: Descriptive Statistics: Pulse1]
17. MEASURES OF LOCATION
Median is:
Median - middle value in an ordered array of numbers.
For an array with an odd number of terms, the median is the middle number
For an array with an even number of terms the median is the average of the
middle two numbers
Trimmed Mean is a:
Compromise between the MEAN and MEDIAN
1. The Trimmed Mean is calculated by eliminating a specified percentage of the
smallest and largest observations from the data set and then calculating the
average of the remaining observations.
2. Useful for data with potential extreme values.
MODE:
Mode - the most frequently occurring value in a data set
Applicable to all levels of data measurement (nominal, ordinal, interval, and
ratio)
Can be used to determine what categories occur most frequently
Bimodal – In a tie for the most frequently occurring value, two modes are listed
Multimodal -- Data sets that contain more than two modes
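The measures of location described above can be checked on a small sample. The following is a minimal sketch in Python using only the standard library; the pulse readings are hypothetical (they are not the Pulse.mtw data), and the helper `trimmed_mean` is an illustrative name that removes the stated percentage from each end of the ordered data.

```python
from statistics import mean, median, multimode


def trimmed_mean(values, percent):
    """Average after dropping `percent` % of the observations from EACH end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)  # observations removed per tail
    return mean(ordered[k:len(ordered) - k])


# Hypothetical pulse readings; 140 is a deliberate extreme value.
pulse = [62, 64, 68, 68, 70, 72, 74, 76, 78, 140]

print("mean        :", mean(pulse))              # pulled upward by the extreme value
print("median      :", median(pulse))            # average of the two middle values
print("trimmed mean:", trimmed_mean(pulse, 10))  # drops 62 and 140 before averaging
print("mode(s)     :", multimode(pulse))         # list, so bimodal data shows two values
```

With one extreme value in the sample, the mean moves away from the median while the trimmed mean stays close to it, which is the compromise the slide describes.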
18. MEASURES OF VARIATION
RANGE:
The difference between the largest and the smallest values
in a set of data
Advantage – easy to compute
Disadvantage – is affected by extreme values
INTER–QUARTILE RANGE:
Inter-quartile Range - range of values between the first and
third quartile
Range of the “middle half”; middle 50%
Inter-quartile Range – used in the construction of box and
whisker plots
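STANDARD DEVIATION:
$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean and $n$ is the sample size
VARIANCE:
$S^2$ = Square of S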
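As a rough illustration of these measures of variation, the sketch below computes the range, inter-quartile range, sample variance and sample standard deviation with Python's standard library. The data are hypothetical. Quartile conventions differ between tools, so the quartiles here (the default "exclusive" method of `statistics.quantiles`) are an assumption and may not match every package exactly.

```python
from statistics import quantiles, stdev, variance

# Hypothetical measurements, used only for illustration.
data = [4, 7, 8, 9, 10, 12, 13, 17]

data_range = max(data) - min(data)   # largest value minus smallest value
q1, q2, q3 = quantiles(data, n=4)    # first quartile, median, third quartile
iqr = q3 - q1                        # spread of the middle 50% of the data
s2 = variance(data)                  # sample variance (divides by n - 1)
s = stdev(data)                      # sample standard deviation, sqrt of s2

print("range:", data_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("variance:", s2)
print("standard deviation:", s)
```

The range depends only on the two most extreme observations, while the IQR ignores them, which is why the IQR is the less sensitive of the two.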
19. SHAPE OF THE DISTRIBUTION
Skewness: indicator used in distribution analysis as a sign of asymmetry and
deviation from a normal distribution.
Skewness > 0 - Right skewed distribution - most values are concentrated
on left of the mean, with extreme values to the right.
Skewness < 0 - Left skewed distribution - most values are concentrated on
the right of the mean, with extreme values to the left.
Skewness = 0 - mean = median, the distribution is symmetrical around
the mean.
Kurtosis - indicator used in distribution analysis as a sign of flattening or
"peakedness" of a distribution.
Kurtosis > 3 - Leptokurtic distribution, sharper than a normal distribution,
with values concentrated around the mean and thicker tails. This means high
probability for extreme values.
Kurtosis < 3 - Platykurtic distribution, flatter than a normal distribution with
a wider peak. The probability for extreme values is less than for a normal
distribution, and the values are wider spread around the mean.
Kurtosis = 3 - Mesokurtic distribution - normal distribution for example.
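The shape indicators above can be sketched from central moments. This is a minimal, moment-based version, assuming the plain (unadjusted) definitions in which a normal distribution has kurtosis 3, as in the thresholds listed above. Statistical packages often apply small-sample adjustments or report kurtosis minus 3, so their output may differ from these values. The data sets are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over the data."""
    m = sum(values) / len(values)
    return sum((v - m) ** order for v in values) / len(values)


def skewness(values):
    """Third central moment scaled by the standard deviation cubed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Fourth central moment scaled by the variance squared (normal = 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]          # hypothetical, evenly spread
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical, one large value on the right

print("symmetric   :", skewness(symmetric), kurtosis(symmetric))
print("right skewed:", skewness(right_skewed), kurtosis(right_skewed))
```

The evenly spread sample has zero skewness and a kurtosis below 3 (flatter than normal), while the sample with one large value on the right has positive skewness.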
20. INTRODUCTION TO GRAPHING
The purpose of Graphing is to:
1. Identify the shape of the distribution of data
2. Locate the Average, Spread and Outliers of the Distribution
3. Compare the shapes and variation of different variables
4. Observe the trends, drifts and shifts in the collected data
Here we will discuss …
Histogram
Box Plots (Box & Whisker’s Plot)
21. INTRODUCTION TO GRAPHING
(Cont…)
When you start Minitab–15, if your toolbars do not look like the figure below, do the following to get the tools where you need them. Click on Tools > Customize, then the Toolbars tab.
In the dialog box that opens, check and uncheck as needed so that it matches the figure below.
22. WHAT IS A HISTOGRAM?
A histogram is a summary graph showing how many of the measured data points fall within each of various class intervals.
WHAT QUESTIONS DOES THE ‘HISTOGRAM’ ANSWER?
What distribution (center, variation and shape) does the data have?
Does the data look symmetric or is it skewed to the left or right?
Does the data contain outliers?
Is the Process within Specification Limits?
23. GUIDELINES FOR
CONSTRUCTING A HISTOGRAM
1. Determine the number of data points in the data set. Call this number ‘n’.
2. Determine the range, R, of the values in the data set.
3. Determine the number of classes; there are no set rules; however, there are
some rules of thumb that can be used.
a) # of Classes = 1 + 3.3 log(n), with log taken to base 10
b) The logarithm (base 2) rule.
# of Classes = K = [log₂ n] + 1 = [(log n) / (log 2)] + 1
c) Following table [Goal 88] gives a range of classes.
# of Classes = K =
4. Determine the class width by dividing the range (R) by the number of classes
(K) and rounding up.
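The two formula-based rules of thumb and the class-width step can be sketched as follows. The function names, the sample size of 50 and the range of 125 are hypothetical. Rule (a) is computed with a base-10 logarithm, and the square brackets in rule (b) are read here as "integer part", which is an assumption about the slide's notation.

```python
import math


def classes_rule_a(n):
    """Rule (a): 1 + 3.3 * log10(n); normally rounded to a whole number."""
    return 1 + 3.3 * math.log10(n)


def classes_rule_b(n):
    """Rule (b): integer part of log2(n), plus 1."""
    return math.floor(math.log2(n)) + 1


def class_width(data_range, k):
    """Range divided by the number of classes, rounded up."""
    return math.ceil(data_range / k)


n = 50             # hypothetical number of data points
data_range = 125   # hypothetical range R

k = classes_rule_b(n)
print("rule (a):", round(classes_rule_a(n), 2))
print("rule (b):", k)
print("class width:", class_width(data_range, k))
```

The rules give a starting point rather than a fixed answer; for n = 50 they suggest about six or seven classes.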
24. THE HISTOGRAM
Open Bears.MTW
You will create a frequency histogram of the variable Age.
25. THE HISTOGRAM (Cont…)
CONTROLLING HISTOGRAMS:
What you get in this case is a histogram with 10 classes.
To get the right number of classes, open the "X Scale" editing dialog box and click on the "Binning" tab.
For "Interval Type" click on "Cutpoint" and for "Interval Definition" click on "Number of intervals:" and change it to 6. Now click "OK".
This graph still does not conform to standards because the class width and class boundaries were not calculated according to the rules. To get what we want, we must define the class boundaries (what Minitab calls "cutpoints") ourselves.
The minimum value of the data is 8 and the maximum is 177. Our formula for the class width with 6 classes is (177–8)/6 = 28.17..., which rounds up to 29. (Remember: always round up unless the division yields an integer.) If we choose 8 as the lowest class limit, then the lowest class boundary will be 7.5, and the rest will be 36.5, 65.5, 94.5, 123.5, 152.5 and 181.5.
Now get back into the "Binning" dialog box, click on "Midpoint/Cutpoint positions:", delete the existing cutpoints, then enter the first 2 class boundaries listed above into the box (separate with spaces, not commas) and click "OK".
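The class boundaries for this example can be reproduced with a few lines of Python. This is a sketch under one assumption: the ages are recorded as whole numbers, so each boundary sits half a unit below a class limit. The variable names are illustrative.

```python
import math

lowest, highest, k = 8, 177, 6             # minimum, maximum, number of classes

width = math.ceil((highest - lowest) / k)  # 169 / 6 is not an integer, so round up to 29
first_boundary = lowest - 0.5              # half a unit below the lowest class limit
boundaries = [first_boundary + i * width for i in range(k + 1)]

print("class width:", width)
print("boundaries :", boundaries)
# Space-separated form, as expected by the cutpoint positions box
print(" ".join(str(b) for b in boundaries))
```

Six classes need seven boundaries, and the last one (181.5) lies above the maximum of 177, so every observation falls inside a class.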
26. THE HISTOGRAM (Cont…)
EXERCISE:
The data in C:\Program Files\Minitab 15\English\Sample Data\Grades.MTW consists of verbal and math SAT scores and corresponding GPAs.
i. Create a frequency histogram with 7 classes of
the verbal SAT scores.
ii. Create a relative frequency histogram with 7
classes of the verbal SAT scores.
iii. Create a frequency polygon with 7 classes of the
verbal SAT scores.
27. BOX & WHISKER’S PLOT
Use a Box & Whisker’s Plot to
assess and compare
distribution characteristics
such as median, range, and
symmetry, and to identify
outliers.
A minimum of 10
observations should be
included in generating the
Box Plot.
28. BOX & WHISKER’S PLOT USING MINITAB
CONSTRUCTING BOX PLOT (One Y):
You want to examine the overall
durability of your carpet products.
Samples of the carpet products are
placed in four homes and you
measure durability after 60 days.
Create a Box Plot to examine the
distribution of durability scores.
Open worksheet Carpet.mtw
Choose Graph > Boxplot
Under One Y, choose Simple. Click OK
In Variable, enter Durability. Click OK
29. BOX & WHISKER’S PLOT USING MINITAB
Constructing Box Plot: (One Y–with Groups)
You want to assess the durability of
four experimental carpet products.
Samples of the carpet products are
placed in four homes and you
measure durability after 60 days.
Create a box plot with median labels
and color-coded boxes to examine the
distribution of durability for each
carpet product.
Open the worksheet CARPET.MTW
31. BOX & WHISKER’S PLOT USING MINITAB
Constructing Box Plot: (One Y–with Groups)
(Cont…)
Interpreting the results:
Median durability is highest for Carpet 4 (19.75). However, this product also demonstrates
the greatest variability, with an inter-quartile range of 9.855. In addition, the distribution is
negatively skewed, with at least one durability measurement of about 10.
Carpets 1 and 3 have similar median durabilities (13.52 and 12.895, respectively). Carpet
3 also exhibits the least variability, with an inter-quartile range of only 2.8925.
Median durability for Carpet 2 is only 8.625. This distribution and that of Carpet 1 are
positively skewed, with inter-quartile ranges of about 5-6.
32. BOX & WHISKER’S PLOT
EXERCISE # 1:
A random sample of 50 observations on the mileage per gallon of a particular brand of gasoline is shown:
33.2 29.1 34.5 32.6 30.7 34.9 30.2 31.8 30.8 33.5
29.4 32.2 33.6 30.4 31.9 32.8 26.8 29.2 31.8 27.4
36.5 38.1 30.0 29.5 36.0 31.5 27.4 30.4 28.4 31.8
29.8 34.6 32.3 28.2 27.5 28.8 28.4 27.7 27.8 30.5
28.5 28.5 27.5 28.6 29.1 26.9 34.2 28.5 34.8 30.5
Develop a Box & Whisker’s Plot to analyze the data.
EXERCISE # 2:
The following data represent the percentage of calories that come from fat
for burgers and chicken items from a sample of fast food chains.
BURGER
43 51 48 47 51 50 55 55 59 57
CHICKEN
60 54 53 57 57 46 45 56 57
Construct Box & Whisker’s Plots to analyze the data.
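Before drawing the plots for Exercise # 2, the numbers behind each box can be worked out directly. The sketch below uses the BURGER and CHICKEN data listed above. Two conventions are assumptions, not something the slides specify: quartiles use the default "exclusive" method of Python's `statistics.quantiles`, and points beyond 1.5 × IQR from the box are flagged as outliers.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Quartiles, IQR, whisker ends and outliers for one box plot."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr   # assumed 1.5 * IQR outlier rule
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }


burger = [43, 51, 48, 47, 51, 50, 55, 55, 59, 57]
chicken = [60, 54, 53, 57, 57, 46, 45, 56, 57]

for name, sample in (("BURGER", burger), ("CHICKEN", chicken)):
    print(name, box_plot_summary(sample))
```

Comparing the two summaries side by side shows the median, spread and symmetry of each group in the same way the paired box plots would.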
33. PROBABILITY
DISTRIBUTION OF DATA
The process that generates the data is known as the Distribution of the Data.
For Example:
In the manufacturing sector, measurements such as length, diameter, etc. usually follow a NORMAL Distribution
In the service sector, say banks, the customer waiting time follows an EXPONENTIAL Distribution
In the service sector, say banks, the number of customers arriving follows a POISSON Distribution
34. NORMAL DISTRIBUTION
Characteristics of the normal distribution:
Continuous distribution - Line does not break
Symmetrical distribution - Each half is a mirror of the other half
Asymptotic to the horizontal axis - it does not touch the x axis and goes on
forever
Unimodal - means the values mound up in only one portion of the graph
Area under the curve = 1; total of all probabilities = 1
Normal distribution is characterized by the mean and the Std Dev
Each pair of values of μ and σ produces a different normal distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$
Where:
μ = mean of X
σ = standard deviation of X
π = 3.14159...
e = 2.71828...
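The normal density can be evaluated directly from its mean and standard deviation. The sketch below writes the density out by hand and compares it with `statistics.NormalDist` from Python's standard library; the values μ = 30 and σ = 2 are hypothetical.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


mu, sigma = 30.0, 2.0   # hypothetical process mean and standard deviation

for x in (26.0, 28.0, 30.0, 32.0, 34.0):
    by_hand = normal_pdf(x, mu, sigma)
    library = NormalDist(mu, sigma).pdf(x)
    print(x, round(by_hand, 5), round(library, 5))
```

The printed heights are equal at points the same distance either side of the mean (symmetry), largest at the mean (unimodal), and small but never zero in the tails (asymptotic to the horizontal axis).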
35. STANDARD NORMAL DISTRIBUTION
A normal distribution with
a mean of zero, and
a standard deviation of one
Z Formula
standardizes any normal
distribution
Z Score
computed by the Z Formula
the number of standard
deviations which a value
is away from the mean
$Z = \frac{X - \mu}{\sigma}$, with μ = 0 and σ = 1 for the standard normal distribution
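A short sketch of the Z formula in use: a value is converted to the number of standard deviations it lies from the mean, and the standard normal distribution then gives the proportion of values below it. The numbers (value 34, mean 30, standard deviation 2) are hypothetical, and `z_score` is an illustrative helper name.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies away from the mean."""
    return (x - mu) / sigma


z = z_score(34.0, 30.0, 2.0)     # hypothetical value, mean and standard deviation
p_below = NormalDist().cdf(z)    # standard normal: mean 0, standard deviation 1

print("z =", z)
print("proportion below:", round(p_below, 4))
```

Because the Z formula standardizes any normal distribution, the same standard normal table or function serves every combination of mean and standard deviation.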
36. NORMALITY TEST FROM
GRAPHIC SUMMARY OF DATA
Open the worksheet CRANKSH.MTW
If Sk < 0, the distribution is
negatively skewed (skewed to
the left).
If Sk = 0, the distribution is
symmetric (not skewed).
If Sk > 0, the distribution is
positively skewed (skewed to the
right).
The value of Skewness shows
data is not normal.
P-value is less than 5% (the value of alpha, the level of significance), which shows the data is not normal
If the P-value is > alpha, the data can be treated as Normal; otherwise it is treated as Not-Normal
37. NORMALITY TEST (Cont…)
NORMALITY TEST:
o Generates a normal probability plot and performs a hypothesis test to examine whether or not the observations follow a normal distribution.
For the normality test, the hypotheses are,
o Ho: Data follow a normal distribution Vs H1: Data do not follow a normal distribution
o If the ‘P’ value is > alpha, do not reject the Null Hypothesis (Ho)
NORMALITY TEST:
In an operating engine, parts of the crankshaft move up
and down. AtoBDist is the distance (in mm) from the
actual (A) position of a point on the crankshaft to a
baseline (B) position. To ensure production quality, a
manager took five measurements each working day in a
car assembly plant, from September 28 through October
15, and then ten per day from the 18th through the 25th.
You wish to see if these data follow a normal distribution, so you use the Normality test.
Open the worksheet CRANKSH.MTW
38. NORMALITY TEST (Cont…)
INTERPRETING THE RESULTS:
The graphical output is a plot of normal probabilities versus the data. The data depart from the fitted line most evidently in the extremes, or distribution tails.
The Anderson–Darling test’s p-value indicates that, at α levels greater than 0.022, there is evidence that the data do not follow a normal distribution.
There is a slight tendency for these data to be lighter in the tails than a normal distribution because the smallest points are below the line and the largest point is just above the line.
A distribution with heavy tails would show the opposite pattern at the extremes.
39. SCATTER PLOT
WHAT IS A SCATTER PLOT?
A scatter plot is a graphical presentation, by a simple X-Y plot, of any possible relationship between two variables, which may or may not be dependent on each other.
41. SCATTER PLOT
EXAMPLE: You are interested in how well your
company's camera batteries are meeting customers'
needs. Market research shows that customers become
annoyed if they have to wait longer than 5.25 seconds
between flashes.
You collect a sample of batteries that have been in use
for varying amounts of time and measure the voltage
remaining in each battery immediately after a flash
(VoltsAfter), as well as the length of time required for
the battery to be able to flash again (flash recovery time,
FlashRecov). Create a scatter plot to examine the
results. Include a reference line at the critical flash
recovery time of 5.25 seconds.
Open the worksheet BATTERIES.MTW
43. SCATTER PLOT
INTERPRETING THE RESULTS:
As expected, the lower the voltage in a battery after a flash, the
longer the flash recovery time tends to be.
The reference line helps to illustrate that there were many flash
recovery times greater than 5.25 seconds.
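The two observations read off the scatter plot can also be checked numerically. The sketch below uses hypothetical voltage and recovery-time pairs (not the BATTERIES.MTW data) to count the points above the 5.25-second reference line and to compute the Pearson correlation, which is negative when recovery time rises as voltage falls. The helper `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements, for illustration only.
volts_after = [1.45, 1.40, 1.35, 1.30, 1.25, 1.20, 1.15, 1.10]
flash_recov = [4.0, 4.3, 4.6, 5.0, 5.3, 5.6, 6.1, 6.5]

critical = 5.25   # reference line: critical flash recovery time in seconds
too_slow = sum(1 for t in flash_recov if t > critical)

print("points above the reference line:", too_slow)
print("correlation:", round(pearson_r(volts_after, flash_recov), 3))
```

The count corresponds to the points plotted above the reference line, and the sign of the correlation corresponds to the downward slope of the point cloud.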