Exploratory Data Analysis - Checking For Normality
- 2. Introduction
The method of exploring and summarising data differs according to the type of variable.
©drtamil@gmail.com 2012
- 3. Dependent/Independent
Independent variables: food intake, frequency of exercise
Dependent variable: obesity
- 5. Explore
• It is the first step in the analytic process.
• To explore the characteristics of the data.
• To screen for errors and correct them.
• To look for distribution patterns: normal distribution or not.
• The data may require transformation before further analysis using parametric methods.
• Or they may need analysis using non-parametric techniques.
- 6. Data Screening
• By running frequencies, we may detect inappropriate responses.
• How many in the audience have 15 children and are currently pregnant with the 16th?

PARITY
Parity   Frequency   Percent
1        67          30.7
2        44          20.2
3        36          16.5
4        22          10.1
5        21          9.6
6        8           3.7
7        3           1.4
8        7           3.2
9        5           2.3
10       3           1.4
11       1           .5
15       1           .5
Total    218         100.0
- 7. Data Screening
• See whether the data make sense or not.
• E.g. parity 10 but age only 25.
- 10. Data Screening
• By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data.

Descriptive Statistics
Variable               N     Minimum   Maximum   Mean    Std. Deviation
Pre-pregnancy weight   184   32        484       53.05   33.37
Valid N (listwise)     184
- 11. Interpreting the Box Plot
Parts of the box plot, from top to bottom: outlier, largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier, outlier.
• The whiskers extend up to 1.5 times the box length (the interquartile range) from both ends of the box and end at an observed value.
• Three times the box length marks the boundary between "mild" and "extreme" outliers.
• "Mild" outliers are shown as closed dots, "extreme" outliers as open dots.
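The fence rule above can be sketched in a few lines of Python. This is only an illustration: the function name, the quartile method (`method="inclusive"`) and the weights are assumptions, and statistical packages may compute quartiles slightly differently.

```python
from statistics import quantiles

def classify_outliers(values):
    """Label each value 'ok', 'mild' or 'extreme' using box-plot fences."""
    # Quartile conventions differ between packages; 'inclusive' is one choice.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box_length = q3 - q1  # the interquartile range
    labels = {}
    for v in values:
        distance = max(q1 - v, v - q3, 0)  # how far the value lies outside the box
        if distance > 3 * box_length:
            labels[v] = "extreme"
        elif distance > 1.5 * box_length:
            labels[v] = "mild"
        else:
            labels[v] = "ok"
    return labels

# Hypothetical weights in kg; 484 mimics a data-entry error.
weights = [45, 48, 50, 52, 53, 55, 57, 60, 62, 484]
result = classify_outliers(weights)
print(result[484])  # extreme
```

A value flagged this way is only a candidate error; it still has to be checked against the original record.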
- 12. Data Screening
• We can also make use of graphical tools such as the box plot to detect wrong data entry.
[Figure: box plot of pre-pregnancy weight, N = 184, y-axis 0 to 600; points labelled 73, 181, 211, 198 and 141 are flagged]
- 13. Data Cleaning
• Identify the extreme or wrong values.
• Check with the original data source, i.e. the questionnaire.
• If incorrect, make the necessary correction.
• Correction must be done before transformation, recoding and analysis.
- 14. Parameters of Data Distribution
• Mean: the central value of the data.
• Standard deviation: a measure of how the data scatter around the mean.
• Symmetry (skewness): the degree to which the data pile up on one side of the mean.
• Kurtosis: how peaked or flat the distribution is around the mean.
- 15. Normal distribution
• The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.
• The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.
• However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.
- 16. Normal distribution
• If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it (± 1 SD) includes about 68.3% of the observations;
• a range of two standard deviations above and two below (± 2 SD) includes about 95.4% of the observations; and
• a range of three standard deviations above and three below (± 3 SD) includes about 99.7% of the observations.
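These three percentages can be checked numerically. The sketch below uses the identity that the proportion of a Normal distribution within k standard deviations of the mean equals erf(k/√2); the function name is only illustrative.

```python
import math

def coverage_within(k):
    """Proportion of a Normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} SD: {coverage_within(k) * 100:.1f}%")
# within ±1 SD: 68.3%
# within ±2 SD: 95.4%
# within ±3 SD: 99.7%
```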
- 17. Normality
• Why bother with normality?
• Because it dictates the type of analysis that you can run on the data.
- 18. Normality-Why?
Parametric

Variable 1                 Variable 2                 Criteria                                                                                                                  Test
Qualitative, dichotomous   Quantitative               Normally distributed data                                                                                                 Student's t test
Qualitative, polytomous    Quantitative               Normally distributed data                                                                                                 ANOVA
Quantitative               Quantitative               Repeated measurement of the same individual & item (e.g. Hb level before & after treatment); normally distributed data   Paired t test
Quantitative, continuous   Quantitative, continuous   Normally distributed data                                                                                                 Pearson correlation & linear regression
- 19. Normality-Why?
Non-parametric

Variable 1                         Variable 2                 Criteria                                             Test
Qualitative, dichotomous           Quantitative               Data not normally distributed                        Wilcoxon rank-sum test or Mann-Whitney U test
Qualitative, polytomous            Quantitative               Data not normally distributed                        Kruskal-Wallis one-way ANOVA test
Quantitative                       Quantitative               Repeated measurement of the same individual & item   Wilcoxon signed-rank test
Quantitative, continuous/ordinal   Quantitative, continuous   Data not normally distributed                        Spearman/Kendall rank correlation
- 20. Normality-How?
• Explored graphically
  – Histogram
  – Stem & leaf
  – Box plot
  – Normal probability plot
  – Detrended normal plot
• Explored statistically
  – Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
  – Skewness (0)
  – Kurtosis (0): positive = leptokurtic, 0 = mesokurtic, negative = platykurtic
- 21. Kolmogorov-Smirnov
• In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came out with an approach for comparing distributions that did not make use of parameters.
• This is known as the Kolmogorov-Smirnov test.
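As a rough illustration of what the test measures, the sketch below computes the Kolmogorov-Smirnov statistic D, the largest gap between the empirical cumulative distribution of a sample and a Normal curve fitted to it. It stops at the statistic: the significance level with the Lilliefors correction needs special tables or simulation and is not attempted here. The function name is an assumption, and the sample is the 20-value age series used in the worked examples of this deck.

```python
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(values):
    """Largest gap between the empirical CDF and a fitted Normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # parameters estimated from the sample
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF steps from (i - 1)/n up to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
d = ks_statistic_normal(ages)
print(round(d, 3))
```

A small D means the sample follows the fitted curve closely; how small is "small enough" depends on the sample size.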
- 22. Skewness
• Skewed to the right indicates the presence of large extreme values.
• Skewed to the left indicates the presence of small extreme values.
- 23. Kurtosis
• For symmetrical distributions only.
• Describes the shape of the curve.
• Mesokurtic: average shaped.
• Leptokurtic: narrow & slim.
• Platykurtic: flat & wide.
- 24. Skewness & Kurtosis
• Skewness ranges from -3 to 3.
• The acceptable range for normality is skewness lying between -1 and 1.
• Normality should not be judged on skewness alone; the kurtosis measures the "peakedness" of the bell curve (see Fig. 4).
• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
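The rule of thumb can be sketched as follows. The formulas are the plain moment-based skewness and excess kurtosis, which is an assumption made for brevity; statistical packages usually apply small-sample adjustments, so their output differs slightly. The function names are illustrative.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 and 0 for a Normal curve)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - m) ** 4 for x in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

def within_rule_of_thumb(values, limit=1.0):
    """True when both skewness and kurtosis lie between -limit and +limit."""
    skew, kurt = skewness_and_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit

skew, kurt = skewness_and_kurtosis([1, 2, 3, 4, 5])
print(round(skew, 2), round(kurt, 2))
```

For the evenly spaced values 1 to 5 the skewness is 0 but the kurtosis is about -1.3 (flatter than Normal), so the check fails on kurtosis alone.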
- 26. Normality - Examples
Graphically
[Figure: histogram of Height; Mean = 151.6, Std. Dev = 5.26, N = 218]
- 27. Q-Q Plot
• This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
• If the distributional shapes differ, the points will plot along a curve instead of a line.
• The interest here is the central portion of the line: severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.
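A minimal sketch of how the points of a normal Q-Q plot can be produced: each ordered observation is paired with the standard normal quantile at its plotting position. The position (i - 0.5)/n is one common convention and an assumption here, as are the function name and the sample heights.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each ordered observation with its expected standard normal quantile."""
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, SD 1
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(xs, start=1)]

# Hypothetical heights in cm.
points = normal_qq_points([151, 148, 160, 155, 150])
for observed, expected in points:
    print(observed, round(expected, 2))
```

Plotting the observed values against the expected quantiles gives the Q-Q plot; points close to a straight line suggest normality.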
- 28. Normality - Examples
Graphically
[Figure: Normal Q-Q plot of Height (expected normal vs. observed value) and detrended normal Q-Q plot of Height (deviation from normal vs. observed value)]
- 30. Normality - Examples
Statistically

Descriptives: Height
Measure                                         Statistic   Std. Error
Mean                                            151.65      .356
95% Confidence Interval for Mean, Lower Bound   150.94
95% Confidence Interval for Mean, Upper Bound   152.35
5% Trimmed Mean                                 151.59
Median                                          151.50
Variance                                        27.649
Std. Deviation                                  5.258
Minimum                                         139
Maximum                                         168
Range                                           29
Interquartile Range                             8.00
Skewness                                        .148        .165
Kurtosis                                        .061        .328

Tests of Normality
Test                     Statistic   df    Sig.
Kolmogorov-Smirnov (a)   .060        218   .052
a. Lilliefors Significance Correction

Notes on the output:
• Normal distribution: mean = median = mode.
• Skewness & kurtosis are within ±1.
• p > 0.05, so normal distribution.
• Shapiro-Wilk: only if the sample size is less than 100.
- 32. K-S Test
• The test is very sensitive to the sample size of the data.
• For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.
• For large samples (n > 100), a slight deviation from normality will result in the distribution being reported as not normal.
- 34. Normality Transformation
[Figure: Normal Q-Q plot of PARITY and Normal Q-Q plot of LN_PARIT; expected normal vs. observed value]
- 35. TYPES OF TRANSFORMATIONS
• Square root
• Logarithm
• Inverse
• Reflect and square root
• Reflect and logarithm
• Reflect and inverse
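The six transformations can be sketched as below. Two assumptions are made that the slide does not state: "reflect" is taken to mean subtracting each value from the maximum plus one (a common textbook convention, typically used for left-skewed data), and the logarithm is the natural log. All names and the sample values are illustrative.

```python
import math

def reflect(values):
    """Reflect the data so that the largest value becomes 1 (assumed: max + 1 - x)."""
    top = max(values) + 1
    return [top - x for x in values]

TRANSFORMS = {
    "square root": lambda xs: [math.sqrt(x) for x in xs],  # needs x >= 0
    "logarithm": lambda xs: [math.log(x) for x in xs],     # needs x > 0
    "inverse": lambda xs: [1 / x for x in xs],             # needs x != 0
}

def transform(values, kind, reflected=False):
    """Apply one of the three basic transformations, optionally after reflection."""
    xs = reflect(values) if reflected else list(values)
    return TRANSFORMS[kind](xs)

print(transform([1, 4, 9], "square root"))                  # [1.0, 2.0, 3.0]
print(transform([1, 4, 9], "square root", reflected=True))  # [3.0, 2.449..., 1.0]
```

After transforming, normality has to be checked again on the new variable; the transformation is kept only if it helps.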
- 36. Summarise
• Summarise a large set of data by a few meaningful numbers.
• Single variable analysis
  – For the purpose of describing the data.
  – Example: in one year, what kinds of cases are treated by the Psychiatric Dept?
  – Tables & diagrams are usually used to describe the data.
  – For numerical data, measures of central tendency & spread are usually used.
- 37. Frequency Table
Race      F     %
Malay     760   95.84%
Chinese   5     0.63%
Indian    0     0.00%
Others    28    3.53%
TOTAL     793   100.00%

• Illustrates the frequency observed for each category.
- 38. Frequency Distribution Table
• With more than 20 observations, data are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.

Age        Count   %
0-0.99     25      3.26%
1-4.99     78      10.18%
5-14.99    140     18.28%
15-24.99   126     16.45%
25-34.99   112     14.62%
35-44.99   90      11.75%
45-54.99   66      8.62%
55-64.99   60      7.83%
65-74.99   50      6.53%
75-84.99   16      2.09%
85+        3       0.39%
TOTAL      766
- 42. Mean
• The average of the data collected.
• To calculate the mean, add up the observed values and divide by the number of them.
• A major disadvantage of the mean is that it is sensitive to outlying points.
- 43. Mean: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Total of x = 648
• n = 20
• Mean = 648/20 = 32.4
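The same calculation in Python, as a small check of the arithmetic (variable names are illustrative):

```python
ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

total = sum(ages)   # add up the observed values
n = len(ages)       # number of observations
mean = total / n
print(total, n, mean)  # 648 20 32.4
```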
- 44. Measures of variation - standard deviation
• Tells us how closely the scores in a dataset cluster around the mean. A large SD indicates more varied scores.
• It is a summary measure of the differences of each observation from the mean.
• If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
• Consequently the squares of the differences are added.
- 46. sd: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Mean = 32.4; n = 20

x       (x-mean)^2      x       (x-mean)^2
12      416.16          32      0.16
13      376.36          35      6.76
17      237.16          37      21.16
21      129.96          38      31.36
24      70.56           41      73.96
24      70.56           43      112.36
26      40.96           44      134.56
27      29.16           46      184.96
27      29.16           53      424.36
30      5.76            58      655.36
TOTAL   1405.8          TOTAL   1645

• Total of $(x - \text{mean})^2$ = 1405.8 + 1645 = 3050.8
• Variance = 3050.8/19 = 160.5684
• sd = $\sqrt{160.5684}$ = 12.67
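The steps of this example can be reproduced directly. The sketch follows the slide's method (sum of squared differences divided by n - 1); variable names are illustrative.

```python
import math

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(ages)
mean = sum(ages) / n
sum_sq = sum((x - mean) ** 2 for x in ages)  # total of squared differences
variance = sum_sq / (n - 1)                  # sample variance, divisor n - 1
sd = math.sqrt(variance)

print(round(sum_sq, 1), round(variance, 4), round(sd, 2))  # 3050.8 160.5684 12.67
```

The standard library's `statistics.stdev(ages)` uses the same n - 1 divisor and should give the same result.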
- 47. Median
• The ranked value that lies in the middle of the data.
• The point which has the property that half the data are greater than it, and half the data are less than it.
• If n is even, average the (n/2)th and the (n/2 + 1)th ranked observations.
• "Robust" to outliers.
- 48. Median: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Position of the median: (20+1)/2 = 10.5, i.e. midway between the 10th value (30) and the 11th value (32).
• Therefore the median is (30 + 32)/2 = 31.
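A short sketch of the same rule: sort the data, take the middle value, or average the two middle values when n is even. The function name is illustrative; `statistics.median` in the standard library does the same.

```python
def median(values):
    """Middle ranked value; the average of the two middle values when n is even."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2  # (n/2)th and (n/2 + 1)th ranked values

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
print(median(ages))  # 31.0
```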
- 49. Measures of variation - quartiles
• The range is very susceptible to what are known as outliers.
• A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.
- 50. Quartiles
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• 25th percentile = (24+24)/2 = 24
• 50th percentile = (30+32)/2 = 31 (the median)
• 75th percentile = (41+43)/2 = 42
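Quartiles can be computed in more than one way, which explains small differences between hand calculations and software. The sketch below assumes the "median of each half" method, which matches averaging the 5th and 6th values and the 15th and 16th values of this series, and compares it with the interpolating default of Python's `statistics.quantiles`. The function name is illustrative.

```python
from statistics import median, quantiles

def quartiles_by_halves(values):
    """Q1, Q2, Q3 as the medians of the lower half, the whole and the upper half."""
    xs = sorted(values)
    half = len(xs) // 2   # assumes at least two observations
    lower = xs[:half]
    upper = xs[-half:]    # for odd n the middle value is left out of both halves
    return median(lower), median(xs), median(upper)

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

print(quartiles_by_halves(ages))  # (24.0, 31.0, 42.0)
print(quantiles(ages, n=4))       # [24.0, 31.0, 42.5], interpolated positions
```

Both answers are acceptable as long as the method is stated; the interquartile range is Q3 minus Q1 under either method.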
- 51. Mode
• The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.
• It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g. the mode is 13 with 33% of the occurrences.
- 52. Mode: Example
• 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Modes are 24 (10%) & 27 (10%).
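A small sketch that reports every mode together with its share of the observations, as recommended above. The function name is illustrative.

```python
from collections import Counter

def modes_with_share(values):
    """Map each most-frequent value to the proportion of observations it holds."""
    counts = Counter(values)
    top = max(counts.values())
    n = len(values)
    return {value: count / n for value, count in counts.items() if count == top}

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
print(modes_with_share(ages))  # {24: 0.1, 27: 0.1}
```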
- 53. Mean or Median?
• Which measure of central tendency should we use?
• If the distribution is normal, the mean and SD are the measures to present; otherwise the median and IQR are more appropriate.
- 57. Graphing Categorical Data: Univariate Data
Categorical data
• Tabulating data: the summary table
• Graphing data: pie charts, bar charts, Pareto diagram
[Figure: example pie chart, bar chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD]
- 58. Bar Chart
[Figure: bar chart of type of work, y-axis Percent: Housewife 69, Office work 20, Field work 11]
- 60. Tabulating and Graphing Bivariate Categorical Data
• Contingency tables:

Table 1: Contingency table of pregnancy induced hypertension and SGA (count)

Pregnancy induced hypertension   SGA: Normal   SGA: SGA   Total
No                               103           94         197
Yes                              5             16         21
Total                            108           110        218
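The row and column totals of a contingency table follow mechanically from the four cell counts. A minimal sketch using the counts in Table 1 (the dictionary layout is just one convenient choice):

```python
from collections import defaultdict

# Cell counts keyed by (pregnancy induced hypertension, SGA status).
cells = {
    ("No", "Normal"): 103, ("No", "SGA"): 94,
    ("Yes", "Normal"): 5, ("Yes", "SGA"): 16,
}

row_totals = defaultdict(int)
col_totals = defaultdict(int)
for (pih, sga), count in cells.items():
    row_totals[pih] += count   # totals by hypertension status
    col_totals[sga] += count   # totals by SGA status
grand_total = sum(cells.values())

print(dict(row_totals))  # {'No': 197, 'Yes': 21}
print(dict(col_totals))  # {'Normal': 108, 'SGA': 110}
print(grand_total)       # 218
```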
- 61. Tabulating and Graphing Bivariate Categorical Data
• Side-by-side charts
[Figure: side-by-side bar chart of SGA status (Normal, SGA) by pregnancy induced hypertension (No, Yes); y-axis Count, with bars labelled 103, 94 and 16]
- 63. Tabulating and Graphing Numerical Data
Numerical data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
• Ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem and leaf display:
  2 | 144677
  3 | 028
  4 | 1
• Frequency distributions and cumulative distributions: tables, histograms, polygons, ogive
[Figure: example histogram, frequency polygon and ogive, x-axis 10 to 60]
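The stem and leaf display for this series can be generated with a few lines of Python. The sketch assumes two-digit whole numbers, with the tens digit as the stem and the units digit as the leaf; the function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group the units digits (leaves) of sorted values under their tens digit (stem)."""
    stems = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem] += str(leaf)
    return dict(stems)

raw = [41, 24, 32, 26, 27, 27, 30, 24, 38, 21]
display = stem_and_leaf(raw)
for stem, leaves in display.items():
    print(stem, "|", leaves)
# 2 | 144677
# 3 | 028
# 4 | 1
```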
- 64. Tabulating Numerical Data: Frequency Distributions
• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5, then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
• Count observations & assign to classes
- 65. Frequency Distributions and Percentage Distributions
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class         Midpoint   Freq   %
10.0 - 19.9   14.95      3      15%
20.0 - 29.9   24.95      6      30%
30.0 - 39.9   34.95      5      25%
40.0 - 49.9   44.95      4      20%
50.0 - 59.9   54.95      2      10%
TOTAL                    20     100%
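Counting observations into classes is easy to automate. The sketch below assumes equal-width classes starting at 10 with width 10, as in the table; the function name and parameters are illustrative.

```python
def frequency_distribution(values, start=10, width=10, classes=5):
    """Return (class label, frequency, percent) for equal-width classes."""
    counts = [0] * classes
    for v in values:
        index = int((v - start) // width)
        if not 0 <= index < classes:
            raise ValueError(f"{v} falls outside the class limits")
        counts[index] += 1
    n = len(values)
    rows = []
    for k, freq in enumerate(counts):
        lower = start + k * width
        label = f"{lower:.1f} - {lower + width - 0.1:.1f}"
        rows.append((label, freq, 100 * freq / n))
    return rows

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
table = frequency_distribution(ages)
for label, freq, pct in table:
    print(label, freq, f"{pct:.0f}%")
```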
- 66. Graphing Numerical Data: The Histogram
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of Age; x-axis shows class boundaries and class midpoints 14.95 to 54.95, y-axis Frequency; bar heights 3, 6, 5, 4, 2; no gaps between bars]
- 67. Graphing Numerical Data: The Frequency Polygon
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon; x-axis class midpoints 14.95 to 54.95, y-axis frequency 0 to 7]
- 68. Calculate Measures of Central Tendency & Spread
• We can use the frequency distribution table to calculate:
  – Mean
  – Standard deviation
  – Median
  – Mode
- 69. Mean
$\bar{X} = \dfrac{\sum f \cdot mp}{n}$, where f is the class frequency and mp is the class midpoint.

Class         Midpoint   Freq   Freq x m.p.
10.0 - 19.9   14.95      3      44.85
20.0 - 29.9   24.95      6      149.70
30.0 - 39.9   34.95      5      174.75
40.0 - 49.9   44.95      4      179.80
50.0 - 59.9   54.95      2      109.90
TOTAL                    20     659.00

• Mean = 659/20 = 32.95
• Compare with 32.4 from direct calculation.
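The grouped mean is a weighted average of the class midpoints. A quick check of the figure above (variable names are illustrative):

```python
midpoints = [14.95, 24.95, 34.95, 44.95, 54.95]
freqs = [3, 6, 5, 4, 2]

n = sum(freqs)
sum_f_mp = sum(f * mp for f, mp in zip(freqs, midpoints))  # total of freq x midpoint
grouped_mean = sum_f_mp / n

print(round(sum_f_mp, 2), round(grouped_mean, 2))  # 659.0 32.95
```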
- 70. Standard deviation
$s = \sqrt{\dfrac{\sum f \cdot mp^2 - \dfrac{\left(\sum f \cdot mp\right)^2}{n}}{n - 1}}$

Class         Midpoint   Freq   f.mp     f.mp^2
10.0 - 19.9   14.95      3      44.85    670.51
20.0 - 29.9   24.95      6      149.70   3735.02
30.0 - 39.9   34.95      5      174.75   6107.51
40.0 - 49.9   44.95      4      179.80   8082.01
50.0 - 59.9   54.95      2      109.90   6039.01
TOTAL                    20     659.00   24634.05

$s^2 = \dfrac{24634.05 - 659^2/20}{19} = \dfrac{24634.05 - 21714.05}{19} = \dfrac{2920.00}{19} = 153.68$

$s = \sqrt{153.68} = 12.4$

• Compare with 12.67 from direct calculation.
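The grouped standard deviation can be checked the same way. This sketch applies the computational formula shown on the slide to the class midpoints and frequencies; variable names are illustrative.

```python
import math

midpoints = [14.95, 24.95, 34.95, 44.95, 54.95]
freqs = [3, 6, 5, 4, 2]

n = sum(freqs)
sum_f_mp = sum(f * mp for f, mp in zip(freqs, midpoints))        # sum of f.mp
sum_f_mp2 = sum(f * mp ** 2 for f, mp in zip(freqs, midpoints))  # sum of f.mp^2

variance = (sum_f_mp2 - sum_f_mp ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(round(sum_f_mp2, 2), round(variance, 2), round(sd, 1))  # 24634.05 153.68 12.4
```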
- 71. Median
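Median $= L_1 + i \times \dfrac{\frac{n+1}{2} - f_1}{f_{med}}$

• $L_1$ = lower boundary of the median class; $i$ = class width
• $f_1$ = cumulative frequency of the classes before the median class
• $f_{med}$ = frequency of the median class

Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6
30.0 - 39.9   5    (median class)
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20

• Median = 29.95 + 10 × ((21/2) - 9)/5 = 29.95 + 15/5 = 32.95
• From direct calculation, median = 31.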
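A sketch of the grouped median formula. The lower class boundaries (9.95, 19.95, ...) are inferred from the 29.95 used in the worked calculation, and the function name is illustrative.

```python
def grouped_median(lower_bounds, freqs, width):
    """Median from a frequency table: L1 + i * ((n + 1)/2 - f1) / f_med."""
    n = sum(freqs)
    target = (n + 1) / 2   # rank of the median
    cumulative = 0         # f1: cumulative frequency before the current class
    for lower, freq in zip(lower_bounds, freqs):
        if cumulative + freq >= target:  # the median falls in this class
            return lower + width * (target - cumulative) / freq
        cumulative += freq
    raise ValueError("frequency table is empty")

lower_bounds = [9.95, 19.95, 29.95, 39.95, 49.95]
freqs = [3, 6, 5, 4, 2]
print(round(grouped_median(lower_bounds, freqs, width=10), 2))  # 32.95
```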
- 72. Mode
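Mode $= L_1 + i \times \dfrac{Diff_1}{Diff_1 + Diff_2}$

• $L_1$ = lower boundary of the modal class; $i$ = class width
• $Diff_1$ = modal class frequency minus the frequency of the class before it = 6 - 3 = 3
• $Diff_2$ = modal class frequency minus the frequency of the class after it = 6 - 5 = 1

Class         Freq
10.0 - 19.9   3
20.0 - 29.9   6    (modal class)
30.0 - 39.9   5
40.0 - 49.9   4
50.0 - 59.9   2
TOTAL         20

• Mode = 19.95 + 10 × (3/(3+1)) = 27.45
• Compare with modes of 24 & 27 from direct calculation.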
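A sketch of the grouped mode formula. It assumes a single modal class (the first one is used if several tie) and treats a missing neighbouring class as frequency 0; the lower class boundaries are inferred from the 19.95 used in the worked calculation, and the function name is illustrative.

```python
def grouped_mode(lower_bounds, freqs, width):
    """Mode from a frequency table: L1 + i * Diff1 / (Diff1 + Diff2)."""
    k = max(range(len(freqs)), key=freqs.__getitem__)  # index of the modal class
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    diff1 = freqs[k] - before  # modal frequency minus the class before it
    diff2 = freqs[k] - after   # modal frequency minus the class after it
    return lower_bounds[k] + width * diff1 / (diff1 + diff2)

lower_bounds = [9.95, 19.95, 29.95, 39.95, 49.95]
freqs = [3, 6, 5, 4, 2]
print(round(grouped_mode(lower_bounds, freqs, width=10), 2))  # 27.45
```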
- 75. Survival Function
[Figure: survival function plot; x-axis DURATION (0 to 7), y-axis Cum Survival (0.0 to 1.2); legend: Survival Function, Censored]
- 76. Principles of Graphical Excellence
• Presents data in a way that provides substance, statistics and design.
• Communicates complex ideas with clarity, precision and efficiency.
• Gives the largest number of ideas in the most efficient manner.
• Almost always involves several dimensions.
• Tells the truth about the data.