SlideShare ist ein Scribd-Unternehmen logo
1 von 46
Know your Data
(Background Mathematics)
DATA MINING AND
DATA WAREHOUSING
CS- 6301 (Cr-4)
Dr. Hrudaya Kumar Tripathy
Basic Statistical Descriptions of Data
The entire subject of statistics is based around the idea that we have this
big set of data, and we want to analyse that set in terms of the
relationships between the individual points in that data set.
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
• The most common and effective numeric
measure of the “center” of a set of data is
the (arithmetic) mean. Let x1,x2, ...... ,xN be
a set of N values or observations, such as
for some numeric attribute X, like salary.
• Sometimes, each value xi in a set may be
associated with a weight wi for i = 1, .... ,N.
The weights reflect the significance,
importance, or occurrence frequency
attached to their respective values. In this
case, we can compute
Mean (algebraic measure) (sample vs. population):
Note: N is population size.
The mean of this set of values is
This is called the weighted arithmetic
mean or the weighted average.
• A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it trims any
outliers. Outliers can affect the mean (especially if there are just one or two very large values), so
a trimmed mean can often be a better fit for data sets with erratic high or low values or for
extremely skewed distributions. Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by that of a few
highly paid managers. Similarly, the mean score of a class in an exam could be pulled down quite a
bit by a few very low scores.
• Which is the mean obtained after chopping off values at the high and low extremes.
• Example: Find the trimmed 20% mean for the following test scores: 60, 81, 83, 91, 99.
• Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle three
values: 60, 81, 83, 91, 99.
• Step 2: Find the mean with the remaining values. The mean is (81 + 83 + 91) / 3 ) = 85.
Measuring the Central Tendency
• Median:
• The median of a set of data is the middlemost number in the set. The
median is also the number that is halfway into the set.
• To find the median, the data should first be arranged in order from
least to greatest.
• Middle value if odd number of values, or average of the middle two
values otherwise
• What will be the median estimated by interpolation (for grouped
data)?
Measuring the Central Tendency
• Mode
• The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes.
• It is possible for the greatest frequency to correspond to several different
values, which results in more than one mode.
• Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then there is no
mode.
• Example:
• 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
• 62 appears three times, more often than the other values, so Mode = 62
Measuring the Central Tendency
Mean, Median and Mode from Grouped Frequencies
The Race and the Naughty Puppy
• This starts with some raw data (not a grouped frequency yet) ...
To find the Mean Alex adds up all the numbers, then divides by how many numbers:
Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65+56+59+68+61+67)/21
= 61.38095...
• To find the Median Alex places the numbers in value order and finds the
middle number.
• In this case the median is the 11th number:
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
• Median = 61
• To find the Mode, or modal value, Alex places the numbers in value order
then counts how many of each number. The Mode is the number which
appears most often (there can be more than one mode):
53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70
• 62 appears three times, more often than the other values, so Mode = 62
Mean, Median and Mode from Grouped Frequencies
• Grouped Frequency Table
• Alex then makes a Grouped Frequency Table:
• So 2 runners took between 51 and 55 seconds, 7 took
between 56 and 60 seconds, etc
Mean, Median and Mode from Grouped Frequencies
We can estimate the Mean by using the midpoints.
Let's now make the table using midpoints:
Estimating the Mean from Grouped Data
Estimating the Mean from Grouped Data
• Our thinking is: "2 people took 53 sec, 7 people took 58
sec, 8 people took 63 sec and 4 took 68 sec". In other
words we imagine the data looks like this:
53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68
• Then we add them all up and divide by 21. The quick way
to do it is to multiply each midpoint by each frequency:
And then our estimate of the mean time to complete the
race is:
Estimating the Median from Grouped Data
• Let's look at our data again:
• The median is the middle value, which in our
case is the 11th one, which is in the 61 - 65
group:
• We can say "the median group is 61 - 65"
• But if we want an estimated Median value we
need to look more closely at the 61 - 65 group.
• At 60.5 we already have 9 runners, and by the next
boundary at 65.5 we have 17 runners.
• By drawing a straight line in between we can pick out
where the median frequency of n/2 runners is:
Estimating the Median from Grouped Data
And this handy formula does the calculation:
where:
L is the lower class boundary of the group
containing the median
n is the total number of values
B is the cumulative frequency of the groups before
the median group
G is the frequency of the median group
w is the group width
For our example:
L = 60.5
n = 21
B = 2 + 7 = 9
G = 8
w = 5
width
freq
lfreqn
Lmedian
median
)
)(2/
(1


Estimating the Mode from Grouped Data
• Again, looking at our data:
• We can easily find the modal group (the group with the highest
frequency), which is 61 - 65
• We can say "the modal group is 61 - 65"
• But the actual Mode may not even be in that group! Or there may
be more than one mode. Without the raw data we don't really
know. But, we can estimate the Mode using the following
formula:
where:
L is the lower class boundary of the modal group fm is the frequency of the modal
group
fm-1 is the frequency of the group before the modal group w is the group width
f is the frequency of the group after the modal group
L = 60.5
fm-1 = 7
fm = 8
fm+1 = 4
w = 5
For our example:
Baby Carrots Example
Example: You grew fifty baby carrots using
special soil. You dig them up and measure
their lengths (to the nearest mm) and
group the results:
Age Example
The ages of the 112 people who live on
a tropical island are grouped as follows:
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
• Data in most real applications are not symmetric (a). They may instead be either
positively skewed (b), where the mode occurs at a value that is smaller than the
median or negatively skewed (c), where the mode occurs at a value greater than
the median.
Measuring the Dispersion of Data
• Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
• We now look at measures to assess the dispersion or spread of numeric data.
The measures include range, quartiles, percentiles, and the interquartile
range.
• The five-number summary, which can be displayed as a boxplot, is useful in
identifying outliers.
• Variance and standard deviation also indicate the spread of a data
distribution.
Range
• Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X.
• The range of the set is the difference between the largest (max()) and smallest (min())
values.
Standard Deviation
• The standard deviation (usually abbreviated SD, sd, or just s) of a bunch of numbers tells
you how much the individual numbers tend to differ (in either direction) from the mean.
It’s calculated as follows:
Measuring the Dispersion of Data (cont..)
This formula is saying that you calculate the
standard deviation of a set of N numbers (Xi) by
subtracting the mean from each value to get the
deviation (di) of each value from the mean,
squaring each of these deviations, adding up the
(di)2 terms, dividing by N – 1, and then taking the
square root.
Standard Deviation (cont.)
• This is almost identical to the formula for the root-
mean-square deviation of the points from the mean,
except that it has N – 1 in the denominator instead of
N.
• This difference occurs because the sample mean is
used as an approximation of the true population
mean (which you don’t know). If the true mean were
available to use, the denominator would be N.
• When talking about population distributions, the SD
describes the width of the distribution curve. The
figure shows three normal distributions. They all
have a mean of zero, but they have different
standard deviations and, therefore, different widths.
Each distribution curve has a total area of exactly 1.0,
so the peak height is smaller when the SD is larger.
For an IQ example (84, 84, 89, 91, 110, 114, and 116)
where the mean is 98.3, you calculate the SD as follows:
Standard deviations are very sensitive to extreme
values (outliers) in the data. For example, if the highest
value in the IQ dataset had been 150 instead of 116,
the SD would have gone up from 14.4 to 23.9.
Measuring the Dispersion of Data (cont..)
Why n-1 in Standard Deviation?
Why divide by n-1 rather than n?
You compute the difference between each value and the mean of those values. You don't
know the true mean of the population; all you know is the mean of your sample. Except
for the rare cases where the sample mean happens to equal the population mean, the
data will be closer to the sample mean than it will be to the true population mean.
So the value you compute will probably be a bit smaller (and can't be larger) than what it
would be if you used the true population mean.
To make up for this, divide by n-1 rather than n. This is called Bessel's correction.
But why n-1? If you knew the sample mean, and all but one of the values, you could
calculate what that last value must be. Statisticians say there are n-1 degrees of freedom.
Bessel's correction
• Several other useful measures of dispersion are related to the SD:
• Variance: Variance of N observations, x1, x2, . . . , xN, for a numeric attribute X
is
• The variance is just the square of the SD. For the IQ example, the variance =
14.42 = 207.36.
• Coefficient of variation: The coefficient of variation (CV) is the SD divided by
the mean. For the IQ example, CV = 14.4/98.3 = 0.1465, or 14.65 percent.
Measuring the Dispersion of Data (cont..)
Quartiles
It divide an ordered data set into four equal parts.
The values which divide each part are called the first, second, and third quartiles; they are
denoted by Q1, Q2, and Q3, respectively.
Q1 is the middle value of the first half of the ordered data set, Q2 is the median value in
the set, and Q3 is the middle value in the second half of the ordered data set.
Interquartile range (IQR)
A good measure of the spread of data is the interquartile range (IQR) or the difference
between Q3 and Q1. This gives us the width of the box, as well. A small width means more
consistent data values since it indicates less variation in the data or that data values are
closer together. So, IQR = Q3 - Q1
Measuring the Dispersion of Data (cont..)
Interquartile Range = Q3-Q1
• With an Even Sample Size:
• For the sample (n=10) the median
diastolic blood pressure is 71 (50% of the
values are above 71, and 50% are below).
• The quartiles can be determined in the
same way we determined the median,
except we consider each half of the data
set separately.
• With an Odd Sample Size:
• For the sample (n=10) the median
diastolic blood pressure is 72 (50% of the
values are above 72, and 50% are below).
• When the sample size is odd, the median
and quartiles are determined in the same
way.
• Suppose in the previous example, the
lowest value (62) were excluded, and the
sample size was n=9. The median and
quartiles are indicated below.
Outliers and Tukey Fences:
• Tukey Fences
• When there are no outliers in a sample, the mean and standard deviation are used to
summarize a typical value and the variability in the sample, respectively.
• When there are outliers in a sample, the median and interquartile range are used to
summarize a typical value and the variability in the sample, respectively.
• Outliers are values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1) or equivalently, values
below Q1-1.5 IQR or above Q3+1.5 IQR.
• In previous example, for the diastolic blood pressures, the lower limit is 64 - 1.5(77-64) =
44.5 and the upper limit is 77 + 1.5(77-64) = 96.5. The diastolic blood pressures range from
62 to 81. Therefore there are no outliers.
Example : The Full Framingham Cohort Data
• The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study on residents
of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects
from Framingham, and is now on its third generation of participants.
• Table 1 displays the means, standard deviations, medians, quartiles and interquartile ranges for
each of the continuous variables in the subsample of n=10 participants who attended the
seventh examination of the Framingham Offspring Study.
Table 1 - Summary Statistics on n=10 Participants
• Table 2 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the subsample of n=10 participants.
• Are there outliers in any of the variables? Which statistics are most appropriate to summarize
the average or typical value and the dispersion?
Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants
Since there are no suspected
outliers in the subsample of n=10
participants, the mean and
standard deviation are the most
appropriate statistics to
summarize average values and
dispersion, respectively, of each
of these characteristics.
Continue.....
• For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to
illustrate calculations of summary statistics and determination of outliers. For your interest,
Table 3 displays the means, standard deviations, medians, quartiles and interquartile ranges
for each of the continuous variable displayed in Table 1 in the full sample (n=3,539) of
participants who attended the seventh examination of the Framingham Offspring Study.
Table 3-Summary Statistics on Sample of (n=3,539) Participants
• Table 4 displays the observed minimum and maximum values along with the limits to determine
outliers using the quartile rule for each of the variables in the full sample (n=3,539).
Continue.....
Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3
Are there outliers in any of
the variables?
Which statistics are most
appropriate to summarize
the average or typical values
and the dispersion for each
variable?
• In the full sample, each of the characteristics has outliers on the upper end of the distribution
as the maximum values exceed the upper limits in each case. There are also outliers on the low
end for diastolic blood pressure and total cholesterol, since the minimums are below the lower
limits.
• For some of these characteristics, the difference between the upper limit and the maximum (or
the lower limit and the minimum) is small (e.g., height, systolic and diastolic blood pressures),
while for others (e.g., total cholesterol, weight and body mass index) the difference is much
larger. This method for determining outliers is a popular one but not generally applied as a hard
and fast rule. In this application it would be reasonable to present means and standard
deviations for height, systolic and diastolic blood pressures and medians and interquartile
ranges for total cholesterol, weight and body mass index.
Observations on example......
Boxplot Analysis
• Boxplots are a popular way of visualizing a distribution. A
boxplot incorporates the five-number summary as follows:
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e.,
the height of the box is IQR
• The median is marked by a line within the box
• Two lines (called whiskers) outside the box extend to the
smallest (Minimum) and largest (Maximum) observations.
• Outliers: points beyond a specified outlier threshold,
plotted individually
Boxplot for the unit price data for items sold at four branches of
AllElectronics during a given time period.
Finally, boxplots often provide information about the shape of a data set. The examples below
show some common patterns.
Boxplot Analysis (cont..)
Each of the boxplots illustrates a different
skewness pattern.
If most of the observations are concentrated on
the low end of the scale, the distribution is skewed
right; and vice versa.
If a distribution is symmetric, the observations will
be evenly split at the median, as shown in the
middle figure.
The beauty of the normal curve:
68-95-99.7 Rule
No matter what  and  are,
• the area between - and + is about 68%;
the area between -2 and +2 is about
95%;
• and the area between -3 and +3 is
about 99.7%.
• Almost all values fall within 3 standard
deviations. (μ: mean, σ: standard deviation)
Are my data “normal”?
 Not all continuous random variables are normally distributed!!
 It is important to evaluate how well the data are approximated by
a normal distribution
Quantile Plot
• It cover common graphic displays of data distributions.
• A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. It used to check whether your data is Normal.
• Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest
observation and xN is the largest for some ordinal or numeric attribute X.
• Each observation, xi, is paired with a percentage, fi, which indicates that approximately
fi × 100% of the data are below the value, xi.
• We say “approximately” because there may not be a value with exactly a fraction, fi, of
the data below xi.
• Note that the 0.25 percentage corresponds to quartile Q1, the 0.50 percentage is the
median, and the 0.75 percentage is Q3.
On a quantile plot, xi is graphed against fi. This allows us to compare different
distributions based on their quantiles. For example, given the quantile plots
of sales data for two different time periods, we can compare their Q1,
median, Q3, and other fi values at a glance.
Quantile Plot (cont..)
Quantile–Quantile Plot
• A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate
distribution against the corresponding quantiles of another.
• It is a powerful visualization tool in that it allows the user to view whether there
is a shift in going fromone distribution to another.
Suppose that we have two sets of observations for the attribute or
variable unit price, taken from two different branch locations.
• Let x1, .... ,xN be the data from the first branch, and y1, ......, yM be the
data from the second, where each data set is sorted in increasing
order.
• If M = N (i.e., the number of points in each set is the same), then we
simply plot yi against xi , where yi and xi are both (i-0.5)/N quantiles
of their respective data sets.
• If M < N (i.e., the second branch has fewer observations than the
first), there can be only M points on the q-q plot. Here, yi is the (i-
0.5)/M quantile of the y data, which is plotted against the (i -0.5)/M
quantile of the x data. This computation typically involves
interpolation.
Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch
1 versus branch 2 for that quantile.
Quantile–Quantile Plot (cont.)
(To aid in comparison, the straight line represents the
case where, for each given quantile, the unit price at
each branch is the same.
• The darker points correspond to the data for Q1,
the median, and Q3, respectively.)
• For example, that at Q1, the unit price of items sold
at branch 1 was slightly less than that at branch 2.
• In other words, 25% of items sold at branch 1 were
less than or equal to $60, while 25% of items sold
at branch 2 were less than or equal to $64. At the
50th percentile (marked by the median, which is
also Q2), we see that 50% of items sold at branch 1
were less than $78, while 50% of items at branch 2
were less than $85.
In general, we note that there is a shift in the distribution of branch 1 with respect to branch 2 in that the unit prices of items
sold at branch 1 tend to be lower than those at branch 2.
40
Scatter Plots and Data Correlation
• Provides a first look at bivariate data to see clusters of points, outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane
• The scatter plot is a useful method for providing a first look at bivariate data
to see clusters of points and outliers, or to explore the possibility of
correlation relationships.
• Two attributes, X, and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated).
• If the plotted points pattern slopes from lower left to upper right, this
means that the values of X increase as the values of Y increase, suggesting a
positive correlation .
• If the pattern of plotted points slopes from upper left to lower right, the
values of X increase as the values of Y decrease, suggesting a negative
correlation.
• A line of best fit can be drawn to study the correlation between the
variables.
Scatter Plots and Data Correlation (cont.)
42
Positively and Negatively Correlated Data
• The left half fragment is positively
correlated
• The right half is negative correlated
43
Uncorrelated Data
44
Histogram Analysis
• Histogram: Graph display of tabulated
frequencies, shown as bars
• It shows what proportion of cases fall into
each of several categories
• Differs from a bar chart in that it is the area of
the bar that denotes the value, not the height
as in bar charts, a crucial distinction when the
categories are not of uniform width
• The categories are usually specified as non-
overlapping intervals of some variable. The
categories (bars) must be adjacent
0
5
10
15
20
25
30
35
40
10000 30000 50000 70000 90000
45
Histograms Often Tell More than Boxplots
 The two histograms shown in the left
may have the same boxplot
representation
 The same values for: min, Q1,
median, Q3, max
 But they have rather different data
distributions
Lect 3 background mathematics for Data Mining

Weitere ähnliche Inhalte

Was ist angesagt?

Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Edureka!
 
Scatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionScatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionLong Beach City College
 
Numerical Descriptive Measures
Numerical Descriptive MeasuresNumerical Descriptive Measures
Numerical Descriptive MeasuresYesica Adicondro
 
Frequency Distributions for Organizing and Summarizing
Frequency Distributions for Organizing and Summarizing Frequency Distributions for Organizing and Summarizing
Frequency Distributions for Organizing and Summarizing Long Beach City College
 
Anova (Analysis of variation)
Anova (Analysis of variation)Anova (Analysis of variation)
Anova (Analysis of variation)Shakeel Rehman
 
Chapter 3.pptx
Chapter 3.pptxChapter 3.pptx
Chapter 3.pptxmahamoh6
 
Mixed between-within groups ANOVA
Mixed between-within groups ANOVAMixed between-within groups ANOVA
Mixed between-within groups ANOVAMahsa Farahanynia
 
hypothesis, testing of hypothesis
hypothesis, testing of hypothesishypothesis, testing of hypothesis
hypothesis, testing of hypothesisKavitha Ravi
 
Non parametric methods
Non parametric methodsNon parametric methods
Non parametric methodsPedro Moreira
 
binomial distribution
binomial distributionbinomial distribution
binomial distributionMmedsc Hahm
 
Measures of Central Tendency, Variability and Shapes
Measures of Central Tendency, Variability and ShapesMeasures of Central Tendency, Variability and Shapes
Measures of Central Tendency, Variability and ShapesScholarsPoint1
 
Chap02 presenting data in chart & tables
Chap02 presenting data in chart & tablesChap02 presenting data in chart & tables
Chap02 presenting data in chart & tablesUni Azza Aunillah
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysishktripathy
 
Multiple Linear Regression II and ANOVA I
Multiple Linear Regression II and ANOVA IMultiple Linear Regression II and ANOVA I
Multiple Linear Regression II and ANOVA IJames Neill
 
Lecture 4: Statistical Inference
Lecture 4: Statistical InferenceLecture 4: Statistical Inference
Lecture 4: Statistical InferenceMarina Santini
 
Skewness and Kurtosis[1].pptx
Skewness and Kurtosis[1].pptxSkewness and Kurtosis[1].pptx
Skewness and Kurtosis[1].pptxNethravathiK10
 
What is a single sample t test?
What is a single sample t test?What is a single sample t test?
What is a single sample t test?Ken Plummer
 

Was ist angesagt? (20)

Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
Linear Regression Algorithm | Linear Regression in R | Data Science Training ...
 
Scatterplots, Correlation, and Regression
Scatterplots, Correlation, and RegressionScatterplots, Correlation, and Regression
Scatterplots, Correlation, and Regression
 
Numerical Descriptive Measures
Numerical Descriptive MeasuresNumerical Descriptive Measures
Numerical Descriptive Measures
 
Frequency Distributions for Organizing and Summarizing
Frequency Distributions for Organizing and Summarizing Frequency Distributions for Organizing and Summarizing
Frequency Distributions for Organizing and Summarizing
 
Descriptive statistics
Descriptive statisticsDescriptive statistics
Descriptive statistics
 
Anova (Analysis of variation)
Anova (Analysis of variation)Anova (Analysis of variation)
Anova (Analysis of variation)
 
Chapter 3.pptx
Chapter 3.pptxChapter 3.pptx
Chapter 3.pptx
 
Mixed between-within groups ANOVA
Mixed between-within groups ANOVAMixed between-within groups ANOVA
Mixed between-within groups ANOVA
 
hypothesis, testing of hypothesis
hypothesis, testing of hypothesishypothesis, testing of hypothesis
hypothesis, testing of hypothesis
 
Non parametric methods
Non parametric methodsNon parametric methods
Non parametric methods
 
binomial distribution
binomial distributionbinomial distribution
binomial distribution
 
Measures of Central Tendency, Variability and Shapes
Measures of Central Tendency, Variability and ShapesMeasures of Central Tendency, Variability and Shapes
Measures of Central Tendency, Variability and Shapes
 
Chap02 presenting data in chart & tables
Chap02 presenting data in chart & tablesChap02 presenting data in chart & tables
Chap02 presenting data in chart & tables
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysis
 
Multiple Linear Regression II and ANOVA I
Multiple Linear Regression II and ANOVA IMultiple Linear Regression II and ANOVA I
Multiple Linear Regression II and ANOVA I
 
Quartile
QuartileQuartile
Quartile
 
Lecture 4: Statistical Inference
Lecture 4: Statistical InferenceLecture 4: Statistical Inference
Lecture 4: Statistical Inference
 
Skewness and Kurtosis[1].pptx
Skewness and Kurtosis[1].pptxSkewness and Kurtosis[1].pptx
Skewness and Kurtosis[1].pptx
 
What is a single sample t test?
What is a single sample t test?What is a single sample t test?
What is a single sample t test?
 
Effect size
Effect sizeEffect size
Effect size
 

Ähnlich wie Lect 3 background mathematics for Data Mining

Descriptive Statistics.pptx
Descriptive Statistics.pptxDescriptive Statistics.pptx
Descriptive Statistics.pptxtest215275
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptxjeyanthisivakumar
 
Basics of Stats (2).pptx
Basics of Stats (2).pptxBasics of Stats (2).pptx
Basics of Stats (2).pptxmadihamaqbool6
 
presentation_statistics_1448025870_153985.ppt
presentation_statistics_1448025870_153985.pptpresentation_statistics_1448025870_153985.ppt
presentation_statistics_1448025870_153985.pptAKSAKS12
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptxCalvinAdorDionisio
 
Measures of central tendancy
Measures of central tendancy Measures of central tendancy
Measures of central tendancy Pranav Krishna
 
Describing quantitative data with numbers
Describing quantitative data with numbersDescribing quantitative data with numbers
Describing quantitative data with numbersUlster BOCES
 
Ders 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptxDers 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptxErgin Akalpler
 
2. chapter ii(analyz)
2. chapter ii(analyz)2. chapter ii(analyz)
2. chapter ii(analyz)Chhom Karath
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics Bahzad5
 
best for normal distribution.ppt
best for normal distribution.pptbest for normal distribution.ppt
best for normal distribution.pptDejeneDay
 
statical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.pptstatical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.pptNazarudinManik1
 
Biostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBiostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBatizemaryam
 
measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...SoujanyaLk1
 
Central tendency _dispersion
Central tendency _dispersionCentral tendency _dispersion
Central tendency _dispersionKirti Gupta
 
descriptive data analysis
 descriptive data analysis descriptive data analysis
descriptive data analysisgnanasarita1
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysiskompellark
 
Summary statistics
Summary statisticsSummary statistics
Summary statisticsRupak Roy
 

Ähnlich wie Lect 3 background mathematics for Data Mining (20)

Descriptive Statistics.pptx
Descriptive Statistics.pptxDescriptive Statistics.pptx
Descriptive Statistics.pptx
 
3. Statistical Analysis.pptx
3. Statistical Analysis.pptx3. Statistical Analysis.pptx
3. Statistical Analysis.pptx
 
Basics of Stats (2).pptx
Basics of Stats (2).pptxBasics of Stats (2).pptx
Basics of Stats (2).pptx
 
presentation_statistics_1448025870_153985.ppt
presentation_statistics_1448025870_153985.pptpresentation_statistics_1448025870_153985.ppt
presentation_statistics_1448025870_153985.ppt
 
Measure of Variability Report.pptx
Measure of Variability Report.pptxMeasure of Variability Report.pptx
Measure of Variability Report.pptx
 
Measures of central tendancy
Measures of central tendancy Measures of central tendancy
Measures of central tendancy
 
Describing quantitative data with numbers
Describing quantitative data with numbersDescribing quantitative data with numbers
Describing quantitative data with numbers
 
Ders 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptxDers 1 mean mod media st dev.pptx
Ders 1 mean mod media st dev.pptx
 
2. chapter ii(analyz)
2. chapter ii(analyz)2. chapter ii(analyz)
2. chapter ii(analyz)
 
Unit 2 - Statistics
Unit 2 - StatisticsUnit 2 - Statistics
Unit 2 - Statistics
 
Engineering Statistics
Engineering Statistics Engineering Statistics
Engineering Statistics
 
best for normal distribution.ppt
best for normal distribution.pptbest for normal distribution.ppt
best for normal distribution.ppt
 
statical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.pptstatical-data-1 to know how to measure.ppt
statical-data-1 to know how to measure.ppt
 
Statistics.pdf
Statistics.pdfStatistics.pdf
Statistics.pdf
 
Biostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacyBiostatistics cource for clinical pharmacy
Biostatistics cource for clinical pharmacy
 
measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...measures of central tendency in statistics which is essential for business ma...
measures of central tendency in statistics which is essential for business ma...
 
Central tendency _dispersion
Central tendency _dispersionCentral tendency _dispersion
Central tendency _dispersion
 
descriptive data analysis
 descriptive data analysis descriptive data analysis
descriptive data analysis
 
T7 data analysis
T7 data analysisT7 data analysis
T7 data analysis
 
Summary statistics
Summary statisticsSummary statistics
Summary statistics
 

Mehr von hktripathy

Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematicshktripathy
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your datahktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Lecture7.1 data sampling
Lecture7.1 data samplingLecture7.1 data sampling
Lecture7.1 data samplinghktripathy
 
Lecture5 virtualization
Lecture5 virtualizationLecture5 virtualization
Lecture5 virtualizationhktripathy
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundationshktripathy
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision treehktripathy
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & predictionhktripathy
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmhktripathy
 
Lect5 principal component analysis
Lect5 principal component analysisLect5 principal component analysis
Lect5 principal component analysishktripathy
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your datahktripathy
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 

Mehr von hktripathy (17)

Lect 3 background mathematics
Lect 3 background mathematicsLect 3 background mathematics
Lect 3 background mathematics
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 
Lecture7.1 data sampling
Lecture7.1 data samplingLecture7.1 data sampling
Lecture7.1 data sampling
 
Lecture5 virtualization
Lecture5 virtualizationLecture5 virtualization
Lecture5 virtualization
 
Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Lecture4 big data technology foundations
Lecture4 big data technology foundationsLecture4 big data technology foundations
Lecture4 big data technology foundations
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligence
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Lect9 Decision tree
Lect9 Decision treeLect9 Decision tree
Lect9 Decision tree
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & prediction
 
Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithm
 
Lect5 principal component analysis
Lect5 principal component analysisLect5 principal component analysis
Lect5 principal component analysis
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Lect 2 getting to know your data
Lect 2 getting to know your dataLect 2 getting to know your data
Lect 2 getting to know your data
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 

Kürzlich hochgeladen

How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...christianmathematics
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.MaryamAhmad92
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 

Kürzlich hochgeladen (20)

Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 

Lect 3 background mathematics for Data Mining

  • 1. Know your Data (Background Mathematics) DATA MINING AND DATA WAREHOUSING CS- 6301 (Cr-4) Dr. Hrudaya Kumar Tripathy
  • 2. Basic Statistical Descriptions of Data The entire subject of statistics is based around the idea that we have this big set of data, and we want to analyse that set in terms of the relationships between the individual points in that data set. • Motivation • To better understand the data: central tendency, variation and spread • Data dispersion characteristics • median, max, min, quantiles, outliers, variance, etc. • Numerical dimensions correspond to sorted intervals • Data dispersion: analyzed with multiple granularities of precision • Boxplot or quantile analysis on sorted intervals • Dispersion analysis on computed measures • Folding measures into numerical dimensions • Boxplot or quantile analysis on the transformed cube
  • 3. Measuring the Central Tendency • The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean. Let x1,x2, ...... ,xN be a set of N values or observations, such as for some numeric attribute X, like salary. • Sometimes, each value xi in a set may be associated with a weight wi for i = 1, .... ,N. The weights reflect the significance, importance, or occurrence frequency attached to their respective values. In this case, we can compute Mean (algebraic measure) (sample vs. population): Note: N is population size. The mean of this set of values is This is called the weighted arithmetic mean or the weighted average.
  • 4. • A trimmed mean (sometimes called a truncated mean) is similar to a mean, but it trims any outliers. Outliers can affect the mean (especially if there are just one or two very large values), so a trimmed mean can often be a better fit for data sets with erratic high or low values or for extremely skewed distributions. Even a small number of extreme values can corrupt the mean. • For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers. Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores. • Which is the mean obtained after chopping off values at the high and low extremes. • Example: Find the trimmed 20% mean for the following test scores: 60, 81, 83, 91, 99. • Step 1: Trim the top and bottom 20% from the data. That leaves us with the middle three values: 60, 81, 83, 91, 99. • Step 2: Find the mean with the remaining values. The mean is (81 + 83 + 91) / 3 ) = 85. Measuring the Central Tendency
  • 5. • Median: • The median of a set of data is the middlemost number in the set. The median is also the number that is halfway into the set. • To find the median, the data should first be arranged in order from least to greatest. • Middle value if odd number of values, or average of the middle two values otherwise • What will be the median estimated by interpolation (for grouped data)? Measuring the Central Tendency
  • 6. • Mode • The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. • It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. • Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. • In general, a data set with two or more modes is multimodal. • At the other extreme, if each data value occurs only once, then there is no mode. • Example: • 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70 • 62 appears three times, more often than the other values, so Mode = 62 Measuring the Central Tendency
  • 7. Mean, Median and Mode from Grouped Frequencies The Race and the Naughty Puppy • This starts with some raw data (not a grouped frequency yet) ... To find the Mean Alex adds up all the numbers, then divides by how many numbers: Mean = (59+65+61+62+53+55+60+70+64+56+58+58+62+62+68+65+56+59+68+61+67)/21 = 61.38095...
  • 8. • To find the Median Alex places the numbers in value order and finds the middle number. • In this case the median is the 11th number: 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70 • Median = 61 • To find the Mode, or modal value, Alex places the numbers in value order then counts how many of each number. The Mode is the number which appears most often (there can be more than one mode): 53, 55, 56, 56, 58, 58, 59, 59, 60, 61, 61, 62, 62, 62, 64, 65, 65, 67, 68, 68, 70 • 62 appears three times, more often than the other values, so Mode = 62 Mean, Median and Mode from Grouped Frequencies
  • 9. • Grouped Frequency Table • Alex then makes a Grouped Frequency Table: • So 2 runners took between 51 and 55 seconds, 7 took between 56 and 60 seconds, etc Mean, Median and Mode from Grouped Frequencies
  • 10. We can estimate the Mean by using the midpoints. Let's now make the table using midpoints: Estimating the Mean from Grouped Data
  • 11. Estimating the Mean from Grouped Data • Our thinking is: "2 people took 53 sec, 7 people took 58 sec, 8 people took 63 sec and 4 took 68 sec". In other words we imagine the data looks like this: 53, 53, 58, 58, 58, 58, 58, 58, 58, 63, 63, 63, 63, 63, 63, 63, 63, 68, 68, 68, 68 • Then we add them all up and divide by 21. The quick way to do it is to multiply each midpoint by each frequency: And then our estimate of the mean time to complete the race is:
  • 12. Estimating the Median from Grouped Data • Let's look at our data again: • The median is the middle value, which in our case is the 11th one, which is in the 61 - 65 group: • We can say "the median group is 61 - 65" • But if we want an estimated Median value we need to look more closely at the 61 - 65 group.
  • 13. • At 60.5 we already have 9 runners, and by the next boundary at 65.5 we have 17 runners. • By drawing a straight line in between we can pick out where the median frequency of n/2 runners is: Estimating the Median from Grouped Data And this handy formula does the calculation: where: L is the lower class boundary of the group containing the median n is the total number of values B is the cumulative frequency of the groups before the median group G is the frequency of the median group w is the group width
  • 14. For our example: L = 60.5 n = 21 B = 2 + 7 = 9 G = 8 w = 5 width freq lfreqn Lmedian median ) )(2/ (1  
  • 15. Estimating the Mode from Grouped Data • Again, looking at our data: • We can easily find the modal group (the group with the highest frequency), which is 61 - 65 • We can say "the modal group is 61 - 65" • But the actual Mode may not even be in that group! Or there may be more than one mode. Without the raw data we don't really know. But, we can estimate the Mode using the following formula: where: L is the lower class boundary of the modal group fm is the frequency of the modal group fm-1 is the frequency of the group before the modal group w is the group width f is the frequency of the group after the modal group
  • 16. L = 60.5 fm-1 = 7 fm = 8 fm+1 = 4 w = 5 For our example:
  • 17. Baby Carrots Example Example: You grew fifty baby carrots using special soil. You dig them up and measure their lengths (to the nearest mm) and group the results: Age Example The ages of the 112 people who live on a tropical island are grouped as follows:
  • 18. Symmetric vs. Skewed Data • Median, mean and mode of symmetric, positively and negatively skewed data • Data in most real applications are not symmetric (a). They may instead be either positively skewed (b), where the mode occurs at a value that is smaller than the median or negatively skewed (c), where the mode occurs at a value greater than the median.
  • 19. Measuring the Dispersion of Data • Range, Quartiles, Variance, Standard Deviation, and Interquartile Range • We now look at measures to assess the dispersion or spread of numeric data. The measures include range, quartiles, percentiles, and the interquartile range. • The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers. • Variance and standard deviation also indicate the spread of a data distribution.
  • 20. Range • Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. • The range of the set is the difference between the largest (max()) and smallest (min()) values. Standard Deviation • The standard deviation (usually abbreviated SD, sd, or just s) of a bunch of numbers tells you how much the individual numbers tend to differ (in either direction) from the mean. It’s calculated as follows: Measuring the Dispersion of Data (cont..) This formula is saying that you calculate the standard deviation of a set of N numbers (Xi) by subtracting the mean from each value to get the deviation (di) of each value from the mean, squaring each of these deviations, adding up the (di)2 terms, dividing by N – 1, and then taking the square root.
  • 21. Standard Deviation (cont.) • This is almost identical to the formula for the root- mean-square deviation of the points from the mean, except that it has N – 1 in the denominator instead of N. • This difference occurs because the sample mean is used as an approximation of the true population mean (which you don’t know). If the true mean were available to use, the denominator would be N. • When talking about population distributions, the SD describes the width of the distribution curve. The figure shows three normal distributions. They all have a mean of zero, but they have different standard deviations and, therefore, different widths. Each distribution curve has a total area of exactly 1.0, so the peak height is smaller when the SD is larger. For an IQ example (84, 84, 89, 91, 110, 114, and 116) where the mean is 98.3, you calculate the SD as follows: Standard deviations are very sensitive to extreme values (outliers) in the data. For example, if the highest value in the IQ dataset had been 150 instead of 116, the SD would have gone up from 14.4 to 23.9. Measuring the Dispersion of Data (cont..)
  • 22. Why n-1 in Standard Deviation? Why divide by n-1 rather than n? You compute the difference between each value and the mean of those values. You don't know the true mean of the population; all you know is the mean of your sample. Except for the rare cases where the sample mean happens to equal the population mean, the data will be closer to the sample mean than it will be to the true population mean. So the value you compute will probably be a bit smaller (and can't be larger) than what it would be if you used the true population mean. To make up for this, divide by n-1 rather than n. This is called Bessel's correction. But why n-1? If you knew the sample mean, and all but one of the values, you could calculate what that last value must be. Statisticians say there are n-1 degrees of freedom. Bessel's correction
  • 23. • Several other useful measures of dispersion are related to the SD: • Variance: Variance of N observations, x1, x2, . . . , xN, for a numeric attribute X is • The variance is just the square of the SD. For the IQ example, the variance = 14.42 = 207.36. • Coefficient of variation: The coefficient of variation (CV) is the SD divided by the mean. For the IQ example, CV = 14.4/98.3 = 0.1465, or 14.65 percent. Measuring the Dispersion of Data (cont..)
  • 24. Quartiles It divide an ordered data set into four equal parts. The values which divide each part are called the first, second, and third quartiles; they are denoted by Q1, Q2, and Q3, respectively. Q1 is the middle value of the first half of the ordered data set, Q2 is the median value in the set, and Q3 is the middle value in the second half of the ordered data set. Interquartile range (IQR) A good measure of the spread of data is the interquartile range (IQR) or the difference between Q3 and Q1. This gives us the width of the box, as well. A small width means more consistent data values since it indicates less variation in the data or that data values are closer together. So, IQR = Q3 - Q1 Measuring the Dispersion of Data (cont..)
  • 25. Interquartile Range = Q3-Q1 • With an Even Sample Size: • For the sample (n=10) the median diastolic blood pressure is 71 (50% of the values are above 71, and 50% are below). • The quartiles can be determined in the same way we determined the median, except we consider each half of the data set separately. • With an Odd Sample Size: • For the sample (n=10) the median diastolic blood pressure is 72 (50% of the values are above 72, and 50% are below). • When the sample size is odd, the median and quartiles are determined in the same way. • Suppose in the previous example, the lowest value (62) were excluded, and the sample size was n=9. The median and quartiles are indicated below.
  • 26. Outliers and Tukey Fences: • Tukey Fences • When there are no outliers in a sample, the mean and standard deviation are used to summarize a typical value and the variability in the sample, respectively. • When there are outliers in a sample, the median and interquartile range are used to summarize a typical value and the variability in the sample, respectively. • Outliers are values below Q1-1.5(Q3-Q1) or above Q3+1.5(Q3-Q1) or equivalently, values below Q1-1.5 IQR or above Q3+1.5 IQR. • In previous example, for the diastolic blood pressures, the lower limit is 64 - 1.5(77-64) = 44.5 and the upper limit is 77 + 1.5(77-64) = 96.5. The diastolic blood pressures range from 62 to 81. Therefore there are no outliers.
  • 27. Example : The Full Framingham Cohort Data • The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study on residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham, and is now on its third generation of participants. • Table 1 displays the means, standard deviations, medians, quartiles and interquartile ranges for each of the continuous variables in the subsample of n=10 participants who attended the seventh examination of the Framingham Offspring Study. Table 1 - Summary Statistics on n=10 Participants
  • 28. • Table 2 displays the observed minimum and maximum values along with the limits to determine outliers using the quartile rule for each of the variables in the subsample of n=10 participants. • Are there outliers in any of the variables? Which statistics are most appropriate to summarize the average or typical value and the dispersion? Table 2 - Limits for Assessing Outliers in Characteristics Measured in the n=10 Participants Since there are no suspected outliers in the subsample of n=10 participants, the mean and standard deviation are the most appropriate statistics to summarize average values and dispersion, respectively, of each of these characteristics.
  • 29. Continue..... • For clarity, we have so far used a very small subset of the Framingham Offspring Cohort to illustrate calculations of summary statistics and determination of outliers. For your interest, Table 3 displays the means, standard deviations, medians, quartiles and interquartile ranges for each of the continuous variable displayed in Table 1 in the full sample (n=3,539) of participants who attended the seventh examination of the Framingham Offspring Study. Table 3-Summary Statistics on Sample of (n=3,539) Participants
  • 30. • Table 4 displays the observed minimum and maximum values along with the limits to determine outliers using the quartile rule for each of the variables in the full sample (n=3,539). Continue..... Table 4 - Limits for Assessing Outliers in Characteristics Presented in Table 3 Are there outliers in any of the variables? Which statistics are most appropriate to summarize the average or typical values and the dispersion for each variable?
  • 31. • In the full sample, each of the characteristics has outliers on the upper end of the distribution as the maximum values exceed the upper limits in each case. There are also outliers on the low end for diastolic blood pressure and total cholesterol, since the minimums are below the lower limits. • For some of these characteristics, the difference between the upper limit and the maximum (or the lower limit and the minimum) is small (e.g., height, systolic and diastolic blood pressures), while for others (e.g., total cholesterol, weight and body mass index) the difference is much larger. This method for determining outliers is a popular one but not generally applied as a hard and fast rule. In this application it would be reasonable to present means and standard deviations for height, systolic and diastolic blood pressures and medians and interquartile ranges for total cholesterol, weight and body mass index. Observations on example......
  • 32. Boxplot Analysis • Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number summary as follows: • Five-number summary of a distribution • Minimum, Q1, Median, Q3, Maximum • Boxplot • Data is represented with a box • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR • The median is marked by a line within the box • Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations. • Outliers: points beyond a specified outlier threshold, plotted individually Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time period.
  • 33. Finally, boxplots often provide information about the shape of a data set. The examples below show some common patterns. Boxplot Analysis (cont..) Each of the boxplots illustrates a different skewness pattern. If most of the observations are concentrated on the low end of the scale, the distribution is skewed right; and vice versa. If a distribution is symmetric, the observations will be evenly split at the median, as shown in the middle figure.
  • 34. The beauty of the normal curve: 68-95-99.7 Rule No matter what  and  are, • the area between - and + is about 68%; the area between -2 and +2 is about 95%; • and the area between -3 and +3 is about 99.7%. • Almost all values fall within 3 standard deviations. (μ: mean, σ: standard deviation)
  • 35. Are my data “normal”?  Not all continuous random variables are normally distributed!!  It is important to evaluate how well the data are approximated by a normal distribution
  • 36. Quantile Plot • It cover common graphic displays of data distributions. • A quantile plot is a simple and effective way to have a first look at a univariate data distribution. It used to check whether your data is Normal. • Let xi, for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is the largest for some ordinal or numeric attribute X. • Each observation, xi, is paired with a percentage, fi, which indicates that approximately fi × 100% of the data are below the value, xi. • We say “approximately” because there may not be a value with exactly a fraction, fi, of the data below xi. • Note that the 0.25 percentage corresponds to quartile Q1, the 0.50 percentage is the median, and the 0.75 percentage is Q3.
  • 37. On a quantile plot, xi is graphed against fi. This allows us to compare different distributions based on their quantiles. For example, given the quantile plots of sales data for two different time periods, we can compare their Q1, median, Q3, and other fi values at a glance. Quantile Plot (cont..)
  • 38. Quantile–Quantile Plot • A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the corresponding quantiles of another. • It is a powerful visualization tool in that it allows the user to view whether there is a shift in going fromone distribution to another. Suppose that we have two sets of observations for the attribute or variable unit price, taken from two different branch locations. • Let x1, .... ,xN be the data from the first branch, and y1, ......, yM be the data from the second, where each data set is sorted in increasing order. • If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi , where yi and xi are both (i-0.5)/N quantiles of their respective data sets. • If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-q plot. Here, yi is the (i- 0.5)/M quantile of the y data, which is plotted against the (i -0.5)/M quantile of the x data. This computation typically involves interpolation.
  • 39. Each point corresponds to the same quantile for each data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile. Quantile–Quantile Plot (cont.) (To aid in comparison, the straight line represents the case where, for each given quantile, the unit price at each branch is the same. • The darker points correspond to the data for Q1, the median, and Q3, respectively.) • For example, that at Q1, the unit price of items sold at branch 1 was slightly less than that at branch 2. • In other words, 25% of items sold at branch 1 were less than or equal to $60, while 25% of items sold at branch 2 were less than or equal to $64. At the 50th percentile (marked by the median, which is also Q2), we see that 50% of items sold at branch 1 were less than $78, while 50% of items at branch 2 were less than $85. In general, we note that there is a shift in the distribution of branch 1 with respect to branch 2 in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
  • 40. 40 Scatter Plots and Data Correlation • Provides a first look at bivariate data to see clusters of points, outliers, etc • Each pair of values is treated as a pair of coordinates and plotted as points in the plane
  • 41. • The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points and outliers, or to explore the possibility of correlation relationships. • Two attributes, X, and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated). • If the plotted points pattern slopes from lower left to upper right, this means that the values of X increase as the values of Y increase, suggesting a positive correlation . • If the pattern of plotted points slopes from upper left to lower right, the values of X increase as the values of Y decrease, suggesting a negative correlation. • A line of best fit can be drawn to study the correlation between the variables. Scatter Plots and Data Correlation (cont.)
  • 42. 42 Positively and Negatively Correlated Data • The left half fragment is positively correlated • The right half is negative correlated
  • 44. 44 Histogram Analysis • Histogram: Graph display of tabulated frequencies, shown as bars • It shows what proportion of cases fall into each of several categories • Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width • The categories are usually specified as non- overlapping intervals of some variable. The categories (bars) must be adjacent 0 5 10 15 20 25 30 35 40 10000 30000 50000 70000 90000
  • 45. 45 Histograms Often Tell More than Boxplots  The two histograms shown in the left may have the same boxplot representation  The same values for: min, Q1, median, Q3, max  But they have rather different data distributions