Session 3&4.pptx

Descriptive Statistics
• Tabular, graphical, or numerical summaries of data.
Age
Mean 42.57
Median 40
Mode 40
Standard Deviation 10.63
Sample Variance 113.01
Range 44
Minimum 21
Maximum 65
Frequency
Female 12
Male 18
Grand Total 30
[Figure: Bar Chart for Opinions. x-axis: Opinion (1 to 5); y-axis: Frequency (0 to 9).]

Summarizing Data for Categorical Variables
• Let us focus on Tabular and Graphical summaries first. We will deal
with numerical summaries later.
• Tabular:
• Frequency distribution
• Relative frequency distribution
• Percent frequency distribution
• Graphical:
• Bar chart
• Pie chart

Frequency Distribution
• A frequency distribution is a tabular summary of data showing the
number (frequency) of observations in each of several non-
overlapping categories or classes.
Opinion Frequency
Strongly disagree 8
Disagree 4
Neutral 6
Agree 7
Strongly agree 5
Grand Total 30

Relative Frequency Distribution
\[ \text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total number of observations}} \]
\[ \text{Percent frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total number of observations}} \times 100\% \]

Opinion Frequency Relative frequency Percent Frequency
Strongly disagree 8 0.27 27%
Disagree 4 0.13 13%
Neutral 6 0.20 20%
Agree 7 0.23 23%
Strongly agree 5 0.17 17%
Grand Total 30 1.00 100%
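
As a rough illustration, the relative and percent frequency columns in the table above can be recomputed from the raw counts. The sketch below uses only the opinion counts shown in the table; the function name `frequency_table` is a hypothetical helper, not something from the slides.

```python
# Opinion counts copied from the frequency table above.
frequency = {
    "Strongly disagree": 8,
    "Disagree": 4,
    "Neutral": 6,
    "Agree": 7,
    "Strongly agree": 5,
}


def frequency_table(frequency):
    """Map each class to (frequency, relative frequency, percent frequency)."""
    total = sum(frequency.values())  # total number of observations
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in frequency.items()
    }


table = frequency_table(frequency)
for category, (count, rel, pct) in table.items():
    print(f"{category:<18} {count:>3} {rel:>6.2f} {pct:>5.0f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick sanity check on any such table.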

Bar Chart
[Figure: Number of people in each age category. x-axis: Age category (Elderly, Middle-aged, Young); y-axis: Frequency (0 to 25).]

Pie Chart
Age distribution of people

Summarizing Data for Quantitative Variables
• Let us focus on Tabular and Graphical summaries first. We will deal
with numerical summaries later.
• Tabular:
• Frequency distribution
• Relative frequency distribution
• Percent frequency distribution
• Graphical:
• Histogram

• We need to bin/bucket the quantitative variable of interest.
• Three Steps:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
• Choosing the number of classes is tricky! It is done by trial and error.
• Five to twenty classes are preferred. (Not too few, not too many, just
enough to informatively show the variation in the frequencies.)

\[ \text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}} \]

Class Frequency
[31000, 35200] 1
(35200, 39400] 3
(39400, 43600] 2
(43600, 47800] 7
(47800, 52000] 3
(52000, 56200] 4
(56200, 60400] 3
(60400, 64600] 4
(64600, 68800] 1
(68800, 73000] 0
(73000, 77200] 0
(77200, 81400] 2
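
The class limits in the table above are consistent with applying the approximate class width formula to a smallest value of 31000, a largest value of 81400, and 12 classes. That reading of the table is an assumption, as are the helper names `class_edges` and `class_index`; the sketch below only shows how such limits could be generated and how a value is assigned to a class.

```python
def class_edges(smallest, largest, num_classes):
    """Return the class limits for equal-width, nonoverlapping classes."""
    width = (largest - smallest) / num_classes  # approximate class width
    return [smallest + k * width for k in range(num_classes + 1)]


def class_index(value, edges):
    """Index of the class holding value.

    The first class is closed, [a, b]; every later class is (a, b],
    matching the interval notation used in the table.
    """
    if value == edges[0]:
        return 0
    for k in range(len(edges) - 1):
        if edges[k] < value <= edges[k + 1]:
            return k
    raise ValueError("value lies outside the class limits")


# Assumed inputs, read off the first and last class limits of the table.
edges = class_edges(31000, 81400, 12)
print(edges[1] - edges[0])        # class width
print(class_index(35200, edges))  # upper limits belong to the lower class
```

In practice the computed width is often rounded to a convenient number, and the number of classes is adjusted by trial and error until the histogram shows the variation clearly.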

Relative/Percent Frequency
Class Frequency Rel. Freq. Perc. Freq.
[31000, 35200] 1 0.033 3.33
(35200, 39400] 3 0.100 10.00
(39400, 43600] 2 0.067 6.67
(43600, 47800] 7 0.233 23.33
(47800, 52000] 3 0.100 10.00
(52000, 56200] 4 0.133 13.33
(56200, 60400] 3 0.100 10.00
(60400, 64600] 4 0.133 13.33
(64600, 68800] 1 0.033 3.33
(68800, 73000] 0 0.000 0.00
(73000, 77200] 0 0.000 0.00
(77200, 81400] 2 0.067 6.67

Skewness
• To which side is the tail of the distribution longer or more drawn out?
• Positive/Right skew
• Negative/Left skew
• Zero skewness means symmetric distribution.

Summarizing Data for Two Categorical Variables
• Tabular
• Crosstabulation
• Graphical
• Side-by-side bar chart
• Stacked bar chart

Crosstabulation
Strongly agree Agree Neutral Disagree Strongly disagree Grand Total
Elderly 0 0 0 0 3 3
Middle-aged 4 6 5 2 4 21
Young 1 1 1 2 1 6
Grand Total 5 7 6 4 8 30
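
A crosstabulation like the one above can be checked, and turned into the percentages a 100% stacked bar chart needs, with a few lines of code. This is a minimal sketch using the counts from the table; the variable names are illustrative, and the choice to compute percentages within each opinion (column percentages) is an assumption about how the stacked chart was built.

```python
opinions = ["Strongly agree", "Agree", "Neutral", "Disagree", "Strongly disagree"]

# Rows are age categories; columns follow the order of `opinions`.
crosstab = {
    "Elderly":     [0, 0, 0, 0, 3],
    "Middle-aged": [4, 6, 5, 2, 4],
    "Young":       [1, 1, 1, 2, 1],
}

# Marginal totals: the "Grand Total" column and row of the table.
row_totals = {age: sum(counts) for age, counts in crosstab.items()}
column_totals = [sum(column) for column in zip(*crosstab.values())]
grand_total = sum(row_totals.values())

# Share of each age category within each opinion, in percent.
column_percent = {
    age: [100 * count / total for count, total in zip(counts, column_totals)]
    for age, counts in crosstab.items()
}

for opinion, total in zip(opinions, column_totals):
    print(f"{opinion:<18} {total}")
```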

Side-by-side Bar Chart
[Figure: Opinions vs. age categories (side-by-side bar chart). x-axis: Opinions (Strongly agree, Agree, Neutral, Disagree, Strongly disagree); y-axis: Frequency (0 to 7).]

Stacked Bar Chart
[Figure: Opinions vs. age categories (stacked bar chart). x-axis: Opinions (Strongly agree, Agree, Neutral, Disagree, Strongly disagree); y-axis: Percentage (0% to 100%).]

Scatterplot: Visualizing the Relationship Between Two Quantitative Variables
[Figure: Salary vs. Age (scatterplot). x-axis: Age (years), 0 to 70; y-axis: Salary, $0 to $90,000.]

Creating Effective Graphical Displays
• Give the display a clear and concise title.
• Keep the display simple.
• Clearly label each axis and provide the units of measure.
• If colors are used, make sure they are distinct.
• If multiple colors or line types are used, provide a legend.

Statistical Inference (Recap)
• Draw a sample from the population; use the sample to infer about the population.
• Population parameter: e.g., population average income 𝜇.
• Sample statistic: e.g., sample average income x̄.
A sample statistic is a point estimator of the corresponding
population parameter.

Descriptive Statistics: Numerical Measures
• Measures of location:
• Measures of central location: (A
single number which indicates a
typical value of the data.)
• Sample mean
• Sample median
• Sample mode
• Sample percentiles
• Sample quartiles
• Measures of variability: (A single
number which indicates the
variability in the data.)
• Sample range
• Sample IQR
• Sample variance
• Sample standard deviation
• Measures of distribution shape: (A
single number which lets us know
the shape of the distribution of the
data.)
• Skewness
• Kurtosis

Some Common Notation
• Let 𝑥 represent a variable of interest.
• Let 𝑛 be the number of observations in the sample. This is the sample
size.
• Let 𝑥ᵢ be the 𝑖th observation.
• Let 𝑁 be the number of observations in the population. This is the
size of the population.

Measures of Location
• Measures of central location: (A single number which indicates a
typical value of the data.)
• Sample mean
• Sample median
• Sample mode
• Sample percentiles
• Sample quartiles

Sample Mean
\[ \text{Sample mean: } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ \text{Population mean: } \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]

Sample Median
• The median of a data set is the value in the middle when the data
items are arranged in ascending order.
• The median divides the dataset into two parts, each with
approximately 50% of observations.
• Arrange the data in ascending order (smallest value to largest value).
• For an odd number of observations, the median is the middle value.
• For an even number of observations, the median is the average of the two
middle values.

Sample Mode
• The mode of a data set is the value that occurs with greatest
frequency.
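
The definitions of the sample mean, median, and mode above translate directly into code. The sketch below is illustrative only: the two small data sets are made up, and the function names are hypothetical helpers rather than anything from the slides.

```python
from collections import Counter


def sample_mean(data):
    """Sum of the observations divided by the sample size n."""
    return sum(data) / len(data)


def sample_median(data):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def sample_modes(data):
    """All values that occur with the greatest frequency."""
    counts = Counter(data)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical ages, chosen only to exercise the odd and even cases.
ages_odd = [21, 35, 40, 40, 65]
ages_even = [21, 35, 40, 48, 52, 65]

print(sample_mean(ages_odd), sample_median(ages_odd), sample_modes(ages_odd))
print(sample_median(ages_even))
```

Returning a list of modes is a design choice: a data set can have more than one value tied for the greatest frequency.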

Sample Percentile
• The 𝑝th percentile is a value such that at least 𝑝 percent of the
observations are less than or equal to this value and at least (100 − 𝑝)
percent of the observations are greater than or equal to this value.

Sample Percentile
• Arrange the data in ascending order.
• Location of the 𝑝th percentile:
\[ L_p = \frac{p}{100}(n + 1) \]
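
The location rule above gives a position in the sorted data, which may fall between two observations. One common way to handle that, assumed here since the slides do not state it, is to interpolate linearly between the two neighbouring values. The data set and the function name `percentile` are hypothetical.

```python
def percentile(data, p):
    """p-th percentile using the location L_p = (p / 100) * (n + 1).

    Positions are 1-based. A fractional position is resolved by linear
    interpolation between neighbours (an assumption, not from the slides).
    """
    ordered = sorted(data)
    n = len(ordered)
    location = p / 100 * (n + 1)
    lower = int(location)         # whole part of the position
    fraction = location - lower   # how far to move toward the next value
    if lower < 1:
        return ordered[0]         # position falls before the first value
    if lower >= n:
        return ordered[-1]        # position falls at or after the last value
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


# Hypothetical sample of n = 11 ages.
ages = [21, 25, 30, 35, 40, 40, 48, 52, 60, 65, 70]
for p in (25, 50, 75, 80):
    print(p, percentile(ages, p))
```

With 11 observations, the 25th, 50th, and 75th percentiles fall exactly on positions 3, 6, and 9, so no interpolation is needed for them.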

Sample Quartiles
• The quartiles divide the dataset into four parts, each with
approximately 25% of observations.
• First Quartile 𝑄1 = 25th Percentile
• Second Quartile 𝑄2 = 50th Percentile
• Third Quartile 𝑄3 = 75th Percentile

Measures of Variability
• Measures of variability: (A single number which indicates the
variability in the data.)
• Sample range
• Sample IQR
• Sample variance
• Sample standard deviation

Sample Range
Sample range = Largest value − Smallest value

Sample Interquartile Range (IQR)
𝐼𝑄𝑅 = 𝑄3 − 𝑄1

Box Plot
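• Box: drawn from Q1 to Q3, with a line at the median.
• Inner fences: Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.
• Outer fences: Q1 − 3 × IQR and Q3 + 3 × IQR.
• Whiskers: extend down to the minimum value greater than the lower inner fence and up to the maximum value less than the upper inner fence.
• Minor outlier: a value between an inner fence and the corresponding outer fence.
• Major outlier: a value beyond an outer fence.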
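
The fences and outlier categories of the box plot can be computed directly from the quartiles. The sketch below is an illustration under stated assumptions: the data are made up, the name `box_plot_summary` is hypothetical, and quartiles come from Python's `statistics.quantiles` with `method="exclusive"`, which places quantiles using an (n + 1)-based position with linear interpolation.

```python
import statistics


def box_plot_summary(data):
    """Quartiles, fences, whisker ends, and outliers for a box plot."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences
    inside = [x for x in data if inner[0] <= x <= inner[1]]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "inner_fences": inner,
        "outer_fences": outer,
        # Whiskers end at the most extreme values still within the inner fences.
        "whiskers": (min(inside), max(inside)),
        # Minor outliers lie between an inner and an outer fence.
        "minor_outliers": [x for x in data
                           if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]],
        # Major outliers lie beyond an outer fence.
        "major_outliers": [x for x in data if x < outer[0] or x > outer[1]],
    }


# Hypothetical data with two unusually large values.
values = [21, 25, 30, 35, 40, 40, 48, 52, 60, 120, 200]
summary = box_plot_summary(values)
print(summary["inner_fences"], summary["outer_fences"])
print(summary["minor_outliers"], summary["major_outliers"])
```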

Sample Variance
\[ \text{Sample variance: } s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
\[ \text{Population variance: } \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

Sample Standard Deviation
\[ \text{Sample standard deviation: } s = \sqrt{s^2} \]
\[ \text{Population standard deviation: } \sigma = \sqrt{\sigma^2} \]
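
The only differences between the sample and population formulas are the divisor (n − 1 versus N) and the mean used as the centre. A minimal sketch with made-up data, using hypothetical function names, makes the contrast concrete.

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    size = len(values)
    mu = sum(values) / size
    return sum((x - mu) ** 2 for x in values) / size


# Hypothetical observations; the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)            # 32 / 7
sigma2 = population_variance(data)    # 32 / 8
print(s2, math.sqrt(s2))              # sample variance and standard deviation
print(sigma2, math.sqrt(sigma2))      # population variance and standard deviation
```

Treating the same eight numbers as a sample gives a slightly larger variance than treating them as a whole population, because the divisor is smaller.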

Chebyshev’s Theorem
• At least (1 − 1/𝑧²) of the data values must be within 𝑧 standard
deviations of the mean, where 𝑧 is any value greater than 1.

Suppose that you are interested in analyzing the amount of time spent
by users browsing through Swiggy before they come to a decision
about what to order. You know that the average time spent browsing is
6.9 minutes. Suppose that the standard deviation is 1.2 minutes.
• What can you say about the percentage of users who spend between
4.5 minutes and 9.3 minutes browsing Swiggy?
• What can you say about the percentage of users who spend between
5.4 minutes and 9.3 minutes browsing Swiggy?
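
One way to approach the two questions above is to convert each interval endpoint into a number of standard deviations from the mean and then apply Chebyshev's theorem. The sketch below is a worked illustration, not an official solution; the function names are hypothetical. For an interval that is not symmetric about the mean, it applies the theorem to the largest symmetric interval contained in it, which gives a valid but conservative bound.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


def chebyshev_interval_bound(mean, sd, low, high):
    """Lower bound on the proportion of values in [low, high].

    Uses the widest interval symmetric about the mean that fits inside
    [low, high]; everything in that interval is also in [low, high].
    """
    z = min(mean - low, high - mean) / sd
    return chebyshev_lower_bound(z)


mean, sd = 6.9, 1.2  # minutes, from the example above

# 4.5 and 9.3 are both 2 standard deviations from the mean.
bound_1 = chebyshev_interval_bound(mean, sd, 4.5, 9.3)

# 5.4 is 1.25 standard deviations below the mean; 9.3 is 2 above.
bound_2 = chebyshev_interval_bound(mean, sd, 5.4, 9.3)

print(f"at least {bound_1:.0%} of users")
print(f"at least {bound_2:.0%} of users")
```

So at least 75% of users spend between 4.5 and 9.3 minutes, and at least 36% spend between 5.4 and 9.3 minutes. Both figures are guaranteed minimums that hold for any distribution shape; the actual percentages may be considerably higher.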

Measures of Association Between Two Variables
• Covariance
• Correlation

Covariance
• Covariance is a descriptive measure of the strength of linear association
between two variables.
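\[ \text{Sample covariance: } s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \]
\[ \text{Population covariance: } \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]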
• Positive value → positive linear relationship
• Negative value → negative linear relationship
• Sensitive to units of measurement of the variables!

Correlation
• Correlation coefficient is a dimensionless measure of the strength of linear association
between two variables.
\[ \text{Sample correlation coefficient: } r_{xy} = \frac{s_{xy}}{s_x s_y} \]
\[ \text{Population correlation coefficient: } \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
• Bounded between −1 and 1, inclusive.
• Values close to 0 indicate weak linear relationship.
• Values close to 1 indicate strong positive linear relationship.
• Values close to -1 indicate strong negative linear relationship.
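
To see why the correlation coefficient is preferred when units differ, it helps to compute both measures on the same data expressed in two different units. The sketch below uses made-up age and salary figures and hypothetical function names; it is an illustration of the sample formulas, not of any particular data set from the slides.

```python
import math


def sample_covariance(x, y):
    """Average product of deviations from the means, with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical data: age in years, salary in currency units and in thousands.
age = [25, 30, 35, 40, 45]
salary = [30000, 38000, 41000, 52000, 60000]
salary_thousands = [s / 1000 for s in salary]

cov_units = sample_covariance(age, salary)
cov_thousands = sample_covariance(age, salary_thousands)
r_units = sample_correlation(age, salary)
r_thousands = sample_correlation(age, salary_thousands)

print(cov_units, cov_thousands)  # covariance changes with the unit of measurement
print(r_units, r_thousands)      # correlation does not
```

Rescaling salary by a factor of 1000 rescales the covariance by the same factor, while the correlation coefficient is unchanged.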
