2. Descriptive Statistics
• Tabular, graphical, or numerical summaries of data.
Age
Mean 42.57
Median 40
Mode 40
Standard Deviation 10.63
Sample Variance 113.01
Range 44
Minimum 21
Maximum 65
Frequency
Female 12
Male 18
Grand Total 30
[Figure: Bar Chart for Opinions. x-axis: Opinion (1 to 5); y-axis: Frequency (0 to 9)]
3. Summarizing Data for Categorical Variables
• Let us focus on Tabular and Graphical summaries first. We will deal
with numerical summaries later.
• Tabular:
• Frequency distribution
• Relative frequency distribution
• Percent frequency distribution
• Graphical:
• Bar chart
• Pie chart
4. Frequency Distribution
• A frequency distribution is a tabular summary of data showing the
number (frequency) of observations in each of several nonoverlapping
categories or classes.
Opinion Frequency
Strongly disagree 8
Disagree 4
Neutral 6
Agree 7
Strongly agree 5
Grand Total 30
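As a minimal sketch of how such a table can be tallied, the snippet below rebuilds a list of raw responses from the counts in the table (the order of responses is an assumption, since the raw data are not shown) and counts each category with Python's `collections.Counter`.

```python
from collections import Counter

# Raw responses rebuilt from the counts in the table above.
# The ordering is arbitrary; only the counts come from the slide.
responses = (
    ["Strongly disagree"] * 8
    + ["Disagree"] * 4
    + ["Neutral"] * 6
    + ["Agree"] * 7
    + ["Strongly agree"] * 5
)

# Counter tallies how many times each category occurs.
frequency = Counter(responses)

for category, count in frequency.items():
    print(f"{category:<18} {count}")
print(f"{'Grand Total':<18} {sum(frequency.values())}")
```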
5. Relative Frequency Distribution
$$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total number of observations}}$$
$$\text{Percent frequency of a class} = \frac{\text{Frequency of the class}}{\text{Total number of observations}} \times 100\%$$
6.
Opinion            Frequency  Relative frequency  Percent frequency
Strongly disagree  8          0.27                27%
Disagree           4          0.13                13%
Neutral            6          0.20                20%
Agree              7          0.23                23%
Strongly agree     5          0.17                17%
Grand Total        30         1.00                100%
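The relative and percent frequencies in this table can be reproduced with a few lines of Python. This is only a sketch; the variable names are illustrative, and the counts are the ones shown in the table.

```python
# Frequencies taken from the opinion table.
frequency = {
    "Strongly disagree": 8,
    "Disagree": 4,
    "Neutral": 6,
    "Agree": 7,
    "Strongly agree": 5,
}

total = sum(frequency.values())

# Relative frequency = class frequency / total number of observations.
relative = {category: count / total for category, count in frequency.items()}

# Percent frequency = relative frequency x 100.
percent = {category: 100 * value for category, value in relative.items()}

for category, count in frequency.items():
    print(f"{category:<18} {count:>3} {relative[category]:.2f} {percent[category]:.0f}%")
```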
9. Summarizing Data for Quantitative Variables
• Let us focus on Tabular and Graphical summaries first. We will deal
with numerical summaries later.
• Tabular:
• Frequency distribution
• Relative frequency distribution
• Percent frequency distribution
• Graphical:
• Histogram
10. Frequency Distribution
• We need to bin/bucket the quantitative variable of interest.
• Three Steps:
1. Determine the number of nonoverlapping classes.
2. Determine the width of each class.
3. Determine the class limits.
• Choosing the number of classes is tricky! It is done by trial and error.
• Five to twenty classes are preferred. (Not too few, not too many, just
enough to informatively show the variation in the frequencies.)
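The three steps can be sketched as follows. The slides do not give a rule for the class width, so this sketch assumes equal-width classes with width = (largest value - smallest value) / number of classes; the ages are hypothetical values, not the course data set.

```python
def frequency_distribution(values, num_classes):
    """Bin values into equal-width, nonoverlapping classes.

    Assumption: width = (max - min) / num_classes. The last class is
    closed on the right so that the maximum value is counted.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; nothing to bin")
    width = (hi - lo) / num_classes            # step 2: class width
    limits = [(lo + k * width, lo + (k + 1) * width)
              for k in range(num_classes)]     # step 3: class limits
    counts = [0] * num_classes
    for v in values:
        index = min(int((v - lo) / width), num_classes - 1)
        counts[index] += 1
    return limits, counts


# Hypothetical ages, used only to illustrate the procedure.
ages = [21, 25, 30, 34, 40, 40, 41, 47, 52, 58, 65]
limits, counts = frequency_distribution(ages, num_classes=4)  # step 1: 4 classes
for (low, high), count in zip(limits, counts):
    print(f"{low:.0f} to {high:.0f}: {count}")
```

Rerunning the function with different values of `num_classes` is one way to carry out the trial and error mentioned above.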
15. Skewness
• To which side is the tail of the distribution longer or more drawn out?
• Positive/Right skew
• Negative/Left skew
• Zero skewness means symmetric distribution.
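The slides do not give a formula for skewness. As one possible illustration, the sketch below uses the moment-based coefficient (third central moment divided by the 1.5 power of the second central moment), which is an assumption on our part; spreadsheet software often uses a slightly different, bias-adjusted version.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


print(skewness([1, 2, 3, 4, 5]))     # symmetric data
print(skewness([1, 1, 1, 2, 10]))    # long right tail: positive
print(skewness([1, 9, 10, 10, 10]))  # long left tail: negative
```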
21. Scatterplot: Visualizing the Relationship Between Two Quantitative Variables
[Figure: Salary vs. Age scatterplot. x-axis: Age (years), 0 to 70; y-axis: Salary, $0 to $90,000]
23. Creating Effective Graphical Displays
• Give the display a clear and concise title.
• Keep the display simple.
• Clearly label each axis and provide the units of measure.
• If colors are used, make sure they are distinct.
• If multiple colors or line types are used, provide a legend.
24. Statistical Inference (Recap)
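[Diagram: population and sample]
• Population: described by a population parameter, e.g., population average income $\mu$.
• Sample: described by a sample statistic, e.g., sample average income $\bar{x}$.
• Draw: the sample is drawn from the population.
• Infer: the sample statistic is used to infer the population parameter.
A sample statistic is a point estimator of the corresponding
population parameter.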
25. Descriptive Statistics: Numerical Measures
• Measures of location:
• Measures of central location: (A single number which indicates a typical value of the data.)
• Sample mean
• Sample median
• Sample mode
• Sample percentiles
• Sample quartiles
• Measures of variability: (A single number which indicates the variability in the data.)
• Sample range
• Sample IQR
• Sample variance
• Sample standard deviation
• Measures of distribution shape: (A single number which lets us know the shape of the distribution of the data.)
• Skewness
• Kurtosis
26. Some Common Notation
• Let $x$ represent a variable of interest.
• Let $n$ be the number of observations in the sample. This is the sample
size.
• Let $x_i$ be the $i$th observation.
• Let $N$ be the number of observations in the population. This is the
size of the population.
27. Measures of Location
• Measures of central location: (A single number which indicates a
typical value of the data.)
• Sample mean
• Sample median
• Sample mode
• Sample percentiles
• Sample quartiles
29. Sample Median
• The median of a data set is the value in the middle when the data
items are arranged in ascending order.
• The median divides the dataset into two parts, each with
approximately 50% of observations.
• Arrange the data in ascending order (smallest value to largest value).
• For an odd number of observations, the median is the middle value.
• For an even number of observations, the median is the average of the two
middle values.
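The procedure above can be sketched in Python as follows; the function name is illustrative.

```python
def sample_median(values):
    """Median via the sort-then-pick-the-middle procedure."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the middle value
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([5, 1, 3]))       # odd number of observations
print(sample_median([4, 1, 3, 2]))    # even number of observations
```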
30. Sample Mode
• The mode of a data set is the value that occurs with greatest
frequency.
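A data set can have more than one value tied for the greatest frequency. As a small sketch, Python's `statistics.multimode` returns every such value; the ages below are hypothetical.

```python
from statistics import multimode

# Hypothetical ages; 40 occurs most often.
ages = [40, 21, 40, 35, 35, 40]
modes = multimode(ages)
print(modes)

# Two values tied for the greatest frequency: both are returned.
tied_modes = multimode([1, 1, 2, 2, 3])
print(tied_modes)
```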
31. Sample Percentile
• The $p$th percentile is a value such that at least $p$ percent of the
observations are less than or equal to this value and at least
$(100 - p)$ percent of the observations are greater than or equal to this value.
32. Sample Percentile
• Arrange the data in ascending order.
• Location of the $p$th percentile: $L_p = \frac{p}{100}(n + 1)$
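A sketch of the location-based calculation follows. The slide gives only the location formula; when the location is not a whole number, this sketch assumes linear interpolation between the two neighbouring ordered values, which is a common convention but is not stated in the slides.

```python
def percentile(values, p):
    """p-th percentile using the location L_p = (p / 100) * (n + 1).

    Assumption: a fractional location is resolved by linear
    interpolation between the two neighbouring ordered values.
    """
    ordered = sorted(values)
    n = len(ordered)
    location = p * (n + 1) / 100      # 1-based position in the ordered data
    lower = int(location)             # whole part of the position
    fraction = location - lower
    if lower < 1:
        return ordered[0]             # position falls before the first value
    if lower >= n:
        return ordered[-1]            # position falls after the last value
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = list(range(1, 12))             # 1, 2, ..., 11 (n = 11)
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```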
33. Sample Quartiles
• The quartiles divide the dataset into four parts, each with
approximately 25% of observations.
• First Quartile 𝑄1 = 25th Percentile
• Second Quartile 𝑄2 = 50th Percentile
• Third Quartile 𝑄3 = 75th Percentile
35. Measures of Variability
• Measures of variability: (A single number which indicates the
variability in the data.)
• Sample range
• Sample IQR
• Sample variance
• Sample standard deviation
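These four measures can be computed with Python's standard `statistics` module, as sketched below. The formulas are not shown in this part of the slides, so the sketch assumes the usual definitions: range = maximum - minimum, IQR = Q3 - Q1, and sample variance with an n - 1 denominator. The quartiles use the "exclusive" method, which is based on (n + 1) positions.

```python
import statistics


def variability_summary(values):
    """Range, IQR, sample variance and sample standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # divides by n - 1
        "std_dev": statistics.stdev(values),      # square root of the variance
    }


summary = variability_summary(list(range(1, 12)))  # 1, 2, ..., 11
print(summary)
```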
38. Box Plot
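[Figure: box plot with the following labeled elements]
• Box from Q1 to Q3, with a line at the median.
• Upper whisker ends at the max value less than the upper inner fence; lower whisker ends at the min value greater than the lower inner fence.
• Inner fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
• Outer fences: Q1 - 3*IQR and Q3 + 3*IQR
• Minor outlier: a value between an inner fence and an outer fence.
• Major outlier: a value beyond an outer fence.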
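The fences can be computed directly from the quartiles. The sketch below assumes the usual convention that minor outliers lie between an inner and an outer fence and major outliers lie beyond an outer fence; the data values are hypothetical.

```python
import statistics


def classify_outliers(values):
    """Return inner fences, outer fences, minor outliers, major outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Minor: outside an inner fence but not beyond the outer fence.
    minor = [v for v in values
             if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]]
    # Major: beyond an outer fence.
    major = [v for v in values if v < outer[0] or v > outer[1]]
    return inner, outer, minor, major


# Hypothetical data with two unusually large values.
data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 40]
inner, outer, minor, major = classify_outliers(data)
print(inner, outer, minor, major)
```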
41. Chebyshev’s Theorem
• At least $\left(1 - \frac{1}{z^2}\right)$ of the data values must be within $z$ standard
deviations of the mean, where $z$ is any value greater than 1.
42. Suppose that you are interested in analyzing the amount of time spent
by users browsing through Swiggy before they come to a decision
about what to order. You know that the average time spent browsing is
6.9 minutes. Suppose that the standard deviation is 1.2 minutes.
• What can you say about the percentage of users who spend between
4.5 minutes and 9.3 minutes browsing Swiggy?
• What can you say about the percentage of users who spend between
5.4 minutes and 9.3 minutes browsing Swiggy?
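One way to approach both questions is sketched below. The first interval is symmetric about the mean (6.9 ± 2 × 1.2), so the theorem applies directly with z = 2. The second interval is not symmetric; this sketch assumes we fall back on the largest symmetric interval it contains (z = 1.25), which gives a valid but conservative lower bound. The function name is illustrative.

```python
def chebyshev_lower_bound(mean, std_dev, low, high):
    """Lower bound on the share of data in [low, high] via Chebyshev.

    Uses the largest interval symmetric about the mean that fits inside
    [low, high]. Returns 0.0 when z <= 1, where the theorem says nothing.
    """
    z = min(mean - low, high - mean) / std_dev
    if z <= 1:
        return 0.0
    return 1 - 1 / z ** 2


first = chebyshev_lower_bound(6.9, 1.2, 4.5, 9.3)   # z = 2
second = chebyshev_lower_bound(6.9, 1.2, 5.4, 9.3)  # z = 1.25
print(f"At least {first:.0%} of users spend between 4.5 and 9.3 minutes")
print(f"At least {second:.0%} of users spend between 5.4 and 9.3 minutes")
```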
44. Covariance
• Covariance is a descriptive measure of the strength of linear association
between two variables.
Sample covariance: $s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Population covariance: $\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
• A positive value indicates a positive linear relationship.
• A negative value indicates a negative linear relationship.
• Sensitive to units of measurement of the variables!
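The sample covariance formula, and its sensitivity to units, can be illustrated with a short sketch. The data are made up for illustration; rescaling x by a factor of 100 (for example, metres to centimetres) multiplies the covariance by 100 even though the relationship is unchanged.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n - 1)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(sample_covariance(x, y))

# Same relationship, x expressed in a unit 100 times smaller.
x_rescaled = [100 * value for value in x]
print(sample_covariance(x_rescaled, y))
```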
48. Correlation
• Correlation coefficient is a dimensionless measure of the strength of linear association
between two variables.
Sample correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
Population correlation coefficient: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
• Always lies between -1 and 1, inclusive.
• Values close to 0 indicate weak linear relationship.
• Values close to 1 indicate strong positive linear relationship.
• Values close to -1 indicate strong negative linear relationship.
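A minimal sketch of the sample correlation coefficient follows, computed as the sample covariance divided by the product of the sample standard deviations. The data are made up for illustration; note that rescaling x leaves the result unchanged, since the coefficient is dimensionless.

```python
import math


def sample_correlation(x, y):
    """Sample correlation: s_xy / (s_x * s_y), all with n - 1 denominators."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r_positive = sample_correlation(x, y)                      # perfectly increasing
r_negative = sample_correlation(x, y[::-1])                # perfectly decreasing
r_rescaled = sample_correlation([100 * v for v in x], y)   # units changed
print(r_positive, r_negative, r_rescaled)
```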