2. REVIEW
• To find out the relationship between Y and X, we:
1. Collect a sample of data
2. Run a regression of Y on X
3. Find the beta
Problem: what if the sample we collected gave us an incorrect beta and in fact there is no
relationship between Y and X?
3. HYPOTHESIS TESTING
A BIG PICTURE REVIEW OF WHERE WE ARE GOING
We want to learn about the slope of the population regression line. We have data from a
sample, so there is sampling uncertainty. There are four steps towards this goal:
1. State the population object of interest
2. Provide an estimator of this population object
3. In large samples the sampling distribution of the estimator is approximately normal; use that!
4. Find the standard error (estimator of the standard deviation) of the sampling
distribution of the estimator and use that to conduct tests
4. WHAT TESTS?
• We test whether the slope 𝛽1, which we estimated from the sample, is different from zero.
• If 𝛽1 = 0 in 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑢𝑖, then there is no linear relationship between Y and X. Case
closed
• Basically we will assume that 𝛽1 = 0 and show that, under that assumption, an estimate
like the one we found would be very unlikely
• If we do not have enough evidence against that assumption, we fail to reject 𝛽1 = 0: we
cannot conclude that there is a relationship between Y and X
5. MORE ON STANDARD ERROR
• Remember we ask a question about a population and then try to find the answer by
drawing repeated samples.
Example: what is the relationship between the amount of caffeinated drinks one consumes
and the number of hours they sleep?
The standard deviation of the betas you found is 0.24 and the mean is -0.26114
• In reality we only have one sample, so we cannot calculate this standard deviation. Instead
we estimate it and call the estimate the standard error
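The idea of repeated samples can be sketched with a small simulation. This is only an illustration under stated assumptions: the population below (hours of sleep = 8 - 0.25 * drinks + noise), the sample size of 100 and the 2000 repetitions are all made up, and the standard error uses the simple homoskedasticity-only formula. It shows that the standard error computed from one sample is an estimate of the standard deviation of the betas across many samples.

```python
import random
import statistics


def ols_slope_and_se(x, y):
    """Return the OLS slope of y on x and its homoskedasticity-only standard error."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance uses n - 2 degrees of freedom (two estimated coefficients).
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sxx) ** 0.5
    return slope, se


def draw_sample(rng, n, beta0, beta1, noise_sd):
    """Draw one sample from a hypothetical population: y = beta0 + beta1 * x + u."""
    x = [rng.uniform(0, 5) for _ in range(n)]
    y = [beta0 + beta1 * xi + rng.gauss(0, noise_sd) for xi in x]
    return x, y


rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumed population: hours of sleep = 8 - 0.25 * drinks + noise (made-up numbers).
results = [
    ols_slope_and_se(*draw_sample(rng, 100, 8.0, -0.25, 1.5))
    for _ in range(2000)
]
slopes = [slope for slope, _ in results]

mean_of_slopes = statistics.fmean(slopes)
sd_of_slopes = statistics.stdev(slopes)   # what we could compute with many samples
se_from_first_sample = results[0][1]      # what we can compute with one sample

print("mean of betas:", round(mean_of_slopes, 3))
print("standard deviation of betas:", round(sd_of_slopes, 3))
print("standard error from the first sample alone:", round(se_from_first_sample, 3))
```

The last two printed numbers should be close to each other, which is the reason one sample is enough to conduct a test.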
6. EXAMPLE
1. Population object of interest: the relationship between class size and test scores
𝑡𝑒𝑠𝑡𝑠𝑐𝑟𝑖 = 𝛽0 + 𝛽1 𝑆𝑇𝑅𝑖 + 𝑢𝑖
• Estimator: the OLS slope estimator of 𝛽1; estimate: -2.28
• Sample size – large?
7. HYPOTHESIS TEST
• We are going to assume that the relationship between class size and test score is zero
(we hypothesize that 𝛽1 = 0) . The mean of the sampling distribution is then 0
• The standard error is the estimator of the standard deviation
• If the estimate is too far away from the mean, there is a very low chance of getting that
estimate when our assumption is true, so we conclude that the assumption is incorrect
• To calculate the distance we will compare
(𝛽1 − 𝛽1𝐻) / 𝑆𝐸
(called the t-statistic) to one of the t-critical values: 1.65; 1.96; 2.58
8. EXAMPLE CONT’D
• The standard error is 0.479
• The t-stat is then (-2.28-0)/0.479=-4.76
• In absolute value, 4.76 is greater than all three of the critical values.
• We always describe the results using the largest critical value that the t-stat exceeds in
absolute value
• Hence here we will build our conclusion on 2.58, since it is the farthest from zero.
• The conclusion is that the “coefficient on student-teacher ratio is statistically significant
at 1%” or “we reject the null hypothesis at the 1% level”
• This means that the probability of falsely rejecting our assumption/hypothesis (𝛽1 = 0)
given that it is true is 1%
• Note that this also means that the coefficient on student-teacher ratio is statistically
significant at 5% and 10%: significance at 1% implies significance at the 5% and 10%
levels
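The calculation in this example can be written as a short script. It is a minimal sketch: the function names are made up for illustration, and the inputs are the estimate of -2.28 and the standard error of 0.479 from the class-size example, with the three critical values 1.65, 1.96 and 2.58.

```python
def t_statistic(estimate, hypothesized, standard_error):
    """Distance between the estimate and the hypothesized value, in standard errors."""
    return (estimate - hypothesized) / standard_error


# Critical values, ordered from the farthest from zero to the closest.
CRITICAL_VALUES = [(2.58, "1%"), (1.96, "5%"), (1.65, "10%")]


def significance_level(t_stat):
    """Return the strictest level whose critical value |t| exceeds, or None."""
    for critical, level in CRITICAL_VALUES:
        if abs(t_stat) > critical:
            return level
    return None  # fail to reject the null at the 10% level


t = t_statistic(estimate=-2.28, hypothesized=0.0, standard_error=0.479)
print(round(t, 2), significance_level(t))  # -4.76 1%
```

The comparison uses the absolute value of the t-stat, so a large negative t-stat and a large positive one lead to the same conclusion.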
9. LANGUAGE
• Notice that we either
• Reject the null hypothesis or
• Fail to reject the null hypothesis (if the estimated beta is too close to the hypothesized
value of zero)/we don’t have enough evidence to reject the null hypothesis
• We never accept the null hypothesis
• If we fail to reject the null then we do not interpret the beta. The beta in this case is
not statistically different from zero.
10. EXAMPLE
• Let’s imagine we are asking the same question but we get a slightly different standard
error, standard error = 1.29
• What is the null hypothesis?
• What is the alternative hypothesis?
• What is the t-stat?
• Which t-critical value should we compare it to?
• What is our conclusion?
11. EXAMPLE 2
• Let’s imagine we are asking the same question but we get a slightly different standard
error, standard error = 1.43
• What is the null hypothesis?
• What is the alternative hypothesis?
• What is the t-stat?
• Which t-critical value should we compare it to?
• What is our conclusion?
12. EXAMPLE 3
• Let’s imagine we are asking the same question but we get a slightly different standard
error, standard error = 1.15
• What is the null hypothesis?
• What is the alternative hypothesis?
• What is the t-stat?
• Which t-critical value should we compare it to?
• What is our conclusion?
13. T-STATS AND LEVELS OF SIGNIFICANCE
• If you find the t-stat to be
1. -0.7
2. 3.25
3. 1.14
4. -1.88
5. 2.08
What corresponding levels of statistical significance would you choose?
14. A NEW EXAMPLE!
• We are asking what is the relationship between GDP and mortality.
• What is our population of interest? Parameter? Estimator?
• What is the null hypothesis?
• Alternative hypothesis?
• Run the regression of mortality on GDP in Stata
• The standard error is printed in Stata output
• What is the t-stat?
• What critical value should we compare it to?
• What is our conclusion?
• To make your life easier, Stata also prints out the t-stat so you do not have to calculate it
each time you run a regression
15. DIGRESSION BUT AN IMPORTANT ONE
• How do we interpret the beta on GDP? (GDP is in dollars, mortality is in infant deaths
per 1000 live births)
• Is this interesting?
• What are the mean and standard deviation of GDP?
• How can we make our interpretation sound more interesting?
16. BACK TO HYPOTHESIS TESTING
• Using the real estate data set, please find the t-stat and corresponding significance level
in the regressions of the price of a house on the
• Number of bedrooms
• Age of the house
• Please interpret the coefficients
17. P-VALUE
• P-value is your best friend!
• It is printed in Stata output
• The smallest significance level at which the null hypothesis could be rejected, based on
the test statistic actually observed
• For example, a p-value lower than 1 percent (0.01) means that the null hypothesis would
be rejected at the 1 percent level; a p-value between 0.01 and 0.05 means that the null
hypothesis would be rejected at the 5 percent level
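To see where a p-value comes from, here is a minimal sketch that turns a t-stat into a two-sided p-value. It assumes the large-sample normal approximation; regression software typically uses the t distribution instead, which gives nearly the same answer in large samples. The function names are made up for illustration.

```python
from statistics import NormalDist


def two_sided_p_value(t_stat):
    """Two-sided p-value under the large-sample normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


def rejected_at(p_value, alpha):
    """The null is rejected at level alpha when the p-value is below alpha."""
    return p_value < alpha


p = two_sided_p_value(2.08)
print(round(p, 3))
for alpha in (0.01, 0.05, 0.10):
    print(f"reject at {alpha:.0%}?", rejected_at(p, alpha))

# A t-stat far from zero gives a p-value that prints as 0.000 to three decimals.
print(f"{two_sided_p_value(-4.76):.3f}")  # 0.000
```

For a t-stat of 2.08 the p-value falls between 0.01 and 0.05, so the null is rejected at the 5% and 10% levels but not at the 1% level.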
19. P-VALUE, EXAMPLE
• In the regression of test score on class size the P-value is 0.000. This means it is less than
0.01 (1%). Thus, we can reject the null hypothesis at 1% level
• By the way, what is the null hypothesis?
• If P-value is between 0.01 and 0.05 then we can reject the null hypothesis at 5% level
• If P-value is between 0.05 and 0.1 then we can reject the null hypothesis at 10% level
• If P-value is 0.1 or greater then we fail to reject the null
• What are the p-values in the regressions of price of the house on its age and on the
number of bedrooms?
20. IN THE REAL WORLD
• You will meet two types of people
• Those who know of hypothesis testing (for example, if you go to graduate school). If you
write a paper and you use regression analysis in that paper, there is no need to discuss
what your null and alternative hypotheses are. You only need to interpret statistically
significant betas, mention that they are statistically significant (and at what level) and
mention if betas are not statistically significant
• Those who have no idea what hypothesis testing is (for example, your client at work).
You only then need to interpret statistically significant betas. Make sure your
interpretation is coherent
21. PRACTICE
• Please use the California schools dataset and regress test scores on
• Number of computers
• Number of computers per student
• Average income
• Number of teachers
• Based on the results of the regressions interpret the coefficients and state the level of
their statistical significance
22. REVIEW
• What is the default hypothesized value?
• What happens if we hypothesize the relationship between two variables to be
something else?
• T-critical values and corresponding significance levels. Please give an example.
• How to use p-value. Please give an example.
• Do you interpret a coefficient if it is not statistically significant?
Editor's Notes
Let's imagine we had big enough samples. Then if I drew the normal curve the mean would be in the middle, and betas from 90% of the samples would lie within 1.65 standard deviations of the mean (between -1.65*sd and 1.65*sd). Betas from 95% of the samples would lie between -1.96*sd and 1.96*sd, and betas from 99% of the samples would lie between -2.58*sd and 2.58*sd
Hopefully the sample size is large enough so that the distribution of our beta is normal. Graph the normal distribution curve. Show that most of the betas will lie around the mean. Under the null hypothesis the mean is zero, so most of them will be close to zero: betas from 90% of the samples will lie between -1.65*sd and 1.65*sd, betas from 95% of the samples will lie between -1.96*sd and 1.96*sd, and betas from 99% of the samples will lie between -2.58*sd and 2.58*sd
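The 90%, 95% and 99% figures in these notes can be checked against the standard normal distribution. A minimal sketch, using Python's standard library (the helper name is made up):

```python
from statistics import NormalDist


def coverage(critical_value):
    """Share of a normal distribution within +/- critical_value sd of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(critical_value) - standard_normal.cdf(-critical_value)


for critical_value in (1.65, 1.96, 2.58):
    print(critical_value, round(coverage(critical_value), 3))
```

The three shares come out at roughly 0.90, 0.95 and 0.99, which is why 1.65, 1.96 and 2.58 correspond to the 10%, 5% and 1% significance levels.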