Modeling Social Data, Lecture 6: Regression, Part 1
1. Regression
APAM E4990
Modeling Social Data
Jake Hofman
Columbia University
February 24, 2017
Jake Hofman (Columbia University) Regression February 24, 2017 1 / 6
4. Definition
“The primary goal in a regression analysis is to
understand, as far as possible with the available data,
how the conditional distribution of the response varies
across subpopulations determined by the possible
values of the predictor or predictors.”
- “Applied Regression Including Computing and Graphics”,
Cook & Weisberg (1999)
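As a rough illustration of this definition, the sketch below summarizes how a response varies across subpopulations defined by a single categorical predictor. The function name `conditional_summary` and the toy data are hypothetical and chosen only for illustration; they do not come from the lecture.

```python
from collections import defaultdict
from statistics import mean, pstdev


def conditional_summary(predictor, response):
    """Group responses by predictor value and summarize each group."""
    groups = defaultdict(list)
    for x, y in zip(predictor, response):
        groups[x].append(y)
    # One compact summary (count, mean, standard deviation) per subpopulation.
    return {
        x: {"n": len(ys), "mean": mean(ys), "sd": pstdev(ys)}
        for x, ys in groups.items()
    }


# Hypothetical toy data: two subpopulations "a" and "b".
group = ["a", "a", "a", "b", "b", "b"]
score = [10.0, 12.0, 14.0, 20.0, 20.0, 26.0]

summary = conditional_summary(group, score)
for name, stats in summary.items():
    print(name, stats)
```

Comparing the per-group means and spreads is already a simple description of the conditional distribution of the response given the predictor.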
5. Goals
Describe
Provide a compact summary of outcomes under different conditions
Predict
Make forecasts for future outcomes or unobserved conditions
Explain
Account for associations between predictors and outcomes
6. Goals
Describe
Provide a compact summary of outcomes under different conditions
Never “false”, but may be wasteful or misleading
Predict
Make forecasts for future outcomes or unobserved conditions
Varying degrees of success, often room for improvement
Explain
Account for associations between predictors and outcomes
Difficult to establish causality in observational studies
See “Regression Analysis: A Constructive Critique”, Berk (2004)
7. Goals
Models should be flexible enough to describe observed phenomena
but simple enough to generalize to future observations
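One way to see this trade-off is to compare models of different flexibility on training data and on fresh data. The sketch below is an illustration under stated assumptions, not part of the lecture: the data are simulated from a noisy line, and the three models (a constant, a least squares line, and a nearest-neighbor lookup) are chosen here as stand-ins for "too simple", "about right", and "too flexible".

```python
import random
from statistics import mean


def fit_constant(xs, ys):
    """Too simple: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least squares line through the training points."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return lambda x: intercept + slope * x


def fit_nearest_neighbor(xs, ys):
    """Very flexible: predict the outcome of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def simulate(n, rng):
    """Simulated data (an assumption): y = 1 + 2x + Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

results = {}
for name, fit in [
    ("constant", fit_constant),
    ("line", fit_line),
    ("nearest neighbor", fit_nearest_neighbor),
]:
    model = fit(train_x, train_y)
    results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    print(f"{name:17s} train MSE = {results[name][0]:.2f}  test MSE = {results[name][1]:.2f}")
```

The nearest-neighbor model reproduces the training outcomes exactly, yet on this simulated data it predicts new observations worse than the line does, while the constant model is poor on both sets.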
8. Examples [1]
1.2 Setting the Regression Context
“[...] of some response y varies across subpopulations determined by the possible values of the predictor or predictors” (Cook and Weisberg, 1999: 27). That is, interest centers on the distribution of the response variable Y conditional on one or more predictors X.
This definition includes a wide variety of elementary procedures [...] implemented in R. (See, for example, Maindonald and Braun, 2007: Ch. 2.) For example, consider Figures 1.1 and 1.2. The first shows the distribution of SAT scores for recent applicants to a major university, who self-identify as “Hispanic.” The second shows the distribution of SAT scores for recent applicants to that same university, who self-identify as “Asian.”
Fig. 1.1. Distribution of SAT scores for Hispanic applicants.
It is clear that the two distributions differ substantially. The Asian distribution is shifted to the right, leading to a distribution with a higher [mean] (1227 compared to 1072), a smaller standard deviation (170 compared to [...]) and greater skewing. A comparative description of the two histograms [...] constitutes a proper regression analysis. Using various summary statistics, some key features of the two displays are compared and contrasted ([...]
[Figure: histogram "SAT Scores for Asian Applicants"; x-axis: SAT Score (600 to 1600); y-axis: Frequency (0 to 150)]
Fig. 1.2. Distribution of SAT scores for Asian applicants.
Should one be especially interested in a comparison of the means, one could proceed descriptively with a conventional least squares regression analysis as a special case. That is, for each observation i, one could let
$\hat{y}_i = \beta_0 + \beta_1 x_i$, (1.1)
where the response variable y is each applicant’s SAT score, x is an indicator [...]
[1] “Statistical Learning from a Regression Perspective”, Berk (2008)
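To make the special case in equation (1.1) concrete, the sketch below fits a least squares line to a 0/1 indicator and checks that the intercept equals the mean of the group coded 0 and the slope equals the difference in group means. The scores are hypothetical toy values, not the data from Berk (2008), and which group is coded 1 is an arbitrary assumption.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical scores for two groups; x is a 0/1 group indicator.
x = [0, 0, 0, 1, 1, 1]
y = [1000.0, 1100.0, 1200.0, 1150.0, 1250.0, 1350.0]

b0, b1 = least_squares_line(x, y)

mean_group0 = mean(yi for xi, yi in zip(x, y) if xi == 0)
mean_group1 = mean(yi for xi, yi in zip(x, y) if xi == 1)

print("intercept:", b0, "mean of group 0:", mean_group0)
print("slope:", b1, "difference in means:", mean_group1 - mean_group0)
```

With an indicator as the only predictor, the fitted values are just the two group means, so the regression is a compact way of describing the comparison of means.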
9. Examples [1]
[...] graph more legible.
[Figure: plot "SAT Score by Household Income"; x-axis: Income Bounded at $100,000 (2e+04 to 1e+05); y-axis: SAT Score (800 to 1600)]
Fig. 1.4. SAT scores by family income.
[1] “Statistical Learning from a Regression Perspective”, Berk (2008)
10. Examples [1]
[Figure: panels of Freshman GPA (y-axis, 1 to 4) against SAT score (x-axis, 400 to 1600), conditioned on High School GPA (0.5 to 4.0)]
Fig. 1.5. Freshman GPA on SAT holding high school GPA constant.
[1] “Statistical Learning from a Regression Perspective”, Berk (2008)
11. Framework
• Specify the outcome and predictors, along with the form of
the model relating them
• Define a loss function that quantifies how close a model’s
predictions are to observed outcomes
• Develop an algorithm to fit the model to the observations by
minimizing this loss
• Assess model performance and interpret results.
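The four steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the lecture's own code: the model is a single-predictor line, the loss is mean squared error, the fitting algorithm is plain gradient descent, and the data are simulated. All function names and parameter values are hypothetical.

```python
import random


# 1. Specify the model: outcome y as a linear function of one predictor x.
def predict(params, x):
    intercept, slope = params
    return intercept + slope * x


# 2. Define the loss: mean squared error between predictions and outcomes.
def mean_squared_error(params, xs, ys):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(params, xs, ys):
    """Gradient of the mean squared error with respect to (intercept, slope)."""
    n = len(xs)
    residuals = [predict(params, x) - y for x, y in zip(xs, ys)]
    d_intercept = 2 * sum(residuals) / n
    d_slope = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
    return d_intercept, d_slope


# 3. Fit the model by minimizing the loss with gradient descent.
def fit(xs, ys, learning_rate=0.1, steps=5000):
    params = (0.0, 0.0)
    for _ in range(steps):
        g = gradient(params, xs, ys)
        params = (
            params[0] - learning_rate * g[0],
            params[1] - learning_rate * g[1],
        )
    return params


def simulate(n, rng):
    """Simulated data (an assumption): y = 1 + 2x + small Gaussian noise."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(1)
train_x, train_y = simulate(200, rng)
test_x, test_y = simulate(200, rng)

params = fit(train_x, train_y)

# 4. Assess performance on held-out data and interpret the coefficients.
test_mse = mean_squared_error(params, test_x, test_y)
print("intercept, slope:", params)
print("held-out MSE:", test_mse)
```

For squared loss on a linear model a closed-form solution also exists; gradient descent is used here only because it makes the "minimize a loss" step explicit and carries over to other losses.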