2. Data Description
College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard
● Data collected from 1996–2013
● 2009 dataset chosen for completeness and recency
● 7149 observations / 1484 features
● Each observation corresponds to a unique college
● Features relate to demographics, cost of attendance, proportion of students
receiving financial aid, earnings multiple years after matriculation, etc.
3. Data Description
● Lots of missing data!
● Some information not reported by specific colleges
● Some information suppressed for privacy
4. Data Processing
● Variables with >15% of observations missing were removed
● Response variable created as the ratio of median earnings six years after
matriculation to median debt
● For each variable, missing values were replaced with the median of non-missing
values
● Highly correlated and low variance variables were removed
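The processing steps above can be sketched in pandas. This is a minimal, hypothetical sketch: the column names (`earnings_col`, `debt_col`) and the thresholds beyond the stated 15% missingness cutoff (the low-variance and correlation cutoffs) are assumptions, not the project's actual values.

```python
import numpy as np
import pandas as pd

def preprocess(df, earnings_col, debt_col, miss_thresh=0.15, corr_thresh=0.9):
    """Sketch of the slide's pipeline: drop sparse columns, build the
    earnings-to-debt response, median-impute, and prune low-variance /
    highly correlated predictors. Thresholds other than miss_thresh are
    illustrative assumptions."""
    # 1) Remove variables with >15% of observations missing
    df = df.loc[:, df.isna().mean() <= miss_thresh].copy()

    # 2) Response: median earnings six years after matriculation / median debt
    df["earn_debt_ratio"] = df[earnings_col] / df[debt_col]

    # 3) Replace remaining missing values with each column's median
    df = df.fillna(df.median(numeric_only=True))

    # 4) Drop near-zero-variance predictors (cutoff is an assumption)
    stds = df.std(numeric_only=True)
    df = df[[c for c in df.columns if stds.get(c, 1.0) > 1e-8]]

    # 5) For each highly correlated pair, drop the later column
    preds = df.drop(columns=["earn_debt_ratio"])
    corr = preds.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_thresh).any()]
    return df.drop(columns=to_drop)
```

The order matters: imputing before the correlation check keeps `corr()` from silently pairwise-deleting rows with missing values.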
6. Analysis
● Originally we intended to use data from 2009 to predict earnings to debt ratio for
2011
● Predictors with low amounts of missing values in 2009 had large amounts of
missing values in 2011, and vice versa
● Final data consisted of 5130 observations and 223 predictors
● 2009 data split into training (70%) and testing (30%) sets
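The 70/30 split on the final 5130 × 223 data can be sketched with scikit-learn. The arrays here are random placeholders with the stated shape; the `random_state` is an assumption, not the project's seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5130, 223))    # 5130 observations, 223 predictors
y = rng.normal(loc=2.0, size=5130)  # placeholder earnings-to-debt ratios

# 70% training / 30% testing, as on the slide
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

print(X_train.shape, X_test.shape)  # (3591, 223) (1539, 223)
```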
7. Methodology
Linear Model:
● Poor performance: some predicted ratios were negative, which is impossible for an earnings-to-debt ratio
Lasso:
● Exploratory lasso models selected ~120–130 variables across iterations
● Models achieved an MSE of ~0.45 (R² ~0.65)
Principal Component Analysis:
● No single principal component explained a substantial share of the variance
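The lasso step above can be sketched with scikit-learn's `LassoCV`, which picks the penalty by cross-validation and zeroes out unselected coefficients. The data here is synthetic with a sparse true signal; the reported ~120–130 selected variables and MSE ≈ 0.45 came from the actual Scorecard data, not this sketch.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 223))          # same width as the final data
beta = np.zeros(223)
beta[:30] = rng.normal(size=30)          # only 30 predictors truly matter
y = X @ beta + rng.normal(scale=0.5, size=500)

# Standardize first so the L1 penalty treats predictors comparably
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

n_selected = int(np.sum(model.named_steps["lassocv"].coef_ != 0))
print("variables selected:", n_selected)
```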
8. Random Forest Explained
● Ensemble learning method that aggregates regression trees
● A subset of the total predictors is used to build each tree
● + Handles large numbers of variables without deletion
● + Runs efficiently on large data sets
● + Inherently captures interactions between variables
● - Loss of interpretability
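The points above can be sketched with scikit-learn's `RandomForestRegressor`: `n_estimators` trees are fit on bootstrap samples and averaged, and `max_features` controls the random subset of predictors considered at each split. The data and hyperparameter values here are illustrative assumptions, not the project's.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 50))
# Nonlinear synthetic target with noise
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree sees a bootstrap sample; each split considers a random third
# of the predictors, which decorrelates the trees before averaging.
rf = RandomForestRegressor(n_estimators=300, max_features=1 / 3, random_state=0)
rf.fit(X_tr, y_tr)

print("test R^2:", r2_score(y_te, rf.predict(X_te)))
```

The interpretability loss noted on the slide is partly recoverable via `rf.feature_importances_`, though that gives rankings rather than coefficients.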
12. Conclusion
● Missing data posed the greatest challenge to building an accurate model
● Data was decidedly unclean: redundant variables, missing factor levels, etc.
● Significant amount of data processing required (~¾ of time spent)
● Imputing missing data with median values increased model performance
● The large amount of missing data likely sets an upper bound on the performance
of this model, but more data processing, feature engineering, and additional
tuning of parameters could result in more robust performance.