College Scorecard
Predicting Earnings To Debt Ratio
Emdadul Haque and Derek Atwood
Data Description
College Scorecard data: https://www.kaggle.com/kaggle/college-scorecard
● Data collected from 1996 - 2013
● 2009 dataset chosen for completeness and recency
● 7149 observations / 1484 features
● Each observation corresponds to a unique college
● Features related to demographics, cost of attendance, proportion of students
receiving financial aid, earnings multiple years after matriculation, etc.
Data Description
● Lots of missing data!
● Some information not reported by specific colleges
● Some information suppressed for privacy
Data Processing
● Variables with >15% of observations missing were removed
● Response variable created as the ratio of median earnings six years after
matriculation to median debt
● For each variable, missing values were replaced with the median of non-missing
values
● Highly correlated and low variance variables were removed
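
The processing steps above can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function names, the toy columns `cost` and `sparse`, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
from statistics import median

def drop_sparse_columns(rows, max_missing=0.15):
    """Keep only columns whose share of missing (None) values is <= max_missing."""
    n = len(rows)
    keep = [c for c in rows[0] if sum(r[c] is None for r in rows) / n <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]

def impute_median(rows):
    """Replace each None with the median of that column's non-missing values."""
    medians = {c: median(r[c] for r in rows if r[c] is not None) for c in rows[0]}
    return [{c: (medians[c] if r[c] is None else r[c]) for c in r} for r in rows]

def earnings_to_debt(median_earnings, median_debt):
    """Response variable: median earnings divided by median debt."""
    return median_earnings / median_debt

# Hypothetical toy data: "cost" is 10% missing (kept), "sparse" is 30% missing (dropped).
costs = [10, 12, None, 14, 16, 18, 20, 22, 24, 26]
sparse = [1, None, None, None, 5, 6, 7, 8, 9, 10]
rows = [{"cost": c, "sparse": s} for c, s in zip(costs, sparse)]

cleaned = impute_median(drop_sparse_columns(rows))
print(cleaned[2])  # the missing cost is filled with the column median
```

Dropping sparse columns before imputing keeps the median from being estimated on columns that are mostly empty.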
Data Processing
● Outliers diagnosed and removed (~0.5% of response values)
Analysis
● Originally we intended to use data from 2009 to predict the earnings-to-debt
ratio for 2011
● Predictors with few missing values in 2009 had many missing values in 2011,
and vice versa, so the analysis was restricted to 2009
● Final data consisted of 5130 observations and 223 predictors
● 2009 data split into training (70%) and testing (30%) sets
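
A 70/30 split like the one described can be sketched as follows. The helper name `train_test_split` and the fixed seed are assumptions for illustration; the slides do not say how the split was drawn.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows, then cut it into training and testing lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Row indices stand in for the 5130 observations mentioned above.
train, test = train_test_split(range(5130))
print(len(train), len(test))  # 3591 1539
```

Fixing the seed makes the split reproducible, so competing models are compared on the same held-out rows.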
Methodology
Linear Model:
● Poor performance (negative predicted ratios)
Lasso:
● Exploratory lasso models selected ~120-130 variables across iterations
● Models achieved an MSE of ~0.45 (R² ~0.65)
Principal Component Analysis
● No single principal component explained a large share of the variance
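
For reference, the two metrics quoted above are related by R² = 1 - MSE / Var(y). The sketch below computes both on hypothetical toy values. It assumes the usual definitions, since the slides do not state how the metrics were calculated.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - MSE / variance of the observed values."""
    mean_y = sum(y_true) / len(y_true)
    variance = sum((t - mean_y) ** 2 for t in y_true) / len(y_true)
    return 1 - mse(y_true, y_pred) / variance

# Hypothetical earnings-to-debt ratios and predictions.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), r_squared(y_true, y_pred))
```

If the reported R² was computed this way on the test set (an assumption), an MSE of ~0.45 with R² of ~0.65 would correspond to a response variance of roughly 0.45 / 0.35 ≈ 1.3.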
Random Forest Explained
● Ensemble learning method that aggregates regression trees
● A subset of the total predictors is used to build each tree
● + Handles large numbers of variables without deletion
● + Runs efficiently on large data sets
● + Inherently handles interactions between variables
● - Loss of interpretability
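
The idea can be illustrated with a deliberately tiny forest. This is a sketch under simplifying assumptions, not the model from the slides: each "tree" is a single-split stump, the predictor subset is drawn once per tree as the bullet above describes (common implementations draw a fresh subset at every split), and all names and data are hypothetical.

```python
import random

def fit_stump(X, y, features):
    """Find the single split (feature, threshold) with the lowest squared error."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no valid split: predict the sample mean everywhere
        m = sum(y) / len(y)
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean

def fit_forest(X, y, n_trees=200, feature_fraction=0.5, seed=0):
    """Fit each stump on a bootstrap sample using a random subset of predictors."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, int(p * feature_fraction))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(p), k)             # predictor subset for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Aggregate by averaging the individual tree predictions."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Toy data: only feature 0 drives the response; the other three are noise.
X = [[i, (i * 7) % 10, (i * 3) % 10, (i * 9) % 10] for i in range(20)]
y = [0.0 if i < 10 else 1.0 for i in range(20)]
forest = fit_forest(X, y)
low, high = [0, 0, 0, 0], [19, 0, 0, 0]
print(predict_forest(forest, low), predict_forest(forest, high))
```

Because roughly half of the trees never see feature 0, the averaged prediction is pulled toward the overall mean. That is the price of decorrelating the trees.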
Random Forest
Final Model:
● One-half of the total predictors used per tree
● Forest of 200 trees
● MSE of ~0.3 (R² ~0.75)
Conclusion
● Missing data posed the greatest challenge to building an accurate model
● Data was decidedly unclean - redundant variables, missing factor levels, etc.
● Significant amount of data processing required (~¾ of time spent)
● Imputing missing data with median values increased model performance
● The large amount of missing data likely sets an upper bound on the performance
of this model, but more data processing, feature engineering, and additional
tuning of parameters could result in more robust performance.
Questions?
