Job Salary Prediction
with Python
John Maiden
NYC Data Science Academy
Motivation
How did I scrape the jobs?
Jobs were scraped from the following:
‱ Dice.com
‱ Monster.com
‱ Startupers.com
‱ Ventureloop.com
All credit goes to:
Craig Perler
CTO & Founder
projectSHERPA
Job Selection Process
‱ Iterative process based on keywords, text analysis using NLTK, and manually reviewing the jobs
‱ Went from 70k jobs to a final set of 585 based on the following criteria:
  â–Ș Keywords:
    â–Ș (Job title contained: 'data science', 'data scientist', 'statistical', 'statistician' OR
      job description contained: 'predictive modeling', 'data mining', 'text mining', 'machine learning', 'natural language processing') AND
    â–Ș Job title did not contain: 'intern', 'internship'
  â–Ș Skills:
    â–Ș Job had at least one tagged skill and did not contain the following: 'Salesforce', 'VBA', 'Sharepoint', 'Drupal'
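The keyword and skill criteria above can be sketched as a single predicate. This is a minimal illustration rather than the code used in the project: the record layout (a dict with `title`, `description`, and `skills` keys) and the function names are assumptions, and the manual review and NLTK steps are not represented.

```python
# Keyword lists are taken from the slide; everything else is hypothetical.
TITLE_KEYWORDS = ("data science", "data scientist", "statistical", "statistician")
DESCRIPTION_KEYWORDS = ("predictive modeling", "data mining", "text mining",
                        "machine learning", "natural language processing")
EXCLUDED_TITLE_KEYWORDS = ("intern", "internship")
EXCLUDED_SKILLS = {"salesforce", "vba", "sharepoint", "drupal"}


def contains_any(text, keywords):
    """Case-insensitive substring match against any of the keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def is_selected(job):
    """Apply the keyword and skill criteria to one job record (a dict)."""
    matches_topic = (contains_any(job["title"], TITLE_KEYWORDS)
                     or contains_any(job["description"], DESCRIPTION_KEYWORDS))
    is_internship = contains_any(job["title"], EXCLUDED_TITLE_KEYWORDS)
    skills = {skill.lower() for skill in job["skills"]}
    # At least one tagged skill, and none of the excluded ones.
    has_usable_skills = bool(skills) and not (skills & EXCLUDED_SKILLS)
    return matches_topic and not is_internship and has_usable_skills
```

Plain substring matching means a title such as "International Statistician" would also be rejected by the 'intern' rule; a word-boundary match would be stricter if that matters.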
Cleaning Up the Jobs Part 1
‱ Job data was good, but needed some further cleaning
  â–Ș Converted the posted pay rate text into a number
  â–Ș Assigned job seniority ('junior', 'default', 'senior') based on job title keywords
‱ Used python-linkedin to pull in more company data
  â–Ș Name, Description, Industry, Company Size, Company Type, Specialities
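The two cleaning steps above might look roughly like the sketch below. The slides do not show the format of the posted pay rate text or the title keywords that were used, so the patterns handled here ("$120,000", "100k - 130k") and both keyword lists are assumptions made for illustration.

```python
import re

# Hypothetical keyword lists; the slides only name the three seniority levels.
SENIOR_KEYWORDS = {"senior", "sr", "lead", "principal", "director"}
JUNIOR_KEYWORDS = {"junior", "jr", "entry", "associate"}


def parse_pay_rate(text):
    """Return the mean of the amounts found in the text, or None if there are none.

    Handles thousands separators and a trailing 'k'; a range such as
    '100k - 130k' is reduced to its midpoint.
    """
    amounts = []
    for number, suffix in re.findall(r"(\d[\d,]*(?:\.\d+)?)\s*([kK])?", text):
        value = float(number.replace(",", ""))
        if suffix:
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None
    return sum(amounts) / len(amounts)


def job_seniority(title):
    """Map a job title to 'junior', 'default', or 'senior' by whole-word match."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_KEYWORDS:
        return "senior"
    if words & JUNIOR_KEYWORDS:
        return "junior"
    return "default"
```

Hourly or daily rates would need an extra conversion step that this sketch does not attempt.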
Cleaning Up the Jobs Part 2
‱ Added more data columns
  â–Ș Job Posted Year, Job Posted Month, Number of words in Job Description, Number of characters in Job Description
‱ Converted text fields to numeric values
  â–Ș Job Seniority, Employee Count, Company Type
‱ Converted Company Industry and Specialities to binary valued columns
  â–Ș Used only the following specialities: 'Big Data', 'Analytics', 'Machine Learning', 'analytics', 'Data Science', 'Big Data Analytics', 'Natural Language Processing', 'Predictive Analytics', 'Data Mining'
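As a rough illustration of the column building described above, the sketch below derives the date and text-length columns, encodes seniority as a number, and expands the specialities into binary columns. The record layout, the column names, and the 0/1/2 seniority codes are assumptions; only the speciality list comes from the slide.

```python
from datetime import date

# Speciality list as given on the slide (note it contains both
# 'Analytics' and 'analytics', which are matched case-sensitively here).
SPECIALITIES = ["Big Data", "Analytics", "Machine Learning", "analytics",
                "Data Science", "Big Data Analytics",
                "Natural Language Processing", "Predictive Analytics",
                "Data Mining"]

# Hypothetical numeric encoding of the three seniority levels.
SENIORITY_CODES = {"junior": 0, "default": 1, "senior": 2}


def build_features(job):
    """Turn one cleaned job record (a dict) into a flat dict of numeric features."""
    description = job["description"]
    features = {
        "posted_year": job["posted"].year,
        "posted_month": job["posted"].month,
        "description_words": len(description.split()),
        "description_chars": len(description),
        "seniority": SENIORITY_CODES[job["seniority"]],
    }
    company_specialities = set(job["specialities"])
    for speciality in SPECIALITIES:
        # One binary column per speciality: 1 if the company lists it, else 0.
        features["speciality_" + speciality] = int(speciality in company_specialities)
    return features


example = build_features({
    "posted": date(2014, 3, 5),  # made-up posting date
    "description": "Build predictive models",
    "seniority": "senior",
    "specialities": ["Big Data", "Retail"],
})
```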
Modeling Time!
Problem
‱ Only 67 jobs (out of 585) have posted salaries!
Solution
What do we do?
Test using two different datasets
‱ Pay Rate - job salaries as provided by the poster (67 records)
‱ Estimated Salary - job salaries as provided by a separate model (584 records)
Model Selection
‱ Most of the competitors in the Kaggle competition used Random Forest.
‱ Zygmunt Zając made a suggestion:
Initial Results
Tested with linear models and Random Forest

Pay Rate
Model                       Training Score  Testing Score
Ordinary Linear Regression  0.795           -32,120,852.970
Ridge Regression            0.669           -0.079
Lasso Regression            -0.009          -0.021
Random Forest               1.000           0.147

Estimated Salary
Model                       Training Score  Testing Score
Ordinary Linear Regression  0.430           -330,266.154
Ridge Regression            0.361           0.163
Lasso Regression            0.031           -0.051
Random Forest               1.000           0.325
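The very large negative testing scores are easier to read with the scoring formula in hand. The slides do not say which metric was used; the sketch below assumes the scores are the coefficient of determination (R^2), which is what scikit-learn regressors return from `score`. Under that assumption a score of 0 means "no better than always predicting the mean", and the score has no lower bound when predictions are far off.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up salaries, purely to show the range of the score.
actual = [90_000, 100_000, 110_000]

perfect = r_squared(actual, actual)                          # exact predictions
baseline = r_squared(actual, [100_000] * 3)                  # always predict the mean
unstable = r_squared(actual, [1_000_000, -500_000, 2_000_000])  # wildly off
```

With only 67 records and many binary columns, an unregularized linear fit can extrapolate wildly on unseen rows, which would be consistent with the Ordinary Linear Regression testing scores in the tables.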
Can we do better???
Changing the shape of the data
[Figure panels: Original Data, Log Data, Sqrt Data]
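The log and square-root reshaping named in the panels can be sketched as a pair of functions, one to transform the target before fitting and one to map predictions back to dollars. The function names and the sample salaries are made up for illustration.

```python
import math


def transform(salaries, kind):
    """Apply a 'log' or 'sqrt' transform to a list of positive salaries."""
    if kind == "log":
        return [math.log(s) for s in salaries]
    if kind == "sqrt":
        return [math.sqrt(s) for s in salaries]
    raise ValueError("unknown transform: " + kind)


def inverse(values, kind):
    """Map transformed values (e.g. model predictions) back to salaries."""
    if kind == "log":
        return [math.exp(v) for v in values]
    if kind == "sqrt":
        return [v ** 2 for v in values]
    raise ValueError("unknown transform: " + kind)


def spread(values):
    """Ratio of the largest to the smallest value, a crude measure of skew."""
    return max(values) / min(values)


salaries = [40_000, 90_000, 250_000]  # hypothetical values
raw_spread = spread(salaries)
sqrt_spread = spread(transform(salaries, "sqrt"))
log_spread = spread(transform(salaries, "log"))
```

Both transforms compress the high end of the range, the log more strongly than the square root, so a few very large salaries weigh less on the fit.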
Smoothing out the data
Our salary data is too granular - let's round to units of 10k
[Figure panels: Original Data, Smoothed Data]
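Rounding to units of 10k can be written in one line. The slide does not say how ties were handled, so this sketch assumes round-half-up; Python's built-in `round` rounds halves to the nearest even number, which would send 85,000 to 80,000 instead.

```python
import math


def smooth_salary(salary, unit=10_000):
    """Round a salary to the nearest multiple of `unit`, with halves rounding up."""
    return math.floor(salary / unit + 0.5) * unit


# Made-up salaries: five distinct values collapse into three buckets.
salaries = [84_999, 85_000, 94_999, 123_456, 118_000]
smoothed = [smooth_salary(s) for s in salaries]
```

Collapsing the target into a small set of repeated values is what makes it usable as a class label for a classification model.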
Expanding the model set
‱ Random Forest is good but slow - try Decision Tree to see if we can get comparable results
‱ LDA - can use a classification model on the smoothed data
‱ KNN - why not?
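To make the KNN idea concrete, here is a small from-scratch sketch of k-nearest-neighbours classification on smoothed salary labels. It is not the implementation used in the project; the two-column feature rows, the labels, and the choice of Euclidean distance are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the most common label among the k training rows nearest to query."""
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common breaks ties in favour of the label seen first, i.e. the nearest.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical numeric features, with smoothed salaries as class labels.
train_rows = [(0, 1), (0, 2), (1, 1), (2, 5), (2, 6)]
train_labels = [80_000, 80_000, 90_000, 150_000, 150_000]

low = knn_predict(train_rows, train_labels, (0, 1.5), k=3)
high = knn_predict(train_rows, train_labels, (2, 5.5), k=3)
```

With k=1 each training row is typically its own nearest neighbour, so a perfect training score is expected by construction and says little about how the model generalizes.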
Reviewing the code
http://xkcd.com/221/
Model Results
‱ Linear Models (OLS, Ridge, Lasso)
  â–Ș Universally poor on the small data set
  â–Ș Of the Linear Models, Ridge was the best on the large data set
‱ Decision Tree/Random Forest
  â–Ș Overfitted on the small data set (great training score, poor test score)
  â–Ș Had the best results on the large data set (Random Forest)
‱ KNN
  â–Ș Comparable to the other non-linear models on the small and large data sets
‱ LDA
  â–Ș Needed rather large K to get good results
  â–Ș Had the best results on the small data set
Final Results
Dataset                    Model          Training Score  Testing Score
Pay Rate                   KNN            1.000           0.147
Estimated Salary           Decision Tree  1.000           0.342
Smoothed Pay Rate          LDA            0.585           0.286
Smoothed Estimated Salary  Random Forest  1.000           0.581
Reviewing the data
http://blog.mindjet.com/2011/12/drowning-from-information-overload/
How can we do better next time?
‱ More data!
  â–Ș Either more data points or expand the parameters of the model
‱ Keep playing with the shape of the data
  â–Ș Improve the ranges - 20k vs. 30k is more significant than 150k vs. 160k
‱ Improve the quality of the data
  â–Ș Verify the LinkedIn data, Job Seniority, etc.
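One way to act on the point about ranges is to make the buckets proportional instead of fixed at 10k, so that each bucket is a constant percentage wider than the last. The sketch below is only one possible approach; the 20k starting point and the 1.25 growth ratio are arbitrary assumptions.

```python
import math


def log_bin(salary, base=20_000, ratio=1.25):
    """Index of the geometric bucket containing salary.

    Bucket i covers [base * ratio**i, base * ratio**(i + 1)), so buckets
    get wider in absolute terms as salaries grow.
    """
    return math.floor(math.log(salary / base) / math.log(ratio))


# 20k and 30k land in different buckets; 150k and 160k share one.
bins = {s: log_bin(s) for s in (20_000, 30_000, 150_000, 160_000)}
```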
Additional Thoughts
Why I think a project like this is a marketing gimmick:
‱ Only recruiters post expected salary
‱ Too much variance in job titles and not enough in the job description
‱ Only provides base salary and ignores bonus and non-cash compensation
‱ Cannot handle deprecated skills or brand new skills
References
‱ projectSHERPA Homepage, http://projectsherpa.com/
‱ "Predict the salary of any UK job ad based on its contents", http://www.kaggle.com/c/job-salary-prediction
‱ "Predicting advertised salaries", http://fastml.com/predicting-advertised-salaries/
