Ds final project jwm

Job Salary Prediction
with Python
John Maiden
NYC Data Science Academy

How did I scrape the jobs?
Jobs were scraped from the following:
• Dice.com
• Monster.com
• Startupers.com
• Ventureloop.com
All credit goes to:
Craig Perler
CTO & Founder
projectSHERPA

Job Selection Process
• Iterative process based on keywords, text analysis using NLTK, and manual review of the jobs
• Went from 70k jobs to a final set of 585 based on the following criteria:
  • Keywords:
    ▪ (Job title contained: 'data science', 'data scientist', 'statistical', 'statistician' OR
    ▪ Job description contained: 'predictive modeling', 'data mining', 'text mining', 'machine learning', 'natural language processing') AND
    ▪ Job title did not contain: 'intern', 'internship'
  • Skills:
    ▪ Job had at least one tagged skill and did not contain the following: 'Salesforce', 'VBA', 'Sharepoint', 'Drupal'
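The selection criteria above can be sketched as a single predicate. This is a minimal illustration, not the code used for the project: the job record layout (a dict with `title`, `description`, and `skills` keys) and the function name `is_data_science_job` are assumptions, as is the choice to match 'intern' as a whole word and to apply the exclusion list to the tagged skills.

```python
import re

TITLE_KEYWORDS = ("data science", "data scientist", "statistical", "statistician")
DESCRIPTION_KEYWORDS = ("predictive modeling", "data mining", "text mining",
                        "machine learning", "natural language processing")
EXCLUDED_TITLE_WORDS = {"intern", "internship"}
EXCLUDED_SKILLS = {"salesforce", "vba", "sharepoint", "drupal"}


def is_data_science_job(job):
    """Return True if a job dict passes the keyword and skill criteria.

    `job` is a hypothetical record with 'title', 'description' and
    'skills' (a list of tagged skill names).
    """
    title = job["title"].lower()
    description = job["description"].lower()
    skills = {skill.lower() for skill in job["skills"]}

    # (title keyword OR description keyword)
    keyword_match = (any(k in title for k in TITLE_KEYWORDS)
                     or any(k in description for k in DESCRIPTION_KEYWORDS))

    # Assumption: compare whole words so "international" is not read as "intern".
    title_words = set(re.findall(r"[a-z]+", title))
    not_intern = not (title_words & EXCLUDED_TITLE_WORDS)

    # At least one tagged skill, and none of the excluded ones.
    skills_ok = bool(skills) and not (skills & EXCLUDED_SKILLS)

    return keyword_match and not_intern and skills_ok
```

Applied over the scraped records, a list comprehension such as `[job for job in jobs if is_data_science_job(job)]` would give the keyword-filtered set; the manual review step is not modelled here.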

Cleaning Up the Jobs Part 1
• Job data was good, but needed some further cleaning
  • Converted the posted pay rate text into a number
  • Assigned job seniority ('junior', 'default', 'senior') based on job title keywords
• Used python-linkedin to pull in more company data
  • Name, Description, Industry, Company Size, Company Type, Specialities
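The two cleaning steps on the job record itself might look roughly like the sketch below. The slides do not show the pay text formats or the seniority keywords, so everything specific here is an assumption made for illustration: the example formats, the 2080-hour working year, the keyword sets, and the function names `parse_pay_rate` and `job_seniority`.

```python
import re

HOURS_PER_YEAR = 2080  # assumption: 40 hours x 52 weeks

# Hypothetical keyword sets; the slides do not list the ones actually used.
SENIOR_WORDS = {"senior", "sr", "lead", "principal", "director"}
JUNIOR_WORDS = {"junior", "jr", "entry", "associate"}


def parse_pay_rate(text):
    """Turn posted pay text into an annual number, or None if no number is found."""
    cleaned = text.lower().replace(",", "")
    amounts = []
    for number, suffix in re.findall(r"(\d+(?:\.\d+)?)\s*(k\b)?", cleaned):
        value = float(number)
        if suffix:  # "100k" means 100,000
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None
    rate = sum(amounts) / len(amounts)  # a range collapses to its midpoint
    if "hour" in cleaned or "/hr" in cleaned:
        rate *= HOURS_PER_YEAR  # convert an hourly rate to an annual figure
    return rate


def job_seniority(title):
    """Map a job title to 'junior', 'default' or 'senior' by keyword."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_WORDS:
        return "senior"
    if words & JUNIOR_WORDS:
        return "junior"
    return "default"
```

For example, `parse_pay_rate("100k - 120k")` takes the midpoint of the range, and `job_seniority("Sr. Data Scientist")` picks up the abbreviated keyword.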

Cleaning Up the Jobs Part 2
• Added more data columns
  • Job Posted Year, Job Posted Month, Number of words in Job Description, Number of characters in Job Description
• Converted text fields to numeric values
  • Job Seniority, Employee Count, Company Type
• Converted Company Industry and Specialities to binary valued columns
  • Used only the following specialities: 'Big Data', 'Analytics', 'Machine Learning', 'analytics', 'Data Science', 'Big Data Analytics', 'Natural Language Processing', 'Predictive Analytics', 'Data Mining'
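As a rough illustration of these columns, the sketch below builds a feature dict for one job. The record layout, the `spec_` column prefix, the numeric seniority codes, and the name `build_features` are assumptions for the example; the speciality list is the one given above, kept case-sensitive because it contains both 'Analytics' and 'analytics'.

```python
from datetime import date

SPECIALITIES = [
    "Big Data", "Analytics", "Machine Learning", "analytics", "Data Science",
    "Big Data Analytics", "Natural Language Processing",
    "Predictive Analytics", "Data Mining",
]

# Assumed numeric coding; the slides only say the field was made numeric.
SENIORITY_CODES = {"junior": 0, "default": 1, "senior": 2}


def build_features(job):
    """Build the extra columns for one job (a hypothetical dict record)."""
    description = job["description"]
    features = {
        "posted_year": job["posted"].year,
        "posted_month": job["posted"].month,
        "description_words": len(description.split()),
        "description_chars": len(description),
        "seniority": SENIORITY_CODES[job["seniority"]],
    }
    # One binary column per speciality: 1 if the company lists it, else 0.
    company_specialities = set(job["specialities"])
    for speciality in SPECIALITIES:
        features["spec_" + speciality] = int(speciality in company_specialities)
    return features


example_job = {
    "description": "Build machine learning models",
    "posted": date(2014, 3, 15),  # made-up date for illustration
    "seniority": "senior",
    "specialities": ["Big Data", "Data Mining"],
}
example_features = build_features(example_job)
```

Company Industry could be expanded into binary columns in the same way, given a fixed list of industries.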

Modeling Time!
Problem
• Only 67 jobs (out of 585) have posted salaries!
Solution

What do we do?
Test using two different datasets
• Pay Rate - job salaries as provided by the poster (67 records)
• Estimated Salary - job salaries as provided by a separate model (584 records)

Model Selection
• Most of the competitors in the Kaggle competition used Random Forest.
• Zygmunt Zając made a suggestion:

Initial Results
Tested with linear models and Random Forest
Dataset            Model                        Training Score   Testing Score
Pay Rate           Ordinary Linear Regression   0.795            -32,120,852.970
Pay Rate           Ridge Regression             0.669            -0.079
Pay Rate           Lasso Regression             -0.009           -0.021
Pay Rate           Random Forest                1.000            0.147
Estimated Salary   Ordinary Linear Regression   0.430            -330,266.154
Estimated Salary   Ridge Regression             0.361            0.163
Estimated Salary   Lasso Regression             0.031            -0.051
Can we do better???
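The slides do not name the scoring metric. Assuming the scores are R² (the coefficient of determination), a short sketch shows why a testing score can be hugely negative: R² is 1 for a perfect fit, 0 for always predicting the mean, and unbounded below when a prediction is far off. The salary values here are invented for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up salaries, not taken from the project data.
actual = [90_000, 110_000, 130_000]

perfect_score = r_squared(actual, actual)                  # exact predictions
mean_score = r_squared(actual, [110_000] * 3)              # always predict the mean
wild_score = r_squared(actual, [110_000, 110_000, 5_000_000])  # one extreme miss
```

A single extreme prediction, as an unregularized linear model can produce on a tiny test set, is enough to push the score into the negative thousands.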

Changing the shape of the data
[Figure panels: Original Data, Log Data, Sqrt Data]
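A minimal sketch of the two reshaping options, assuming they are applied to the salary target and inverted after prediction. The helper names and the sample salaries are made up for the example.

```python
import math


def to_log(salary):
    return math.log(salary)


def from_log(value):
    return math.exp(value)


def to_sqrt(salary):
    return math.sqrt(salary)


def from_sqrt(value):
    return value ** 2


# Made-up salaries where each one is double the previous.
salaries = [40_000, 80_000, 160_000]
log_salaries = [to_log(s) for s in salaries]
sqrt_salaries = [to_sqrt(s) for s in salaries]

# On the log scale, each doubling is the same step size (log 2).
log_steps = [b - a for a, b in zip(log_salaries, log_salaries[1:])]
```

A model trained on the transformed target predicts in transformed units, so predictions need `from_log` or `from_sqrt` before being compared with real salaries.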

Smoothing out the data
Our salary data is too granular - let's round to units of 10k
[Figure panels: Original Data, Smoothed Data]
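Rounding to units of 10k could be done along these lines. The function name `smooth_salary` is hypothetical, and rounding to the nearest unit with halves going up is an assumption; the slides do not say whether values were rounded or truncated.

```python
import math


def smooth_salary(salary, unit=10_000):
    """Round a salary to the nearest multiple of `unit` (halves round up)."""
    # floor(x + 0.5) avoids the round-half-to-even behaviour of built-in round().
    return math.floor(salary / unit + 0.5) * unit


# Made-up salaries for illustration.
raw = [123_456, 85_000, 84_999, 67_200]
smoothed = [smooth_salary(s) for s in raw]
```

After smoothing, the target takes only a small number of distinct values, which is what makes it possible to treat the problem as classification.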

Expanding the model set
• Random Forest is good but slow - try Decision Tree to see if we can get comparable results
• LDA - can use a classification model on the smoothed data
• KNN - why not?
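For readers unfamiliar with KNN, here is a small pure-Python sketch of the idea: predict a salary as the average of the k closest training jobs. It is not the code used in the project, which is not shown in the slides; the function name, the two-number feature rows, and the salaries are all made up.

```python
import math


def knn_predict(train_rows, train_targets, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    distances = sorted(
        (math.dist(row, query), target)
        for row, target in zip(train_rows, train_targets)
    )
    nearest = [target for _, target in distances[:k]]
    return sum(nearest) / len(nearest)


# Hypothetical features: (seniority code, description length in hundreds of words)
train_rows = [(0, 1.0), (1, 1.5), (2, 2.0), (2, 2.5)]
train_targets = [60_000, 90_000, 130_000, 150_000]

prediction = knn_predict(train_rows, train_targets, query=(2, 2.2), k=2)
```

With k=1 every training row is its own nearest neighbour, so the fit on the training data is perfect regardless of how well the model generalizes.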

Reviewing the code
http://xkcd.com/221/

Model Results
• Linear Models (OLS, Ridge, Lasso)
  • Universally poor on the small data set
  • Of the Linear Models, Ridge was the best on the large data set
• Decision Tree/Random Forest
  • Overfitted on the small data set (great training score, poor test score)
  • Had the best results on the large data set (Random Forest)
• KNN
  • Comparable to the models outside the Linear Models group on the small and large data sets
• LDA
  • Needed rather large K to get good results
  • Had the best results on the small data set

Final Results
KNN 1.000 0.147
Decision Tree 1.000 0.342
Pay Rate Estimated Salary
LDA 0.585 0.286
Smoothed Pay Rate Smoothed Estimated Salary

Reviewing the data
http://blog.mindjet.com/2011/12/drowning-from-information-overload/

How can we do better next time?
• More data!
  • Either more data points or expand the parameters of the model
• Keep playing with the shape of the data
  • Improve the ranges - 20k vs. 30k is more significant than 150k vs. 160k
• Improve the quality of the data
  • Verify the LinkedIn data, Job Seniority, etc.
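The point about ranges can be made concrete with relative differences. The sketch below compares the two 10k gaps and builds bucket edges that grow by a constant ratio instead of a constant amount, which is one possible way to "improve the ranges". Both helper names and the 20k to 160k span are assumptions for illustration.

```python
def relative_gap(low, high):
    """Size of the step from `low` to `high` as a fraction of `low`."""
    return (high - low) / low


def log_spaced_edges(low, high, steps):
    """Bucket edges from `low` to `high` that grow by a constant ratio."""
    ratio = (high / low) ** (1 / steps)
    return [round(low * ratio ** i) for i in range(steps + 1)]


small_salary_gap = relative_gap(20_000, 30_000)    # a 50% raise
large_salary_gap = relative_gap(150_000, 160_000)  # under a 7% raise

edges = log_spaced_edges(20_000, 160_000, steps=3)
```

With ratio-based edges, the buckets are narrow at low salaries and wide at high ones, so a 10k difference matters more where it is proportionally larger.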

Additional Thoughts
Why I think a project like this is a marketing gimmick:
• Only recruiters post expected salary
• Too much variance in job titles and not enough in the job description
• Only provides base salary and ignores bonus and non-cash compensation
• Cannot handle deprecated skills or brand new skills

References
• projectSHERPA Homepage, http://projectsherpa.com/
• "Predict the salary of any UK job ad based on its contents", http://www.kaggle.com/c/job-salary-prediction
• "Predicting advertised salaries", http://fastml.com/predicting-advertised-salaries/
