3. How did I scrape the jobs?
Jobs were scraped from the following:
• Dice.com
• Monster.com
• Startupers.com
• Ventureloop.com
All credit goes to: Craig Perler, CTO & Founder, projectSHERPA
4. Job Selection Process
• Iterative process based on keywords, text analysis using NLTK, and manually reviewing the jobs
• Went from 70k jobs to final set of 585 based on the following criteria:
  ◦ Keywords:
    ▪ (Job title contained: 'data science', 'data scientist', 'statistical', 'statistician' OR
    ▪ Job description contained: 'predictive modeling', 'data mining', 'text mining', 'machine learning', 'natural language processing') AND
    ▪ Job title did not contain: 'intern', 'internship'
  ◦ Skills
    ▪ Job had at least one tagged skill and did not contain the following: 'Salesforce', 'VBA', 'Sharepoint', 'Drupal'
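
The criteria above can be sketched as a single predicate. This is only an illustration, not the code used for the project: the function and constant names are hypothetical, matching is assumed to be case-insensitive, and 'intern' is assumed to be matched as a whole word so that a title such as "International Data Scientist" is not dropped.

```python
import re

TITLE_TERMS = ("data science", "data scientist", "statistical", "statistician")
DESCRIPTION_TERMS = ("predictive modeling", "data mining", "text mining",
                     "machine learning", "natural language processing")
# Assumption: 'intern' and 'internship' are matched as whole words only.
EXCLUDED_TITLE_WORDS = re.compile(r"\b(intern|internship)\b", re.IGNORECASE)
EXCLUDED_SKILLS = {"salesforce", "vba", "sharepoint", "drupal"}


def contains_any(text, terms):
    """True if any term appears in text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def keep_job(title, description, skills):
    """Apply the keyword and skill criteria to one job posting."""
    keyword_match = (contains_any(title, TITLE_TERMS)
                     or contains_any(description, DESCRIPTION_TERMS))
    not_intern = EXCLUDED_TITLE_WORDS.search(title) is None
    has_skills = len(skills) > 0
    no_excluded_skill = not any(s.lower() in EXCLUDED_SKILLS for s in skills)
    return keyword_match and not_intern and has_skills and no_excluded_skill
```

A call such as keep_job("Data Scientist", "", ["Python"]) keeps the job, while the same title with "Intern" appended, or with 'VBA' among the skills, is filtered out.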
5. Cleaning Up the Jobs Part 1
• Job data was good, but needed some further cleaning
  ◦ Converted the posted pay rate text into a number
  ◦ Assigned job seniority ('junior', 'default', 'senior') based on job title keywords
• Used python-linkedin to pull in more company data
  ◦ Name, Description, Industry, Company Size, Company Type, Specialities
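
The two cleaning steps might look roughly like the sketch below. The slides do not show the pay rate formats or the title keywords that were used, so everything here is an assumption: the keyword lists are hypothetical, a range is assumed to be averaged, and a trailing "k" is assumed to mean thousands.

```python
import re

# Assumption: amounts look like "120,000", "90k" or "55.50".
NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(k\b)?", re.IGNORECASE)

# Hypothetical keyword lists; the slides do not list the real ones.
SENIOR_WORDS = {"senior", "sr", "lead", "principal", "director"}
JUNIOR_WORDS = {"junior", "jr", "entry", "associate"}


def parse_pay_rate(text):
    """Return the mean of all amounts found in text, or None if there are none."""
    values = []
    for digits, k_suffix in NUMBER.findall(text):
        value = float(digits.replace(",", ""))
        if k_suffix:
            value *= 1000
        values.append(value)
    if not values:
        return None
    return sum(values) / len(values)


def job_seniority(title):
    """Classify a job title as 'senior', 'junior' or 'default'."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    if words & SENIOR_WORDS:
        return "senior"
    if words & JUNIOR_WORDS:
        return "junior"
    return "default"
```

Under these assumptions a range such as "$120,000 - $140,000" becomes its midpoint, and a title with no matching keyword falls back to 'default'.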
6. Cleaning Up the Jobs Part 2
• Added more data columns
  ◦ Job Posted Year, Job Posted Month, Number of words in Job Description, Number of characters in Job Description
• Converted text fields to numeric values
  ◦ Job Seniority, Employee Count, Company Type
• Converted Company Industry and Specialities to binary valued columns
  ◦ Used only the following specialities: 'Big Data', 'Analytics', 'Machine Learning', 'analytics', 'Data Science', 'Big Data Analytics', 'Natural Language Processing', 'Predictive Analytics', 'Data Mining'
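
A minimal sketch of these feature steps for a single job follows. The column names, the numeric codes for seniority, and the sample values are all assumptions made for illustration; only the list of specialities comes from the slide.

```python
from datetime import date

SPECIALITIES = ["Big Data", "Analytics", "Machine Learning", "analytics",
                "Data Science", "Big Data Analytics",
                "Natural Language Processing", "Predictive Analytics",
                "Data Mining"]

# Assumption: seniority is encoded as an ordered integer.
SENIORITY_CODES = {"junior": 0, "default": 1, "senior": 2}


def build_features(posted, description, seniority, specialities):
    """Turn one job's raw fields into a flat dict of numeric features."""
    features = {
        "posted_year": posted.year,
        "posted_month": posted.month,
        "description_words": len(description.split()),
        "description_chars": len(description),
        "seniority": SENIORITY_CODES[seniority],
    }
    # One binary column per speciality: 1 if the company lists it, else 0.
    for name in SPECIALITIES:
        features["speciality_" + name] = int(name in specialities)
    return features


# Made-up example values, not data from the project.
row = build_features(date(2013, 5, 14), "Build predictive models",
                     "senior", ["Big Data", "Data Mining"])
```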
8. What do we do?
Test using two different datasets:
  ◦ Pay Rate - job salaries as provided by the poster (67 records)
  ◦ Estimated Salary - job salaries as provided by a separate model (584 records)
9. Model Selection
• Most of the competitors in the Kaggle competition used Random Forest.
• Zygmunt Zając made a suggestion:
10. Initial Results
Tested with linear models and Random Forest

Pay Rate
Model                        Training Score   Testing Score
Ordinary Linear Regression   0.795            -32,120,852.970
Ridge Regression             0.669            -0.079
Lasso Regression             -0.009           -0.021
Random Forest                1.000            0.147

Estimated Salary
Model                        Training Score   Testing Score
Ordinary Linear Regression   0.430            -330,266.154
Ridge Regression             0.361            0.163
Lasso Regression             0.031            -0.051
Random Forest                1.000            0.325

Can we do better???
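
The slides do not name the metric behind these scores. If they are coefficient of determination (R²) values, which is an assumption, then a hugely negative testing score simply means the predictions on held-out data are far worse than always predicting the mean salary. A small sketch with made-up numbers shows how that can happen:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up salaries (in thousands), not data from the slides.
actual = [10, 20, 30]
perfect = r_squared(actual, [10, 20, 30])       # predictions match exactly
mean_only = r_squared(actual, [20, 20, 20])     # always predict the mean
wild = r_squared(actual, [1000, -500, 2000])    # unstable extrapolation
```

Here perfect is 1.0, mean_only is 0.0, and wild is a large negative number, the same pattern an unregularized linear model can show when it has many columns and few records.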
12. Smoothing out the data
Our salary data is too granular - let's round to units of 10k
[Figure: Original Data vs. Smoothed Data]
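
Rounding to units of 10k could be done as in the sketch below. The function name is hypothetical, and rounding halves upward is an assumption, since the slides do not say how ties were handled.

```python
def smooth_salary(salary, unit=10_000):
    """Round a non-negative salary to the nearest multiple of unit (halves round up)."""
    return (salary + unit // 2) // unit * unit


# Made-up salaries, not data from the project.
smoothed = [smooth_salary(s) for s in (84_500, 85_000, 123_456)]
```

After smoothing, each salary is one of a small set of values, which is what makes a classification model an option later on.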
13. Expanding the model set
• Random Forest is good but slow - try Decision Tree to see if we can get comparable results
• LDA - can use a classification model on the smoothed data
• KNN - why not?
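
For readers unfamiliar with KNN, here is a bare-bones regression version in plain Python. It is a sketch of the general technique, not the implementation used in the project; the slides do not state the library, the distance measure, or the value of k, so Euclidean distance and the sample data are assumptions.

```python
import math


def knn_predict(train_X, train_y, query, k):
    """Predict the mean target of the k training points closest to query."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Made-up feature vectors and salaries, for illustration only.
train_X = [(0, 0), (1, 0), (5, 5)]
train_y = [80_000, 90_000, 150_000]
same_point = knn_predict(train_X, train_y, (0, 0), k=1)
two_nearest = knn_predict(train_X, train_y, (0, 0), k=2)
```

With k=1, a training point is its own nearest neighbour, so the model reproduces the training targets exactly. That is one way a KNN model can reach a perfect training score while scoring much lower on test data.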
15. Model Results
• Linear Models (OLS, Ridge, Lasso)
  ◦ Universally poor on the small data set
  ◦ Of the Linear Models, Ridge was the best on the large data set
• Decision Tree/Random Forest
  ◦ Overfitted on the small data set (great training score, poor test score)
  ◦ Had the best results on the large data set (Random Forest)
• KNN
  ◦ Comparable to the non-"Linear Models" on the small and large data sets
• LDA
  ◦ Needed rather large K to get good results
  ◦ Had the best results on the small data set
16. Final Results
Dataset                     Model           Training Score   Testing Score
Pay Rate                    KNN             1.000            0.147
Estimated Salary            Decision Tree   1.000            0.342
Smoothed Pay Rate           LDA             0.585            0.286
Smoothed Estimated Salary   Random Forest   1.000            0.581
18. How can we do better next time?
• More data!
  ◦ Either more data points or expand the parameters of the model
• Keep playing with the shape of the data
  ◦ Improve the ranges - 20k vs. 30k is more significant than 150k vs. 160k
• Improve the quality of the data
  ◦ Verify the LinkedIn data, Job Seniority, etc.
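
The point about ranges can be made concrete: a 10k step is a 50% raise at 20k but under 7% at 150k. One possible way to act on it, offered here only as a sketch, is to use bins whose width grows with salary. The function names and the 1.25 ratio are hypothetical choices, not something from the project.

```python
def relative_gap(low, high):
    """Size of the step from low to high as a fraction of low."""
    return (high - low) / low


def geometric_edges(start, stop, ratio):
    """Bin edges that grow by a constant ratio until they reach stop."""
    edges = [start]
    while edges[-1] < stop:
        edges.append(edges[-1] * ratio)
    return edges


low_end = relative_gap(20_000, 30_000)     # 0.5
high_end = relative_gap(150_000, 160_000)  # about 0.067

# Assumption: each bin is 25% wider than the one before it.
edges = geometric_edges(20_000, 40_000, 1.25)
```

With edges like these, every bin covers the same relative pay difference, so low salaries get finer resolution than high ones.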
19. Additional Thoughts
Why I think a project like this is a marketing gimmick:
• Only recruiters post expected salary
• Too much variance in job titles and not enough in the job description
• Only provides base salary and ignores bonus and non-cash compensation
• Cannot handle deprecated skills or brand new skills