http://bit.ly/wia-2019
Using Embeddings to Understand the
Variance and Evolution of
Data Science Skill Sets
Maryam Jahanshahi Ph.D.
Research Scientist
TapRecruit.co
Jobs are hard to categorize
Research Analyst
MMP - New York City
We have an exciting opportunity for an individual
to join MMP’s Cyber Risk Group. The successful
candidate will have the ability to shape our
investors service strategy, analytic, research and
outreach framework for cyber risk and its
relationship to credit and the financial markets…
Research Analyst
DRG - New York City
We are seeking a Research Analyst to join our
team which serves our pharmaceutical clients
with actionable data. As a Research Analyst, you
will provide data support for client-facing
platforms, presentations, and client requests…
Entry-Level
Senior
Research at TapRecruit
Helping companies make fairer and more efficient recruiting decisions
NLP and Data Science / Decision Science:
- What are distinguishing characteristics of successful career documents?
- What skills are increasingly important for different industries?
- How do candidates make decisions about which jobs to apply to?
- How do hiring teams make decisions about candidate qualifications?
How have data science skills
changed over time?
Strategies to identify changes in texts
Manual Feature Extraction
Requires selecting key attributes in advance, which makes it
difficult to discover new attributes
MBA
PhD
SQL
Tableau
PowerBI
Python
Adapted from Blei and Lafferty, ICML 2006.
[Figure: top topic words by period]
1880: force, energy, motion, differ, light
1920: atom, theory, electron, energy, measure
1960: radiat, energy, electron, measure, ray
2000: state, energy, electron, magnet, field
Figure labels: Matter, Electron, Quantum
Dynamic Topic Models
Requires experimentation with
the number of topics
Traditional approaches do not capture syntactic and semantic shifts
Embeddings use context to extract meaning
Proficiency programming in Python, Java or C++.
Experience in Python, Java or other object-oriented programming languages
Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
[Figure labels: each example marks a target word and its surrounding context]
Window sizes capture semantic similarity vs semantic relatedness
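A minimal sketch of how the window size changes what counts as context. The helper name `context_windows` and the whitespace tokenization are assumptions for illustration and are not taken from the talk. Narrow windows tend to pick up words that play a similar role to the target, while wider windows pull in words that are only topically related.

```python
def context_windows(tokens, target, window):
    """Return the context words within `window` positions of each occurrence of `target`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(left + right)
    return contexts


# Example sentence from the slide, lowercased and split on whitespace.
sentence = "experience in python java or other object-oriented programming languages"
tokens = sentence.split()

narrow = context_windows(tokens, "python", 1)  # immediate neighbours only
wide = context_windows(tokens, "python", 5)    # most of the sentence
print(narrow)
print(wide)
```

With a window of 1 the context holds only the adjacent words; with a window of 5 it covers most of the requirement line.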
A simplified representation of word vectors
[Figure: axis labels Vocabulary, Tokens in corpus (millions or billions) and Dimensions (50-300 d); arrows labeled Dimension Reduction]
Example words: French, German, Japanese, Esperanto, Language
Example words: Python, Programming, C++, Java, Object-oriented
Dimension reduction is key to all types of embeddings models
Embeddings capture entity relationships
Adapted from Stanford NLP GloVe Project
Hierarchies: Verizon / McAdam, Vodafone / Colao, Viacom / Dauman, Exxon / Tillerson, Wal-Mart / McMillon
Comparatives and Superlatives: slow / slower / slowest, short / shorter / shortest, strong / stronger / strongest
Man
Woman
Queen
King
Man :: King as Woman :: ?
Dimensionality enables comparison between word pairs along many axes
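The analogy on this slide is usually answered with vector arithmetic: take King minus Man plus Woman and look for the nearest remaining word. The sketch below uses hand-made 3-d toy vectors (an assumption for illustration; real embeddings have 50-300 dimensions and are learned from a corpus), and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Answer 'a :: b as c :: ?' with the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented toy vectors; the axes loosely stand for (gender, royalty, other).
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "king": [1.0, 1.0, 0.1],
    "queen": [-1.0, 1.0, 0.1],
    "apple": [0.0, 0.0, 1.0],
}

answer = analogy(toy, "man", "king", "woman")
print(answer)
```

The three query words are excluded from the candidates, which is the usual convention when evaluating analogies.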
Man
Woman
Queen
King
Man :: King as Woman :: ?
Embeddings reflect cultural bias in corpora
Adapted from Bolukbasi et al., arXiv:1607.06520.
High dimensionality enables some bias reduction
Man
Woman
Homemaker
Computer Programmer
Man :: Programmer as Woman :: ?
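One way to picture bias reduction is to estimate a gender direction from a word pair and remove that component from words that should be neutral. The sketch below is a simplified illustration in the spirit of the neutralizing step discussed by Bolukbasi et al.; the toy vectors and function names are assumptions, and the published method estimates the direction from several word pairs rather than one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(a):
    """Scale a vector to length 1."""
    length = math.sqrt(dot(a, a))
    return [x / length for x in a]


def neutralize(vector, direction):
    """Remove the component of `vector` that lies along the unit vector `direction`."""
    scale = dot(vector, direction)
    return [v - scale * d for v, d in zip(vector, direction)]


# Invented toy vectors; the first axis carries the gender signal.
toy = {
    "man": [1.0, 0.2, 0.0],
    "woman": [-1.0, 0.2, 0.0],
    "programmer": [0.4, 0.1, 0.9],
    "homemaker": [-0.5, 0.1, 0.8],
}

gender_direction = unit([m - w for m, w in zip(toy["man"], toy["woman"])])
before = dot(toy["programmer"], gender_direction)
debiased = neutralize(toy["programmer"], gender_direction)
after = dot(debiased, gender_direction)
print(before, after)
```

After neutralizing, the word has no component along the estimated direction, while its other coordinates are unchanged.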
Pretrained embeddings facilitate fast prototyping
Corpus            Twitter     Common Crawl   GoogleNews   Wikipedia
Tokens            27 B        42-840 B       100 B        6 B
Vocabulary Size   1.2 M       1.9-2.2 M      3 M          400 k
Algorithm         GloVe       GloVe          word2vec     GloVe
Vector Length     25-200 d    300 d          300 d        50-300 d

Corpus Generation -> Corpus Processing -> Language Model Generation -> Language Model Tuning -> Final Application
The corpus used to train embeddings should match the corpus they are tested on
Problems with pretrained embedding models
Casing
Abbreviations vs Words
e.g. IT vs it
Out of Vocabulary Words
Domain-specific words & acronyms
Polysemy
Words with multiple meanings
e.g. drive (a car) vs drive (results)
e.g. Chef (the job) vs Chef (the language)
Multi-word Expressions
Phrases that have new meanings
e.g. Front-end vs front + end
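The casing, multi-word and out-of-vocabulary problems above can be made concrete with a small preprocessing sketch. Everything here is hypothetical: the function names, the phrase list, the acronym list and the tiny stand-in for a pretrained vocabulary are assumptions for illustration, not the tooling used in the talk.

```python
def merge_phrases(tokens, phrases):
    """Greedily join known two-word expressions into single tokens (front end -> front_end)."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def normalize_case(tokens, acronyms):
    """Lowercase every token except those listed as case-sensitive acronyms."""
    return [t if t in acronyms else t.lower() for t in tokens]


def oov_rate(tokens, vocabulary):
    """Share of tokens that are missing from `vocabulary`."""
    missing = [t for t in tokens if t not in vocabulary]
    return len(missing) / len(tokens)


raw = "Senior Front End developer for IT team".split()
processed = normalize_case(merge_phrases(raw, {("front", "end")}), {"IT"})

# Hypothetical lowercase pretrained vocabulary with no phrase tokens.
pretrained_vocab = {"senior", "front", "end", "developer", "for", "it", "team"}
rate = oov_rate(processed, pretrained_vocab)
print(processed, rate)
```

The tokens that carry the domain meaning (`front_end`, `IT`) are exactly the ones a generic pretrained vocabulary lacks, which is one motivation for training a custom model.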
Custom language model tools
Modularized for different data and modeling requirements
CoreNLP, SyntaxNet
Corpus Processing
Tokenization, POS tagging, Sentence Segmentation, Dependency Parsing
Language Modeling
Different word embedding models (GloVe, word2vec, fastText)
Career language embedding model
Identified equal opportunity and perks language
Career language embedding model
Identified 'soft' skills and language around experience
I’ve got 300 dimensions…
but time ain’t one
Two approaches to connect embeddings
Dynamic embeddings trained together (2015, 2016, 2017, 2018)
Data efficient
Does not require alignment
Bamler and Mandt, arXiv:1702.08359
Yao, Sun, Ding, Rao and Xiong, arXiv:1703.00607
Rudolph and Blei, arXiv:1703.08052

Static embeddings stitched together (2015, 2016, 2017, 2018)
Data hungry
Requires alignment
Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515
Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315
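Why static embeddings need alignment: models trained separately per year can end up in arbitrarily rotated coordinate systems, so vectors from different years are not directly comparable. The sketch below is a deliberately small 2-d illustration that finds the least-squares rotation mapping one year onto another. It is an assumption-laden toy (invented vectors, hypothetical function names); in practice an orthogonal Procrustes solution over the full embedding dimensionality is a common choice.

```python
import math


def best_rotation(source, target):
    """Angle of the 2-d rotation that best maps `source` points onto `target` (least squares)."""
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)


def rotate(point, angle):
    """Rotate a 2-d point counter-clockwise by `angle` radians."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))


# Invented toy embeddings: 2018 is the 2017 space rotated by 90 degrees.
emb_2017 = {"sql": (1.0, 0.0), "python": (0.0, 1.0), "excel": (1.0, 1.0)}
emb_2018 = {"sql": (0.0, 1.0), "python": (-1.0, 0.0), "excel": (-1.0, 1.0)}

words = sorted(emb_2017)
angle = best_rotation([emb_2018[w] for w in words], [emb_2017[w] for w in words])
aligned_2018 = {w: rotate(emb_2018[w], angle) for w in words}
print(math.degrees(angle), aligned_2018)
```

Only after this step do distances between a word's 2017 and 2018 vectors say anything about usage change rather than about the arbitrary orientation of each model.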

Dynamic Bernoulli embeddings
Repository Link: http://bit.ly/dyn_bern_emb
Outputs facilitate quick analysis of trends
Absolute drift
Identifies the top words whose usage
changes over the time course
Embedding neighborhoods
Extract semantic changes from the nearest
neighbors of drifting words
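The two outputs above can be sketched as follows, assuming absolute drift is the Euclidean distance between a word's vectors in the first and last time slice and that neighborhoods are taken within a single slice. The vectors, word names and function names are invented placeholders, not outputs from the talk or from the linked repository.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def absolute_drift(slices):
    """Distance each word's vector travels between the first and last time slice."""
    first, last = slices[0], slices[-1]
    return {w: euclidean(first[w], last[w]) for w in first if w in last}


def neighbors(embeddings, word, k=2):
    """The k words closest to `word` within one time slice."""
    others = [w for w in embeddings if w != word]
    return sorted(others, key=lambda w: euclidean(embeddings[w], embeddings[word]))[:k]


# Invented placeholder embeddings for two time slices.
slice_2016 = {"skill_x": (0.0, 0.0), "old_a": (0.1, 0.0), "old_b": (0.0, 0.2),
              "new_a": (1.0, 1.0), "new_b": (1.0, 1.2)}
slice_2018 = {"skill_x": (0.9, 1.0), "old_a": (0.1, 0.0), "old_b": (0.0, 0.2),
              "new_a": (1.0, 1.0), "new_b": (1.0, 1.2)}

drift = absolute_drift([slice_2016, slice_2018])
top_word = max(drift, key=drift.get)
before = neighbors(slice_2016, top_word)
after = neighbors(slice_2018, top_word)
print(top_word, before, after)
```

Ranking words by drift surfaces candidates; comparing their neighbor lists across slices shows how the usage changed.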
Experiments with dynamic embeddings
                       Small Corpus     Large Corpus
Job Types              All              All
Time Slices            3 (2016-2018)    3 (2016-2018)
Number of Documents    50 k             500 k
Vocabulary Size        10 k             10 k
Data Preprocessing     Basic            Basic
Embedding Dimensions   100 d            100 d
Dynamic embeddings
Small corpus identified gains and losses
Demand for PhDs and MBAs is Falling
MBAs in All Jobs: -35%
PhDs in All Jobs: -23%
PhDs in DS Jobs: -30%
Data Science skills showing significant shifts
Tableau: +20%
PowerBI: +100%
Spark: Steady
Hadoop: -30%
Python: Steady
Perl: -40%
Blue boxes indicate phrases identified from top drifting words analysis. Grey boxes indicate ‘control’ skills.
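The slides do not say how these percentages were computed. One plausible reading, shown here purely as an assumption, is the relative change in the share of job descriptions that mention a term between the first and last year. The documents and the term `skill_x` are invented placeholders.

```python
def mention_share(documents, term):
    """Fraction of documents that contain `term` as a lowercase whitespace token."""
    hits = sum(1 for doc in documents if term in doc.lower().split())
    return hits / len(documents)


def relative_change(old_share, new_share):
    """Percentage change from `old_share` to `new_share`."""
    return (new_share - old_share) / old_share * 100.0


# Invented placeholder job descriptions, five per year.
docs_2016 = ["skill_x and sql", "skill_x reporting", "python sql", "excel", "python modeling"]
docs_2018 = ["skill_x sql", "python sql", "python modeling", "excel sql", "reporting"]

share_2016 = mention_share(docs_2016, "skill_x")
share_2018 = mention_share(docs_2018, "skill_x")
change = relative_change(share_2016, share_2018)
print(share_2016, share_2018, change)
```

Under this reading, a term mentioned in 40% of postings in one year and 20% in a later year shows a 50% decline.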
Dynamic embeddings
Large corpus identified role-type dependent shifts in requirements
FP&A Roles: +70%
FinTech Roles: Steady
Sales Roles: Steady
BizDev Roles: +50%
Marketing Roles: Steady
HR Roles: +25%
[Chart: y-axis 0% to 6%; x-axis 2016, 2017, 2018]
No change to SQL demand
SQL requirement increases in specific functions
Blue boxes indicate phrases identified from top drifting words analysis. Grey boxes indicate ‘control’ skills.
How have data science skills changed over time?
- Flavors of static word embeddings: The Corpus Issue
- Considerations for developing custom embedding models
- Flavors of dynamic models: Dynamic Bernoulli embeddings
http://bit.ly/wia-2019
Thank you Women in Analytics!
Maryam Jahanshahi Ph.D.
Research Scientist
@mjahanshahi
LinkedIn: maryam-j
