Using Embeddings to Understand the Variance and Evolution of Data Science... - Maryam Jahanshahi

In this talk I will discuss exponential family embeddings, which are methods that extend the idea behind word embeddings to other data types. I will describe how we used dynamic embeddings to understand how data science skill-sets have transformed over the last 3 years using our large corpus of jobs. The key takeaway is that these models can enrich analysis of specialized datasets.

Published in: Technology

  1. Using Embeddings to Understand the Variance and Evolution of Data Science Skill Sets
     Maryam Jahanshahi Ph.D., Research Scientist, TapRecruit
     http://bit.ly/pydatanyc-emb
  2. Using Embeddings to Understand the Variance and Evolution of Data Science Skill Sets
     Maryam Jahanshahi Ph.D., Research Scientist, TapRecruit
     http://bit.ly/pydatanyc-emb
  3. TapRecruit uses NLP to understand and organize natural language career content
     - Smart Editor for Job Descriptions
     - Active Pipeline Health Monitoring
     - Multifaceted Salary Estimation
  4. Language matters in job descriptions
     Same title, different job:
     - Finance Manager, Kraft Foods: Junior (3 years); no managerial experience
     - Finance Manager, Kraft Foods: Senior (6-8 years); division-level controller; strategic finance role; MBA / CPA required
     - Annotation labels: Required Experience, Required Responsibility, Preferred Skill, Required Education
     Different title, same job:
     - Performance Marketing Manager, PocketGems: mid-level; quantitative focus; expertise iBanking; data analysis tools (SQL); consulting experience preferred; MBA preferred
     - Senior Analyst, Customer Strategy, The Gap: mid-level; quantitative focus; expertise iBanking; relational database experience; external consulting experience preferred; BA degree in business, finance; MBA preferred
     - Annotation labels: Required Experience, Required Skills, Preferred Experience, Required and Preferred Education
  5. How have data science skills changed over time?
  6. How have data science skills changed over time?
     Manual Feature Extraction: MBA, PhD, SQL, Tableau, PowerBI, Python
     Dynamic Topic Modeling (adapted from Blei and Lafferty, ICML 2006), top terms per period:
     - 1880: force, energy, motion, differ, light
     - 1920: atom, theory, electron, energy, measure
     - 1960: radiat, energy, electron, measure, ray
     - 2000: state, energy, electron, magnet, field
     Terms shown: Matter, Electron, Quantum
  7. Word embeddings capture semantic similarities
     Example sentences (each annotated with a target word and its surrounding context):
     - Proficiency programming in Python, Java or C++.
     - Experience in Python, Java or other object-oriented programming languages
     - Statistical modeling through software (e.g. SPSS) or programming language (e.g. Python)
     Terms plotted: French, German, Japanese, Esperanto, Language, Python, Programming, C++, Java, Object-orientated
  8. Embeddings capture entity relationships (adapted from the Stanford NLP GloVe project)
     - Analogy: Woman :: Queen as Man :: ? (Man, Woman, Queen, King)
     - Hierarchies (company and CEO): Verizon and McAdam, Vodafone and Colao, Viacom and Dauman, Exxon and Tillerson, Wal-Mart and McMillon
     - Comparatives and superlatives: Slow, Slower, Slowest; Short, Shorter, Shortest; Strong, Stronger, Strongest
  9. Pretrained embeddings facilitate fast prototyping
     Pipeline stages: Corpus Generation, Corpus Processing, Language Model Generation, Language Model Tuning, Final Application

     Corpus            Twitter      Common Crawl   GoogleNews   Wikipedia
     Tokens            27 B         42-840 B       100 B        6 B
     Vocabulary Size   1.2 M        1.9-2.2 M      3 M          400 k
     Algorithm         GloVe        GloVe          word2vec     GloVe
     Vector Length     25-200 d     300 d          300 d        50-300 d
  10. Problems with pretrained embedding models
     - Casing: abbreviations vs words, e.g. IT vs it
     - Out-of-vocabulary words: domain-specific words and acronyms
     - Polysemy: words with multiple meanings, e.g. drive (a car) vs drive (results); Chef (the job) vs Chef (the language)
     - Multi-word expressions: phrases that have new meanings, e.g. Front-end vs front + end
  11. Tools for developing custom language models
     - Corpus Processing: tokenization, POS tagging, sentence segmentation, dependency parsing (CoreNLP, SyntaxNet)
     - Language Modeling: different word embedding models (GloVe, word2vec, fastText)
  12. Windows capture semantic similarity vs relatedness
     - Small window size: captures semantic similarity, substitutes and word-level differences
     - Large window size: captures semantic relatedness, alternatives and domain-level differences
     Terms plotted: Python, Java, C++, Programming, Language, French, German, Japanese, Esperanto, SPSS, Statistical modeling, Software, Object-orientated
  13. Custom embeddings identified equal opportunity and perks language
  14. Custom embeddings identified ‘soft’ skills and language around experience
  15. I’ve got 300 dimensions… but time ain’t one
  16. Two flavors of dynamic embeddings
     - Trained together (2015, 2016, 2017, 2018): Bamler and Mandt, arXiv:1702.08359; Rudolph and Blei, arXiv:1703.08052
     - Independently trained (2015, 2016, 2017, 2018): Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315
  17. The benefits of dynamic Bernoulli embeddings
     Dynamic embeddings (Bamler and Mandt, arXiv:1702.08359; Rudolph and Blei, arXiv:1703.08052):
     - Data efficient: treats each time slice as a sequential latent variable, which accommodates time slices with sparse data.
     - Does not require stitching/alignment: treating the time slice as a variable ensures embeddings are connected across slices.
     Static embeddings (Kim, Chiu, Hanaki, Hegde and Petrov, arXiv:1405.3515; Kulkarni, Al-Rfou, Perozzi and Skiena, arXiv:1411.3315):
     - Data hungry: each time slice needs sufficient data for a quality embedding.
     - Requires stitching: each time slice is trained independently, so dimensions are not comparable across slices.

  18. Demand for MBAs and PhDs falling
     [Figure: two panels, "All Jobs" and "Data Science Jobs"; x-axis 2016 to 2018; y-axis 0 to 1; series: MBA, PhD]
  19. Data Science skills showing significant shifts
     [Figure: three panels, x-axis 2016 to 2018. "Tableau and PowerBI" (y-axis 0 to 3; series: Tableau, PowerBI); "Python vs Perl" (y-axis 0 to 1; series: Python, Perl); "Hadoop vs Spark" (y-axis 0 to 1; series: Hadoop, Spark)]
  20. regression :: Generalized Linear Models as word2vec :: Exponential Family Embeddings
  21. Members of the Exponential Family of Embeddings (each example shows a datapoint and its context)
     - Binary data, Bernoulli Embedding: Proficiency, programming, Python, Java, C++
     - Count or ordinal data, Poisson Embedding: Mini Bagels, Cream cheese, Milk, Coffee, Orange Juice
     - Continuous data, Gaussian Embedding: JFK-CDG, LGA-DCA, JFK-DFW, LAX-JFK, LAX-LGA
  22. Members of the Exponential Family of Embeddings (each example shows a datapoint and its context)
     - Binary data, Bernoulli Embedding: Proficiency, programming, Python, Java, C++
     - Count or ordinal data, Poisson Embedding: 10 Guidelines for A/B Testing / Emily Robinson; The Value of Null Results / Angel D’az; Words in Space / Rebecca Bilbro; … / James Powell; Why I Use Julia / Katharine Hyatt
  23. Poisson embeddings capture item similarities from shopper behavior (adapted from Rudolph, Ruiz, Mandt and Blei, arXiv:1608.00778)
     - Maruchan chicken ramen: Maruchan creamy chicken ramen, Maruchan oriental flavor ramen, Maruchan roast chicken ramen
     - Yoplait strawberry yogurt: Yoplait apricot mango yogurt, Yoplait strawberry orange smoothie, Yoplait strawberry banana yogurt
     - High inner product combinations: Old Dutch potato chips & Budweiser Lager beer; Lays potato chips & DiGiorno frozen pizza
     - Low inner product combinations: General Mills cinnamon toast & Tide Plus detergent; Beef Swanson Broth soup & Campbell Soup cans
  24. How have data science skills changed over time?
     - Flavors of static word embeddings: The Corpus Issue
     - Considerations for developing custom embedding models
     - Flavors of dynamic models: Dynamic Bernoulli embeddings
     - Other members of the Exponential Family of Embeddings
  25. Thank you PyData NYC!
     Maryam Jahanshahi Ph.D., Research Scientist, TapRecruit
     http://bit.ly/pydatanyc-emb
  26. Thank you PyData NYC!
     Maryam Jahanshahi Ph.D., Research Scientist, TapRecruit
     http://bit.ly/pydatanyc-emb
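
The small-window versus large-window contrast on slide 12 can be illustrated with a minimal sketch of how context words are collected around a target word. The helper name `context_pairs`, the lowercased whitespace tokenization, and the window sizes are assumptions made for illustration; the talk does not specify how its contexts were built.

```python
def context_pairs(tokens, window):
    """Return (word, context) pairs using a symmetric window of `window` tokens."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Context = neighbours on both sides, excluding the target word itself.
        context = tokens[lo:i] + tokens[i + 1:hi]
        pairs.append((word, context))
    return pairs


# Sentence taken from slide 7, lowercased and stripped of punctuation (an assumption).
tokens = "proficiency programming in python java or c++".split()

small = dict(context_pairs(tokens, window=1))  # immediate neighbours only
large = dict(context_pairs(tokens, window=5))  # most of the sentence

print(small["python"])
print(large["python"])
```

With a window of 1, "python" only sees its immediate neighbours, which tends to favor words that can substitute for it. With a window of 5, it sees the whole sentence, which tends to favor topically related words.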

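Slide 17 notes that independently trained time slices require stitching because their dimensions are not comparable. One common way to do this, shown here as a sketch and not necessarily the method used in the cited papers, is orthogonal Procrustes alignment: find the rotation that best maps one slice's vectors onto another's. The toy matrices `emb_2017` and `emb_2018` and the helper `procrustes_align` are hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` so it best matches `target` in the least-squares sense."""
    # The SVD of the cross-covariance yields the optimal orthogonal map.
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation


rng = np.random.default_rng(0)

# Toy "2017" embeddings: 6 words in 4 dimensions (made-up numbers).
emb_2017 = rng.normal(size=(6, 4))

# Simulate an independently trained "2018" model with the same geometry
# but arbitrarily rotated axes.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
emb_2018 = emb_2017 @ q

aligned = procrustes_align(emb_2018, emb_2017)

print(np.allclose(emb_2018, emb_2017))  # raw coordinates are not comparable
print(np.allclose(aligned, emb_2017))   # after alignment they are
```

In real data the two slices would differ by more than a rotation, and the residual after alignment is what gets interpreted as semantic change. Jointly trained dynamic embeddings avoid this step altogether.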
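
The exponential family embeddings on slides 20 to 23 share one recipe: an item's embedding is combined with the context vectors of its neighbours through an inner product, and the result parameterizes a distribution suited to the data type. The sketch below assumes random toy vectors, a logistic link for the Bernoulli case, and a log link for the Poisson case. All names (`rho`, `alpha`, `natural_parameter`) are illustrative, and other link functions are possible.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_items = 8, 5

# Each item has an embedding vector (rho) and a context vector (alpha).
rho = rng.normal(scale=0.1, size=(n_items, dim))
alpha = rng.normal(scale=0.1, size=(n_items, dim))


def natural_parameter(item, context_ids, context_values):
    """Inner product of the item's embedding with the weighted sum of context vectors."""
    context_sum = (alpha[context_ids] * context_values[:, None]).sum(axis=0)
    return rho[item] @ context_sum


def bernoulli_mean(eta):
    """Binary data: probability that the item is present given its context."""
    return 1.0 / (1.0 + np.exp(-eta))


def poisson_mean(eta):
    """Count data: expected count of the item given its context."""
    return np.exp(eta)


context_ids = np.array([1, 2, 4])

# Binary data (e.g. words in a window): context values are all ones.
p_present = bernoulli_mean(natural_parameter(0, context_ids, np.ones(3)))

# Count data (e.g. items in a basket): context values are purchase counts.
counts = np.array([2.0, 1.0, 3.0])
expected_count = poisson_mean(natural_parameter(0, context_ids, counts))

print(p_present, expected_count)
```

Training would adjust `rho` and `alpha` to maximize the likelihood of the observed data. This sketch covers only the forward computation, which is also why a high inner product between two items signals that they tend to occur together.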