LinkedIn Skills: Large-Scale Topic Extraction
and Inference
Mathieu Bastian
LinkedIn Corporation ©2014 All Rights Reserved
The World’s Largest Professional Network
• 313M+ members worldwide
• 2 new members per second
• 100M+ monthly unique visitors
• 3M+ company pages
Connecting Talent and Opportunity. At scale…
LinkedIn Profile
• 313M+ profiles in 200+ countries
• Organized into sections
– Standardized: Companies, Titles, Industry, Location, etc.
– Unstandardized: Text (Summary, Position description, Specialties)
• Skills & Endorsements section
– Introduced in 2011
– Limited to 50 skills per profile
Skills at LinkedIn
• Key component of the professional identity
• Dictionary of 45k+ skills in English
• Members have diverse skills
– Java Programming
– Ballet
– Politics
– Bow Hunting
• Many of these are long-tail
[Figure: Example of a Skills section on a LinkedIn profile]
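One way to check a long-tail claim like this is to measure how much of all skill occurrences the k most frequent skills cover (the speaker notes quote the top 5,000 skills covering 95% of occurrences). The sketch below is illustrative only: `top_k_coverage` and the toy counts are hypothetical, not LinkedIn data or code.

```python
from collections import Counter


def top_k_coverage(skill_counts, k):
    """Fraction of all skill occurrences covered by the k most frequent skills."""
    total = sum(skill_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in Counter(skill_counts).most_common(k))
    return top / total


# Hypothetical toy counts (assumption, not real data).
toy_counts = {"Java Programming": 900, "Politics": 60, "Ballet": 30, "Bow Hunting": 10}
coverage_top1 = top_k_coverage(toy_counts, 1)
```

In this toy example a single skill covers most occurrences while the remaining three are rare, which is the shape the slide describes.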
Folksonomy creation
• Create a folksonomy of skills based on LinkedIn profiles
• Leverage the “specialties” section
• Detect comma-separated lists and extract skill phrases
• Use a stop-list and exclude other entities (e.g. companies, titles, degrees)
• 150k skill phrases extracted after removing long-tail noise
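The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the function names and the `MIN_COMMAS` cutoff are assumptions, while the minimum of 20 occurrences comes from the speaker notes.

```python
from collections import Counter

MIN_COMMAS = 2        # assumption: commas needed to treat a paragraph as a list
MIN_OCCURRENCES = 20  # minimum-occurrence threshold quoted in the speaker notes


def looks_like_list(text, min_commas=MIN_COMMAS):
    """Heuristic: a specialties paragraph with enough commas is a list."""
    return text.count(",") >= min_commas


def extract_phrases(text):
    """Split a comma-separated string into normalized, lower-cased phrases."""
    phrases = []
    for part in text.split(","):
        phrase = " ".join(part.split()).strip(" .").lower()
        if phrase:
            phrases.append(phrase)
    return phrases


def build_skill_phrases(specialties, known_entities, min_occurrences=MIN_OCCURRENCES):
    """Count phrases across profiles, then drop rare phrases and known entities."""
    counts = Counter()
    for text in specialties:
        if not looks_like_list(text):
            continue
        # Count each phrase at most once per profile.
        counts.update(set(extract_phrases(text)))
    return {
        phrase: count
        for phrase, count in counts.items()
        if count >= min_occurrences and phrase not in known_entities
    }
```

Here `known_entities` stands in for the stop-list plus the company, title and degree dictionaries; it is expected to hold lower-cased strings.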
Disambiguation
• Need to add context to differentiate skill phrases with multiple meanings (e.g. NLP = Natural Language Processing, NLP = Neuro-linguistic programming)
• Different meanings have different sets of related phrases
• Use Jaccard similarity on LinkedIn profiles to find related phrases, then SVD + KMeans to identify clusters of phrases
References: R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval, volume 463
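As a rough illustration of the first step, Jaccard similarity between two phrases can be computed over the sets of profiles that mention each phrase. The data layout (`profiles_by_phrase`, mapping a phrase to a set of profile ids) and the function names are assumptions made for this sketch; the SVD + KMeans clustering step is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of profile ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def related_phrases(phrase, profiles_by_phrase, top_n=5):
    """Rank other phrases by Jaccard similarity to `phrase`, dropping zero scores."""
    target = profiles_by_phrase[phrase]
    scored = [
        (jaccard(target, members), other)
        for other, members in profiles_by_phrase.items()
        if other != phrase
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:top_n] if score > 0]


# Hypothetical toy data: phrase -> ids of profiles that list it.
toy_profiles = {
    "nlp": {1, 2, 3, 4},
    "machine learning": {1, 2, 5},
    "hypnosis": {3, 4, 6},
    "ballet": {7},
}
```

In the toy data, "nlp" is equally related to "machine learning" and "hypnosis", which is the kind of mixed neighbourhood that suggests a phrase has more than one meaning.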
De-duplication
• Need to group phrases with similar meaning together. Examples:
– Acronyms: B2B, Business to Business
– Synonyms: Java Programming, Java Development
– Typos: Government Liason
• Many of the skill phrases could be tied to a Wikipedia page
• Built a Mechanical Turk (www.mturk.com) task to find the Wikipedia page associated with a skill phrase
[Figure: the phrases “Java programming”, “Java development” and “Java” form one cluster mapped to http://en.wikipedia.org/wiki/Java_(programming_language)]
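Once each phrase has a Wikipedia page, the grouping itself is simple. The sketch below assumes the MTurk results are available as a phrase-to-page mapping and, following the speaker notes, labels each cluster with its most frequent phrase. The function name, the counts and the `example-page-b2b` identifier are hypothetical.

```python
from collections import defaultdict


def group_by_wikipedia_page(phrase_to_page, phrase_counts):
    """Group phrases that map to the same page; label by most frequent phrase."""
    clusters = defaultdict(list)
    for phrase, page in phrase_to_page.items():
        clusters[page].append(phrase)

    labeled = {}
    for phrases in clusters.values():
        # Most frequent phrase becomes the label; ties broken by name.
        label = max(phrases, key=lambda p: (phrase_counts.get(p, 0), p))
        labeled[label] = sorted(phrases)
    return labeled


JAVA_PAGE = "http://en.wikipedia.org/wiki/Java_(programming_language)"
B2B_PAGE = "example-page-b2b"  # hypothetical placeholder, not a real URL

toy_mapping = {
    "Java programming": JAVA_PAGE,
    "Java development": JAVA_PAGE,
    "Java": JAVA_PAGE,
    "B2B": B2B_PAGE,
    "Business to Business": B2B_PAGE,
}
# Hypothetical occurrence counts.
toy_counts = {
    "Java": 500,
    "Java programming": 300,
    "Java development": 120,
    "B2B": 80,
    "Business to Business": 40,
}
toy_clusters = group_by_wikipedia_page(toy_mapping, toy_counts)
```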
Folksonomy creation summary
• Extraction based on 12M LinkedIn profiles with “specialties”
• Extracted 150k skill phrases
• Clustered related phrases, adding the industry context to ambiguous phrases
• De-duplication using MTurk
• Final master list contains 50k skills
[Figure: Examples of synonyms of “Microsoft Office”]
Inference and Recommendation
Skills Inference and Recommendation
• Goal was to boost skills adoption with a recommender system: “suggested skills”
• Inferring the skills members have is similar to discovering latent attributes in profiles
• Develop a collaborative filtering solution using profile attributes
References: A. Mislove et al. You are who you know: Inferring user profiles in online social networks.
R. Jäschke et al. Tag recommendations in folksonomies.
[Figures: Skills Typeahead on LinkedIn; Suggested Skills]
Features
• Large number of standardized profile attributes (i.e. each can be represented by a unique identifier)
• Members with similar profile attributes are likely to have similar skills (e.g. if you work at Apple, you probably know “Mac OS”)

Type                          Example                     Cardinality
Title (Headline)              Product Manager             Thousands
Function                      Engineering                 Dozens
Industry                      Healthcare                  Dozens
Title (Employment Position)   Product Manager             Thousands
Company                       LinkedIn                    Millions
Group membership              Healthcare Professionals    Millions
Skills                        Matlab                      Thousands
Problem
• Calculate the likelihood that a member has a given skill, given their profile attributes
• No direct user similarity metric
• Large number of features (e.g. 3M companies) and 50k classes
[Formula not recovered; its symbols denoted the set of profile attributes and the folksonomy of skills]
Naïve Bayes Classifier
• Used a Naïve Bayes classifier to produce inferred skills
• Training data based on members who already have skills
• Result is a ranking of inferred skills, which can be used directly in “suggested skills”
• Evaluation methodology
– AUC for each skill
– P@k and Recall for evaluating the recommendations
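The slide's formula did not survive extraction, so the following is only a generic sketch of how a Naïve Bayes ranking over profile attributes could look, not the model used at LinkedIn. It scores each skill as log P(skill) plus the sum of log P(attribute | skill) over the attributes present on the profile, with add-alpha smoothing. The class name, the smoothing choice and the `company:`/`title:` attribute encoding are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSkills:
    """Toy Naive Bayes ranker: profile attributes -> ranked skills (a sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing strength (assumption)
        self.skill_counts = Counter()           # members having each skill
        self.attr_counts = defaultdict(Counter) # skill -> attribute -> members
        self.n_members = 0

    def fit(self, members):
        """members: iterable of (attributes, skills) pairs, both sets of strings."""
        for attributes, skills in members:
            self.n_members += 1
            for skill in skills:
                self.skill_counts[skill] += 1
                self.attr_counts[skill].update(attributes)
        return self

    def score(self, skill, attributes):
        """log P(skill) + sum of log P(attr | skill) over present attributes."""
        n_skill = self.skill_counts[skill]
        log_p = math.log(n_skill / self.n_members)
        denom = n_skill + 2 * self.alpha
        for attr in attributes:
            log_p += math.log((self.attr_counts[skill][attr] + self.alpha) / denom)
        return log_p

    def rank(self, attributes, top_k=10):
        """Return up to top_k skills seen in training, best score first."""
        scored = [(self.score(s, attributes), s) for s in self.skill_counts]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [skill for _, skill in scored[:top_k]]


# Hypothetical training data: (profile attributes, skills already on profile).
toy_members = [
    ({"company:Apple", "title:Engineer"}, {"Mac OS", "Objective-C"}),
    ({"company:Apple", "title:Designer"}, {"Mac OS"}),
    ({"company:LinkedIn", "title:Engineer"}, {"Hadoop"}),
    ({"company:LinkedIn", "title:Data Scientist"}, {"Hadoop", "Matlab"}),
]
model = NaiveBayesSkills().fit(toy_members)
suggested = model.rank({"company:Apple"}, top_k=4)
```

Because the output is already a ranked list, it can feed a "suggested skills" prompt directly, even for a member with no skills on their profile.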
Evaluation
• Evaluate how well we can predict the skills members have
[Figures: ROC of skill “Hadoop”; Distribution of ROC across all skills]
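The metrics named in the evaluation methodology can be computed as below. This is a plain-Python sketch with hypothetical function names and toy inputs; the AUC uses the pairwise definition (the probability that a positive example outscores a negative one, with ties counted as one half), which is quadratic and only suitable for small examples.

```python
def precision_at_k(ranked, relevant, k):
    """Share of the top-k recommended skills that the member actually has."""
    if k <= 0:
        return 0.0
    return sum(1 for skill in ranked[:k] if skill in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Share of the member's actual skills found in the top-k recommendations."""
    if not relevant:
        return 0.0
    return sum(1 for skill in ranked[:k] if skill in relevant) / len(relevant)


def auc(scores, labels):
    """Pairwise AUC for one skill: scores per member, labels 1 if they have it."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy inputs.
toy_ranked = ["Hadoop", "Java", "Ballet", "Matlab"]
toy_relevant = {"Hadoop", "Matlab"}
```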
Results
• 12X improvement in conversion using “suggested skills”
[Figure: without “suggested skills” vs. with “suggested skills”]
Our Contributions
• End-to-end creation of a skills folksonomy based on the free-text specialties section
• Efficient inferred skills model with good offline performance
• Skills recommender system based on profile attributes
Thank You

LinkedIn Skills: RecSys Conference 2014

Editor's Notes

  1. Skills are a key component of a member’s professional identity. It’s very important to have a broad and compelling dictionary of skills so members can express their competencies and recruiters can find members with those skills. Today, the dictionary contains more than 45k skills. These include the things most people expect, such as PowerPoint, Matlab or Public Speaking, but also soft skills and rare skills. In fact, the distribution of skill occurrences is long-tailed: the top 5,000 skills are enough to cover 95% of occurrences. In other words, most of our skills are rare. Yet they are important, as members expect all industries to be represented in detail. It’s important to note that our definition of skills goes beyond skills in the strict sense and also includes areas of expertise. For instance, Natural Gas is not a skill but is a valid area of expertise a member might want to add to their profile.
  2. When we started looking at this problem, it didn’t take us long to realize that we couldn’t leverage any existing list of skills, mostly because none was broad enough. Instead, we decided to extract skills directly from profiles and create a master list. We knew we would face challenges such as duplicates and disambiguation, but at least we knew this kind of free-text extraction had been done before, and the list would be based on members’ data. At the time, LinkedIn had a “specialties” section on the profile. It was free text, but we noticed that members would often enumerate keywords, which were often skills. We built a simple algorithm that counts the number of commas in a paragraph to decide whether it is a comma-separated list. After extracting phrases, we removed other known entities such as titles or companies. Fortunately, LinkedIn possesses this data as well, and it wasn’t too difficult to filter them out. Some cases were in a grey zone though. For instance, Computer Science is both a skill and a field of study. Eventually, this process created about 150k skill phrases. We used a minimum threshold of 20 occurrences.
  3. Then, we tackled the problem of disambiguating these skill phrases. Many of them can have multiple meanings, especially abbreviations and acronyms. For instance, NLP can mean either Natural Language Processing or Neuro-Linguistic Programming. There is no right or wrong answer, and we should be equipped with the tools to recognize one or the other based on the context. A common solution to this problem is to use the set of related phrases. The intuition is that two different meanings would have different sets of related phrases. For instance, here you can see the related phrases of two meanings of “Angels”. We define how skill phrases are related using Jaccard similarity over LinkedIn profiles.
  4. The other important issue with folksonomies is duplicates. I’ve listed here a few of the common patterns: acronyms, abbreviations, synonyms and typos. There are data mining techniques to help cluster those phrases together, but we started with something even simpler than that. During a small-scale experiment, we observed that a majority of skill phrases could be tied to a Wikipedia page. We then built an MTurk task which asked turkers to find the Wikipedia page associated with a phrase. Finally, phrases that mapped to the same Wikipedia page were grouped together and the most frequent phrase was chosen as the label.
  5. Once we had a good skills master list, it was released and members were allowed to add skills to their profile using a typeahead. Our goal, though, was to maximize the number of members with skills on LinkedIn, so we looked for ways to suggest profile edits and designed a prompt that we named “suggested skills”. The user would be asked whether or not they have these skills. This problem is quite similar to the discovery of latent attributes in profiles. In other words, you are inferring the attributes of an incomplete profile using the rest of the profile, or any other information available. Our goal was to have recommendations even if the user had no skills on their profile, so the algorithm would have to be based on something other than previously added skills. Just recommending popular skills wouldn’t be very relevant either. Using the member’s network is a good idea, but some members have small networks and our goal was to maximize coverage. Finally, we looked at using standardized profile attributes to bootstrap our inference algorithm.
  6. Each profile is composed of text but also of standardized entities such as title, function, industry, field of study, etc. Coverage varies across these attributes. Some are very frequent, such as industry, and some are rarer (e.g. group membership). We identified all attributes that could be predictive of skills.
  7. Our goal was then to model this problem and find a classification method to infer the likelihood that a member has a skill. The number of features was quite large, so we needed a system that would scale easily. As mentioned, we don’t have a single user similarity metric but instead a list of different profile attributes that, when shared, can predict the likelihood of skills. Each member can have a different set of attributes. Some users have only an industry; others have multiple companies, multiple titles, etc.
  8. What are the true positives and stuff