2. The World’s Largest Professional Network
313M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
3. LinkedIn Profile
313M+ profiles in 200+ countries
Organized into sections
– Standardized: Companies, Titles, Industry,
Location etc.
– Unstandardized: Text (Summary, Position
description, specialties)
Skills & Endorsements section
– Introduced in 2011
– Limited to 50 skills per profile
4. Skills at LinkedIn
Key component of the
professional identity
Dictionary of 45k+ skills in
English
Members have diverse skills
– Java Programming
– Ballet
– Politics
– Bow Hunting
Many of these are long-tail
Figure: Example of a Skills section on a LinkedIn profile
6. Folksonomy creation
Create a folksonomy of skills based on LinkedIn profiles
Leverage the “specialties” section
Detect comma-separated lists and extract skill phrases
Use stop-list and exclude other entities (e.g. companies, titles,
degrees)
150k skill phrases extracted after removing long-tail noise
7. Disambiguation
Need to add context to differentiate skill phrases with multiple
meanings (e.g. NLP = Natural Language Processing,
NLP = Neuro-linguistic programming)
Different meanings have different sets of related phrases
Use Jaccard Similarity on LinkedIn profiles to find related phrases,
then SVD + KMeans to identify clusters of phrases
References: R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval, volume 463
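The SVD + KMeans step can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline used at LinkedIn: it works on a raw phrase-by-profile co-occurrence matrix for profiles that mention the ambiguous phrase, uses NumPy's SVD, and replaces a library KMeans with a small hand-written Lloyd's loop with deterministic farthest-point initialisation. The function name, parameters and toy data are all hypothetical.

```python
import numpy as np

def cluster_related_phrases(phrases, profiles, n_components=2, n_clusters=2, n_iter=20):
    """Group related phrases into candidate senses (hypothetical helper)."""
    # Binary matrix: X[i, j] = 1 if phrase i appears on profile j.
    X = np.array([[1.0 if p in prof else 0.0 for prof in profiles] for p in phrases])
    # Truncated SVD: keep the leading directions as a low-dimensional embedding.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :n_components] * s[:n_components]
    # Deterministic farthest-point initialisation of the centroids.
    centers = [Z[0]]
    while len(centers) < n_clusters:
        dists = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[int(np.argmax(dists))])
    centers = np.array(centers)
    # Lloyd's iterations: assign each phrase to the nearest centroid, then update.
    for _ in range(n_iter):
        distances = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = Z[labels == k].mean(axis=0)
    clusters = {}
    for phrase, label in zip(phrases, labels):
        clusters.setdefault(int(label), set()).add(phrase)
    return list(clusters.values())

# Toy profiles that all mention "NLP" (made-up data).
phrases = ["machine learning", "text mining", "hypnosis", "coaching"]
profiles = [
    {"machine learning", "text mining"},
    {"machine learning", "text mining"},
    {"machine learning"},
    {"hypnosis", "coaching"},
    {"hypnosis", "coaching"},
    {"coaching"},
]
senses = cluster_related_phrases(phrases, profiles)
```

On this toy input the two clusters correspond to the two meanings of NLP. A real run would need many more profiles and a principled choice of the number of clusters.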
8. De-duplication
Need to group phrases with similar meaning together. Examples:
– Acronyms: B2B, Business to Business
– Synonyms: Java Programming, Java Development
– Typos: Government Liason
Many of the skill phrases could be tied to a Wikipedia page
Built Mechanical Turk (www.mturk.com) task to find the Wikipedia
page associated with a skill phrase
Cluster: Java programming, Java development, Java → http://en.wikipedia.org/wiki/Java_(programming_language)
9. Folksonomy creation summary
Extraction based on 12M LinkedIn profiles with “specialties”
Extracted 150k skill phrases
Clustered related phrases, adding the industry context to ambiguous
phrases
De-duplication using MTurk
Final master list contains 50k skills
Examples of synonyms of
“Microsoft Office”
11. Skills Inference and Recommendation
Goal was boosting skills adoption with a recommender system:
“suggested skills”
Inferring the skills members have, similar to discovering latent
attributes in profiles
Develop a collaborative filtering solution using profile attributes
References: A. Mislove et al. You are who you know: Inferring user profiles in online social networks.
R. Jäschke et al. Tag recommendations in folksonomies.
Figures: Skills Typeahead on LinkedIn; Suggested Skills
12. Features
Large number of standardized profile attributes (i.e. can be
represented by a unique identifier)
Members with similar profile attributes are likely to have similar
skills (e.g. if you work at Apple, you probably know “Mac OS”)
Type                          Example                    Cardinality
Title (Headline)              Product Manager            Thousands
Function                      Engineering                Dozens
Industry                      Healthcare                 Dozens
Title (Employment Position)   Product Manager            Thousands
Company                       LinkedIn                   Millions
Group membership              Healthcare Professionals   Millions
Skills                        Matlab                     Thousands
13. Problem
Calculate the likelihood that a member has a given
skill, given his profile attributes
No direct user similarity metric
Large number of features (e.g. 3M companies) and 50k classes
Formula legend (symbols lost in extraction): the set of profile attributes; the folksonomy of skills
14. Naïve Bayes Classifier
Used a Naïve Bayes Classifier to produce inferred skills
Training data based on members who already have skills
Result is a ranking of inferred skills, which can directly be used in
“suggested skills”
Evaluation methodology
– AUC for each skill
– P@k and Recall for evaluating the recommendations
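A minimal sketch of how a Naïve Bayes ranking over standardized profile attributes might look. The slide's own formula did not survive extraction, so this is not the authors' exact model: the class name `SkillNaiveBayes`, the Laplace smoothing parameter `alpha`, the choice to score only the attributes a member actually has, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class SkillNaiveBayes:
    """Ranks skills by log P(skill) + sum of log P(attribute | skill)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing (assumed)
        self.skill_counts = Counter()           # members having each skill
        self.attr_counts = defaultdict(Counter) # attribute counts per skill
        self.vocab = set()                      # all attribute ids seen
        self.n_members = 0

    def fit(self, members):
        # members: iterable of (set of attribute ids, set of skills)
        for attrs, skills in members:
            self.n_members += 1
            self.vocab.update(attrs)
            for skill in skills:
                self.skill_counts[skill] += 1
                self.attr_counts[skill].update(attrs)
        return self

    def rank(self, attrs, top_k=5):
        v = len(self.vocab)
        scores = {}
        for skill, n_s in self.skill_counts.items():
            score = math.log(n_s / self.n_members)  # log prior
            total = sum(self.attr_counts[skill].values())
            for a in attrs:
                num = self.attr_counts[skill][a] + self.alpha
                score += math.log(num / (total + self.alpha * v))
            scores[skill] = score
        # Highest score first; ties broken alphabetically.
        return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]

# Made-up training members with hypothetical attribute ids.
train = [
    ({"company:Apple", "title:Engineer"}, {"Mac OS", "Objective-C"}),
    ({"company:Apple", "title:Designer"}, {"Mac OS"}),
    ({"company:ExampleData", "title:Engineer"}, {"Hadoop"}),
    ({"company:ExampleData", "title:Engineer"}, {"Hadoop", "Java"}),
]
model = SkillNaiveBayes().fit(train)
suggested = model.rank({"company:Apple"}, top_k=2)
```

Because the score only sums over attributes that are present, members with a single attribute (say, only an industry) and members with many attributes can be ranked by the same model.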
15. Evaluation
Evaluate how well we can predict the skills members have
Figures: ROC of skill “Hadoop”; Distribution of ROC across all skills
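The metrics named in the evaluation methodology (AUC per skill, P@k and recall for the recommendations) can be sketched in a few lines. These are the standard textbook definitions rather than the authors' evaluation code, and the function names and toy inputs are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k suggestions the member actually has."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for s in ranked[:k] if s in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the member's skills found in the top-k suggestions."""
    if not relevant:
        return 0.0
    return sum(1 for s in ranked[:k] if s in relevant) / len(relevant)

def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: suggestions for one member, and scores for one skill.
ranked = ["Hadoop", "Java", "Ballet", "Pig"]
relevant = {"Hadoop", "Pig", "Hive"}
p2 = precision_at_k(ranked, relevant, 2)   # 1 of the top 2 is relevant
r4 = recall_at_k(ranked, relevant, 4)      # 2 of the 3 relevant skills found
skill_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise AUC above is quadratic in the number of members, which is fine for a sketch but would be replaced by a rank-based computation at scale.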
16. Results
12X improvement in conversion using “suggested skills”
Figures: Without “suggested skills”; With “suggested skills”
17. Our Contributions
End-to-end creation of a skills folksonomy based on free-text
specialties section
Efficient inferred skills model with good offline performance
Skills recommender system based on profile attributes
Skills are a key component of the member’s professional identity. It’s very important to have a broad and compelling dictionary of skills so members can express their competencies and recruiters can find members for those skills.
Today, the dictionary contains more than 45k skills. These include the things most people expect, such as PowerPoint, Matlab or Public Speaking, but also soft skills and rare skills. In fact, the distribution of skill occurrences is long-tailed: the top 5000 skills are enough to cover 95% of occurrences. In other words, most of our skills are rare. Yet they are important, as members expect all industries to be represented in detail.
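As a small illustration of the coverage claim, the share of occurrences covered by the top-N skills can be computed from a table of occurrence counts. The helper name and the counts below are made up; only the idea of measuring top-N coverage comes from the notes.

```python
def top_n_coverage(counts, n):
    """Share of all skill occurrences accounted for by the n most frequent skills."""
    ordered = sorted(counts.values(), reverse=True)
    total = sum(ordered)
    return sum(ordered[:n]) / total if total else 0.0

# Hypothetical occurrence counts for a tiny dictionary.
counts = {"PowerPoint": 90, "Matlab": 5, "Ballet": 3, "Bow Hunting": 2}
coverage_top_2 = top_n_coverage(counts, 2)  # (90 + 5) / 100
```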
It’s important to note that our definition of skills goes beyond skills in the strict sense and also includes areas of expertise. For instance, Natural Gas is not a skill but is a valid area of expertise one might want to add to their profile.
When we started looking at this problem, it didn’t take us long to realize that we couldn’t leverage any existing list of skills, mostly because none were broad enough. Instead, we decided to extract these skills directly from profiles and create a master list. We knew we would face challenges such as duplicates and disambiguation, but at least we knew this had been done before (free-text extraction) and that the list would be based on members’ data.
At the time, LinkedIn had a “specialties” section on the profile. It was free text, but we noticed that members would often enumerate keywords, many of which were skills. We built a simple algorithm that counted the number of commas in a paragraph to decide whether it was a comma-separated list. After extracting phrases, we removed other known entities such as titles or companies. Fortunately, LinkedIn possesses this data as well, so it wasn’t too difficult to filter them out. Some cases fell in a grey zone, though. For instance, Computer Science is both a skill and a field of study.
Eventually, this process produced about 150k skill phrases. We used a minimum threshold of 20 occurrences.
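The extraction steps described here (comma counting, entity filtering, a minimum occurrence threshold) could look roughly like the sketch below. The comma threshold `min_commas`, the lowercasing, the function names and the sample data are assumptions; only the overall procedure and the threshold of 20 occurrences come from the notes.

```python
from collections import Counter

def looks_like_list(text, min_commas=3):
    # Treat a paragraph as a comma-separated list if it has enough commas
    # (the cut-off of 3 is an assumed value).
    return text.count(",") >= min_commas

def extract_phrases(specialties, known_entities, min_count=20, min_commas=3):
    """Count candidate skill phrases across free-text "specialties" sections."""
    counts = Counter()
    for text in specialties:
        if not looks_like_list(text, min_commas):
            continue
        # A set, so each profile contributes a phrase at most once.
        phrases = {p.strip().lower() for p in text.split(",")}
        # Drop empty strings and known non-skill entities (companies, titles...).
        counts.update(p for p in phrases if p and p not in known_entities)
    # Remove long-tail noise below the occurrence threshold.
    return {p: c for p, c in counts.items() if c >= min_count}

# Tiny made-up sample; min_count is lowered so the example yields output.
sample = ["Java, SQL, Hadoop, Google"] * 2 + ["I enjoy long walks, and coffee"]
found = extract_phrases(sample, known_entities={"google"}, min_count=2)
```

Grey-zone phrases such as Computer Science would need an explicit decision about whether they belong in `known_entities`.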
Then we tackled the problem of disambiguating these skill phrases. Many of them can have multiple meanings, especially abbreviations and acronyms. For instance, NLP can mean either Natural Language Processing or Neuro-Linguistic Programming. There is no right or wrong answer, so we need tools that can recognize one meaning or the other based on the context.
A common solution to this problem is to use the set of related phrases. The intuition is that two different meanings would have different sets of related phrases. For instance, here you can see the related phrases of two meanings of “Angels”.
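We define how related two skill phrases are using Jaccard similarity computed over LinkedIn profiles, i.e. the number of profiles containing both phrases divided by the number of profiles containing either one.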
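A minimal sketch of this relatedness measure, assuming each phrase is represented by the set of profile ids on which it appears. The function names, the `top_n` parameter and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a ∩ b| / |a ∪ b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def related_phrases(target, phrase_to_profiles, top_n=3):
    """Rank other phrases by Jaccard similarity of their profile sets."""
    target_profiles = phrase_to_profiles[target]
    scores = [
        (other, jaccard(target_profiles, profiles))
        for other, profiles in phrase_to_profiles.items()
        if other != target
    ]
    # Most similar first; ties broken alphabetically for determinism.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:top_n]

# Made-up profile ids for each phrase.
phrase_to_profiles = {
    "nlp": {1, 2, 3, 4},
    "machine learning": {1, 2, 5},
    "hypnosis": {3, 4},
    "ballet": {9},
}
top = related_phrases("nlp", phrase_to_profiles, top_n=2)
```

Here "hypnosis" scores 2/4 and "machine learning" scores 2/5 against "nlp", which shows how a single ambiguous phrase collects related phrases from both of its meanings.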
The other important issue with folksonomies is duplicates. I’ve listed here a few of the common patterns: acronyms, abbreviations, synonyms and typos. There are data mining techniques that help cluster such phrases together, but we started with something even simpler. During a small-scale experiment, we observed that a majority of skill phrases could be tied to a Wikipedia page. We then built an MTurk task that asked turkers to find the Wikipedia page associated with a phrase.
Finally, phrases that mapped to the same Wikipedia page were grouped together, and the most frequent phrase was chosen as the label.
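The grouping step can be sketched as follows, assuming the MTurk results are available as a mapping from phrase to Wikipedia URL and that phrase frequencies are known. The function name, the counts and the alphabetical tie-break are illustrative assumptions.

```python
from collections import defaultdict

def dedupe(phrase_to_page, phrase_counts):
    """Group phrases by Wikipedia page; label each group with its most frequent phrase."""
    groups = defaultdict(list)
    for phrase, page in phrase_to_page.items():
        groups[page].append(phrase)
    result = {}
    for phrases in groups.values():
        # Most frequent phrase wins; ties fall back to alphabetical order.
        label = max(phrases, key=lambda p: (phrase_counts.get(p, 0), p))
        result[label] = sorted(phrases)
    return result

java_page = "http://en.wikipedia.org/wiki/Java_(programming_language)"
phrase_to_page = {
    "Java programming": java_page,
    "Java development": java_page,
    "Java": java_page,
    "Ballet": "http://en.wikipedia.org/wiki/Ballet",
}
# Hypothetical occurrence counts.
phrase_counts = {"Java": 500, "Java programming": 300, "Java development": 120, "Ballet": 40}
master = dedupe(phrase_to_page, phrase_counts)
```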
Once we had a good skills master list, it was released and members were allowed to add skills to their profile using a typeahead. Our goal, though, was to maximize the number of members with skills on LinkedIn, so we looked for ways to suggest profile edits and designed a prompt that we named “suggested skills”. The user would be asked whether or not they have these skills.
This problem is quite similar to the discovery of latent attributes in profiles. In other words, you are inferring the attributes of an incomplete profile using the rest of the profile, or any other information available.
Our goal was to have recommendations even if the user had no skills on their profile, so the algorithm had to be based on something other than previously added skills. Just recommending popular skills wouldn’t be very relevant either. Using the member’s network is a good idea, but some members have small networks and our goal was to maximize coverage. Finally, we looked at using standardized profile attributes to bootstrap our inference algorithm.
Each profile is composed of text but also of standardized entities such as title, function, industry, field of study etc. Coverage varies across these attributes: some, such as industry, are very frequent, while others are rarer (e.g. group membership). We identified all attributes that could be predictive of skills.
Our goal was then to model this problem and find a classification method to infer the likelihood that a member has a skill. The number of features was quite large, so we needed a system that would scale easily. As mentioned, we don’t have a single user similarity metric but instead a list of different profile attributes that, when shared, can predict the likelihood of skills. Each member can have a different set of attributes: some users have only an industry, while others have multiple companies, multiple titles etc.