Constructing Knowledge Graph for Social Networks in a Deep and Holistic Way

Qi He
Director of Engineering
qhe@linkedin.com
Jaewon Yang
Senior Staff Engineer
jeyang@linkedin.com
Baoxu Shi
Senior Engineer
dashi@linkedin.com

Tutorial’s Agenda
09:00 Introduction
09:05 Overview of LinkedIn’s Knowledge Graph and Applications
09:15 Named Entity Recognition
09:35 Populate Relationships between Social Network Entities
10:00 Scalable Relationship Extraction with Limited Data
10:30 Coffee Break
11:00 Scalable Graph Refinement via Multi-channel Data Ingestion
11:45 Automated Taxonomy Expansion
12:00 Conclusion and Q&A
Part 1: Construct high-quality knowledge graph for social networks
Part 2: Weakly-supervised, scalable social network knowledge graph construction

Introduction
Qi He
Director of Engineering,
LinkedIn

Overview
This tutorial will be successful if:
- You learn the problem statement of constructing a knowledge graph for social networks and its
technical challenges
- You learn the opportunities for tackling the technical challenges of the problem
- You learn the state of the art and our experience with the solutions
Preliminary knowledge
Knowledge Graph Construction is a process of creating structured data: 1) a canonical
representation for every entity, 2) relationships between entities.
Methods: 1) Human curation, 2) AI modeling (ML/NLP), 3) Data ingestion

Problem Statement
- Knowledge Graph Construction for Social Networks
1) input data for each member in the social network is noisy, implicit, and multilingual
2) KG and the social network influence each other via multiple organic feedback loops

Opportunity
A deep and holistic way is the best strategy to tackle the technical challenges.
Deep: develop deep NLP models to deeply understand the input data
- noisy and implicit: train high precision language understanding models by adding small clean
data to the noisy data
- multilingual: expand a single-language KG to multilingual KGs by applying deep transfer
learning models
Holistic: grow social network and KG together via their model interactions
- refine KG by learning deep embeddings from the social network
- grow social network by learning deep embeddings from KG
- launch new products to get explicit feedback on KG from social network members

The Three Technical Questions inside Opportunity
Q1: How can we recognize existing entities and expand new entities from noisy and multilingual
text?
- The encoder and decoder NLU approach
- Pattern + deep learning based auto-taxonomy expansion
Q2: How can we construct entity relationships with limited input data?
- Unsupervised learning
- Semi-supervised learning
- Pre-trained deep learning models (BERT family)
- Cross-lingual transfer learning (Adversarial learning, Multilingual encoder)
Q3: How can we refine KG by ingesting data from social network?
- Embedding-based entity alignment between social network and KG
- Joint representation learning on social network and KG
- Probabilistic member feedback (label/answer) aggregation from social network

Overview of
LinkedIn’s
Knowledge Graph
and Applications
Qi He
Director of Engineering,
LinkedIn

LinkedIn Knowledge Graph (aka Economic Graph)
Member input data
● 675M members
● 50M+ orgs
● 400+ industries
● 200+ countries
● 50K+ skills
● 25K+ titles
Other entity labels in the figure: Certificates, degrees & more…; Roles, occupations; States, cities, postal codes…; Tools, products, technologies…; Specialty

LinkedIn skill taxonomy example
LinkedIn unique asset
Skill identity
● ID: 207
● Definition: http://en.wikipedia.org/wiki/Graphic_design
● Canonical name:
○ EN: Graphic Design
○ Zh_CN: 平面设计
○ ...
● Aliases:
○ "fr_FR": [ "concepteurs graphiques", "editorial design"],
○ "en_US": …..
Skill type
● Skill type: hard → industry experience
● Soft skills
● Hard skills: Tools & Technologies, Spoken languages, Industry experience
Relationships
● Design, Graphic Design, Adobe Photoshop

Skill or not a Skill?
● fear of flying
○ Skills: phobias; self-esteem; stress management; hypnotherapy
○ Titles: hypnotherapist; psychotherapist
● intelligence
○ Skills: national security; military operations; security clearance
○ Titles: Military Intelligence Officer, Tactical Intelligence Officer
● headaches
○ Skills: holistic health; sports injuries; neck pain; nutrition
○ Titles: chiropractor; massage therapist; acupuncturist

Understanding Member skills
● Exclude references to the company
● Include skills that relate to the member with a high confidence score
● Exclude skills with a lower confidence score

To power all LinkedIn products
Knowledge Graph
Jobs Search
Job Recommendation
Recruiter Search
Talent Insights
Jobs SEO
Profile Page
People Search
SEO
New-member onboarding
PYMK
Premium
ProFinder
EGR
Notifications
GSO
Ads
Pages
Sales Navigator
LSI
Merlin
Courses

Unlock the full potential of the LinkedIn Economic Graph

Enable positive flywheel effect in LinkedIn ecosystem
Input signals → Graph construction → Deliver value → Engagement → (back to input signals)

Named Entity
Recognition and
Disambiguation
Jaewon Yang,
LinkedIn

Knowledge Graph
● Set of triples (Source entity, Relation, Target entity)
○ (Bill Gates, Founder, Microsoft)
○ (Microsoft, Located_In, Redmond)
○ ...
● Canonical representation of relations among entities
● Examples:
○ Google Knowledge Graph
○ Microsoft Satori
○ Freebase
○ LinkedIn Economic Graph
[Figure: Bill Gates -[Founder]-> Microsoft -[Located_In]-> Redmond]

Knowledge Graph Construction
Text: Tesla is an Electric vehicle company based in Palo Alto.
Graph:
○ (Tesla Inc., Specialized_In, Electric Vehicle)
○ (Tesla Inc., Located_In, Palo Alto, CA)

Task 1: Named Entity Recognition
Text: Tesla is an Electric vehicle company located in Palo Alto.
Entities: Tesla Inc.; Electric Vehicle; Palo Alto, CA

Task 2: Relation Extraction [Next Section]
Entities: Tesla Inc.; Electric Vehicle; Palo Alto, CA
Relations:
○ (Tesla Inc., Specialized_In, Electric Vehicle)
○ (Tesla Inc., Located_In, Palo Alto, CA)

Named Entity Recognition: Challenges
1. Name Variation
a. Tesla, Tesla Motors, Tesla Inc. … -> Tesla Inc.
2. Ambiguity
a. “Apple” -> Apple Corps VS Apple Inc.
3. Incomplete Entity Dictionary [Later Section on Taxonomy Creation]
4. Multiple Languages [Later Section]

Web-Scale Entity Recognition
[Cucerzan 2007]
Pipeline: Preprocessing → Entity Recognition (Entity Tagging) → Entity Disambiguation
Example: “Tesla Forecast” → Tesla Inc.

Two Step Approach
Entity Recognition → Entity Disambiguation
Example: “Tesla Forecast” → Tesla Inc.

Entity Recognition
1. Encoder: Generating features
2. Decoder: Doing classification
a. Classification results:
i. B-Com, B-Ind, B-Loc: Beginning of company, industry, …
ii. I-Com, I-Ind, I-Loc: Inside of company, industry, …
iii. O: Outside (nothing important)
Example:
● Encoding: text → feature vectors, e.g. [0.1, 0.3, -0.1, …]
● Decoding: Tesla [B-Com] is an Electric [B-Ind] vehicle [I-Ind] company located in Palo [B-Loc] Alto [I-Loc].
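The B-/I-/O tags above still have to be grouped into entity spans. A minimal sketch of that grouping step, assuming whitespace tokens and the tag names from this slide (the helper name `bio_to_spans` is illustrative, not from the tutorial):

```python
from typing import List, Tuple

def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any span that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = "Tesla is an Electric vehicle company located in Palo Alto .".split()
tags = ["B-Com", "O", "O", "B-Ind", "I-Ind", "O", "O", "O", "B-Loc", "I-Loc", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

An I- tag with no matching open span is dropped here; other policies (e.g. treating it as a B- tag) are equally plausible.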

Encoder (Feature Generation)
● Traditional features
○ Bag-of-words: TF-IDF, BM25, ...
● Recent methods: Deep learning embedding
○ Model learns how to generate feature vector (embedding)
○ Word-level embedding
○ Sequence-of-word embedding
○ Sequence-of-character embedding
● IMPORTANT: These encoders are used in later sections as well!

Encoder: Word-level feature
● Word embedding: Learn a latent vector (embedding) w_v for each word v
● For each word in the input text, use its latent vector as an input feature
● How to learn embedding?
○ Based on which words co-occur under the same context
○ context: k-gram window
The clouds are in the sky
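To make "co-occur under the same context" concrete, here is a small sketch that enumerates (center word, context word) pairs from a k-gram window. The window size of 2 and the function name `context_pairs` are assumptions for illustration; embedding methods consume pairs or counts of this kind.

```python
from collections import Counter

def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the clouds are in the sky".split()
pairs = context_pairs(tokens, window=2)
cooccurrence = Counter(pairs)  # co-occurrence counts per (center, context) pair
print(cooccurrence[("the", "sky")], cooccurrence[("sky", "clouds")])
```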

Famous word embeddings
[Mikolov et al. 2013] [Pennington et al. 2014]
● GloVe: embeddings approximate the (log) number of co-occurrences
● Word2vec: embeddings approximate the probability of co-occurring
● How to choose?
○ GloVe is a little bit simpler (e.g., no negative examples), but they are very similar
○ If you use a public embedding, pick the one with the best coverage
○ If you train your own, either one would work
● Limitation of word embedding:
○ Does not consider ordering of words
○ Does not generalize to new words

Encoder: Sequence of Words
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
● RNN: handles short-range context, e.g. “The clouds are in the sky”
● LSTM: handles long-range context, e.g. “I grew up in France … I speak fluent French”

Encoder: Sequence of Characters
Akbik et al. 2018

Encoder: Attention-based
[Bahdanau et al. 2015]
● LSTM: Hidden state (encoding) depends mainly on the previous token
● Attention: Hidden state is computed using all tokens
○ Attention: Weights for each token
○ Worked really well in machine translation

Encoder: Transformer
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
● Compute attention by Query, Key, Value
○ Query: what information are you looking for?
○ Key: how important is each token?
○ Value: the content of each token
● Encoding = sum (attention[i] * value[i] for each token i)
● Attention[i] ~ query * key[i]
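The two formulas above can be sketched for a single query with NumPy. Scaling by the square root of the key dimension and normalizing with a softmax are the standard Transformer choices; the toy keys and values below are made up for illustration.

```python
import numpy as np

def attention(query, keys, values):
    """Single-query attention: encoding = sum_i attention[i] * value[i]."""
    d = keys.shape[-1]
    scores = keys @ query / np.sqrt(d)       # attention[i] ~ query * key[i]
    weights = np.exp(scores - scores.max())  # softmax, shifted for stability
    weights /= weights.sum()
    return weights @ values, weights

keys = np.eye(3)                              # one key per token
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.0, 10.0, 0.0])            # looks for the second token
encoding, weights = attention(query, keys, values)
print(weights.round(3), encoding.round(3))
```

Because the query matches the second key, almost all of the weight lands on the second token and the encoding is close to that token's value.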

http://jalammar.github.io/illustrated-transformer/

● Multi-head attention?
○ Use multiple keys, values, queries
○ Helps understanding different relations among tokens
■ I am taller than Jim (comparison)
■ He is none other than Bill Gates (distinguishing)

Classification: CRF
http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/
● Labels must form a valid sequence
○ Beginning (B-) must come before inside (I-)
○ If we predict each token independently, nonsense may occur, e.g.: Electric [I-Ind] vehicle [B-Ind]
● Use CRF to predict the entire sequence
Example: Tesla is Electric vehicle company
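A sketch of why sequence-level decoding helps, using Viterbi decoding over the example sentence. The emission scores and the transition scores are hand-set, hypothetical numbers; a trained CRF would learn them from data. Only the industry tags are modeled to keep the example small.

```python
import math

TAGS = ["O", "B-Ind", "I-Ind"]

def transition_score(prev, cur):
    """Hypothetical transition scores; -inf forbids invalid BIO transitions."""
    if cur == "I-Ind":
        if prev == "B-Ind":
            return 0.5       # assumed learned preference for B -> I
        if prev == "I-Ind":
            return 0.0
        return -math.inf     # I- may not follow O or start a sentence
    return 0.0

def viterbi(emissions):
    """emissions: one {tag: score} dict per token. Returns the best tag sequence."""
    best = {t: (emissions[0][t] + transition_score("O", t), [t]) for t in TAGS}
    for scores in emissions[1:]:
        new_best = {}
        for cur in TAGS:
            candidates = [
                (best[p][0] + transition_score(p, cur) + scores[cur], best[p][1] + [cur])
                for p in TAGS
            ]
            new_best[cur] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]

# Tokens: Tesla, is, Electric, vehicle, company
emissions = [
    {"O": 2.0, "B-Ind": 0.1, "I-Ind": 0.0},
    {"O": 2.0, "B-Ind": 0.0, "I-Ind": 0.0},
    {"O": 0.0, "B-Ind": 1.0, "I-Ind": 1.2},
    {"O": 0.0, "B-Ind": 1.1, "I-Ind": 1.0},
    {"O": 2.0, "B-Ind": 0.0, "I-Ind": 0.0},
]
greedy = [max(e, key=e.get) for e in emissions]  # independent per-token prediction
decoded = viterbi(emissions)
print(greedy)
print(decoded)
```

The per-token prediction reproduces the invalid "Electric [I-Ind] vehicle [B-Ind]" labeling, while the sequence-level decoder returns a valid B-Ind, I-Ind span.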

Putting it All Together: LSTM + CRF
[Lample et al. 2016]
● Variations
○ Word-LSTM + Char-LSTM + CRF [Akbik et al. 2018]
○ Pretrained Transformer (BERT) + CRF

Named Entity Recognition: Our Experiences
● Having a good encoder (good features) is most important
○ Deep learning models are very powerful, but getting enough training data can be tricky
■ Later today, we discuss how to address this by pretraining
○ If available, domain-specific features are still very useful
■ e.g., if we have a list of famous companies, it can be used to generate features
● CRF is easy to add and boosts performance incrementally

Entity Disambiguation
● Problem Definition
○ Input: Text span
○ Output: Entity ID
Example: “Tesla Forecast” → Tesla Inc.

● Feature generation (Encoder)
○ Encoder for text span: Text encoders (LSTM, Transformer, …)
○ Encoder for entity-related features
■ Text features (entity description): Text encoders
■ Graph features: Graph encoders [Later]
■ Numerical features (frequency statistics): No encoder needed
● Making prediction (Decoder)
○ Multiclass classification

Conclusion
● Two Problems:
○ Entity Recognition: Identify text spans
○ Entity disambiguation: If there are multiple matching entities, find the best match
● Modeling architecture: Encoder and Decoder
○ Encoder: Generate features
○ Decoder: Make classification using the features
● We will use the same encoders in later sections

Populate
Relationships
between Social
Network Entities
LinkedIn

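Relation Extraction
Text: Tesla is an Electric vehicle company based in Palo Alto.
Entities: Tesla Inc.; Electric Vehicle; Palo Alto, CA
Relations:
○ (Tesla Inc., Specialized_In, Electric Vehicle)
○ (Tesla Inc., Located_In, Palo Alto, CA)
● Input: (Source entity, Target entity, Sentences)
● Output: Relation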

Relation Extraction: Challenges
● Challenges
○ Linguistic variation
■ {“Based in”, “Headquartered in”, …} -> Located_In
○ Ambiguity / Implicitness
■ “Electric Vehicle company” -> Specialized_In

Relation Extraction: Machine Learning Methods
● Supervised method:
○ Learn classifier from training data
○ Features
■ Text features using text encoders (Transformer, LSTM, …) [Previous section]
■ Graph features [Later section]
○ Similar to entity disambiguation from the previous section
● Semi-supervised method [This section]
● Unsupervised method [This section]

Distant Supervision
● Downside of supervised methods: labels are sparse and expensive to get
● Solution: Leverage an existing database (e.g., Freebase) to label the text corpus
○ Mintz et al. 2009 “If two entities belong to a certain relation, any sentence containing those
two entities is likely to express that relation”

Distant Supervision
[Mintz et al. 2009]
Known fact: (Microsoft, Located_In, Redmond)
Matching sentences:
● Microsoft is based in Redmond
● Microsoft is headquartered in Redmond
● Microsoft has its main campus in Redmond
Positive Example
Source (A): Microsoft
Target (B): Redmond
Sentences: (A is based in B, A is headquartered in B, … )
Relation: Located_In

Distant Supervision
[Mintz et al. 2009]
No known relation between Larry Page and Microsoft
Matching sentences:
● Larry Page said about Microsoft
● Larry Page commented on Microsoft
Negative Example
Source (A): Larry Page
Target (B): Microsoft
Sentences: (A said about B, A commented on B, … )
Relation: Nothing
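The positive and negative examples above can be generated mechanically. A minimal sketch, assuming a toy knowledge base held in a dict and plain substring matching for entity mentions (a real pipeline would use the entity recognizer from the previous section); `KB` and `distant_label` are illustrative names.

```python
# Toy knowledge base: (source, target) -> relation. Assumed for illustration.
KB = {("Microsoft", "Redmond"): "Located_In"}

def distant_label(source, target, sentence, kb=KB):
    """Return (pattern, relation) if the sentence mentions both entities, else None."""
    if source not in sentence or target not in sentence:
        return None
    pattern = sentence.replace(source, "A").replace(target, "B")
    relation = kb.get((source, target), "Nothing")  # no known fact -> negative example
    return pattern, relation

print(distant_label("Microsoft", "Redmond", "Microsoft is headquartered in Redmond"))
print(distant_label("Larry Page", "Microsoft", "Larry Page commented on Microsoft"))
```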

Data Programming
Snorkel [Ratner et al. 2018]
Labeling functions (Distant Supervision, Rule-based Annotation, Crowdsourcing) → Aggregator → Training Data
● There are other ways to get weak labels
● Can we combine weak labels to get better labels?

Data Programming
● How to aggregate? If weak labels are 1, 0, 1:
○ Majority voting: label = 1
○ Generative Model (GM):
■ Label vector 𝛬: [1, 0, 1], True label Y: Unknown
■ Assume (𝛬, Y) is generated with p_w(𝛬, Y)
■ Learn w
■ Compute p_w(Y|𝛬) using p_w(𝛬, Y)

Data Programming: Takeaways
● The generative model is useful when there are ~10 labels per example
● In fact, we found that weighted voting (instead of plain majority voting) works pretty well
● The key part is to find reasonably good weak labels
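A sketch contrasting plain majority voting with weighted voting on the weak labels 1, 0, 1. The per-source accuracies are assumed to be known (for example, estimated on a small labeled set), and the log-odds weighting is one common choice rather than the tutorial's stated recipe.

```python
import math

def majority_vote(labels):
    """Binary majority vote; None means the source abstained. Ties go to 0."""
    votes = [l for l in labels if l is not None]
    return int(2 * sum(votes) > len(votes))

def weighted_vote(labels, accuracies):
    """Weight each vote by the log-odds of its source's (assumed) accuracy."""
    score = 0.0
    for label, acc in zip(labels, accuracies):
        if label is None:
            continue
        weight = math.log(acc / (1.0 - acc))
        score += weight if label == 1 else -weight
    return int(score > 0)

weak_labels = [1, 0, 1]
print(majority_vote(weak_labels))                   # two of three sources say 1
print(weighted_vote(weak_labels, [0.6, 0.9, 0.6]))  # the reliable source says 0
```

With two weak sources outvoted by one reliable source, the weighted vote disagrees with the majority vote, which is the case where weighting matters.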

Open Information Extraction
● Unsupervised Method: Machine learning without labels
● Input: Entities and sentences
● Output: Relation phrase
Example sentences:
○ Tesla is based in Palo Alto
○ Tesla is headquartered in Palo Alto
○ Tesla has its main campus in Palo Alto
Extracted triples:
○ (Tesla, Palo Alto, based in)
○ (Tesla, Palo Alto, headquartered in)
○ (Tesla, Palo Alto, has main campus in)

Open Information Extraction: ReVerb
[Fader et al. 2011]
1. From a sentence, take the longest phrase matching any of three patterns:
a. a verb (e.g., invented)
b. a verb followed immediately by a preposition (e.g., located in)
c. a verb followed by nouns, adjectives, or adverbs, ending in a preposition (e.g., has atomic weight of)
2. If that phrase appears too few times in the corpus, ignore it
3. Apply a binary classifier to compute a confidence score
a. Classification: Is the phrase a valid relation phrase?
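Step 1 can be sketched as a regular expression over part-of-speech tags. The input is assumed to be already tagged with a hypothetical coarse tag set (V, P, N, ADJ, ADV, anything else); steps 2 and 3 (frequency filter and confidence classifier) are not shown.

```python
import re

# Map the assumed coarse POS tags to one-character symbols so a regex can scan them.
TAG_TO_SYMBOL = {"V": "V", "P": "P", "N": "W", "ADJ": "W", "ADV": "W"}
# V | V P | V W* P: the three patterns from the slide in one expression.
RELATION_PATTERN = re.compile(r"V(?:W*P)?")

def longest_relation_phrase(tagged):
    """tagged: list of (word, tag). Returns the longest matching phrase or None."""
    symbols = "".join(TAG_TO_SYMBOL.get(tag, "X") for _, tag in tagged)
    matches = list(RELATION_PATTERN.finditer(symbols))
    if not matches:
        return None
    best = max(matches, key=lambda m: m.end() - m.start())
    return " ".join(word for word, _ in tagged[best.start():best.end()])

sentence = [("Hydrogen", "N"), ("has", "V"), ("atomic", "ADJ"),
            ("weight", "N"), ("of", "P"), ("1.008", "NUM")]
print(longest_relation_phrase(sentence))
```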

Open Information Extraction: Sequence Tagging
[Stanovsky et al. 2018]
● Use ML models to extract text spans for a relation and its arguments
○ Same methods as NER models (e.g., BiLSTM + CRF, Transformer + CRF, …)
● RnnOIE [Stanovsky et al. 2018]: BiLSTM tagger
Input: Tesla is located in Palo Alto
Output: Tesla [Arg 0] is located in [relation] Palo Alto [Arg 1].

Conclusion
● Key challenge: Get enough training examples to cover wide linguistic variations
● Semi-supervised methods: Come up with heuristics to get weak labels
● Unsupervised: Extract relation phrases
○ Drawback in industry: mapping the phrases to canonical relations requires another ML model
● Rule of thumb to choose methods
○ Lots of training examples + Complete relation dictionary: Supervised method
○ Few examples + Complete relation dictionary: Semi-supervised method
○ Very incomplete relation dictionary: Unsupervised method

Scalable Relationship
Extraction with
Limited Data
LinkedIn

Extending Knowledge Graph to New Datasets
● Task: Construct a knowledge graph (KG) for new data by leveraging existing KGs and data
● Examples
○ Domain adaptation: Build a KG from a domain-specific text corpus
■ Building a KG specialized for Healthcare industry
○ i18n: Build a knowledge graph for data in a new language
■ We have a KG for English users. Can we do the same thing for German users?

Challenges
● Domain specific annotation is very time consuming
○ Annotators need to have enough knowledge in the domain (or in the language)
○ Annotation tasks need to be clearly designed
○ If either is missing, data quality goes down significantly!
● Deep learning models require lots of data
○ Number of parameters in Transformer encoders: 100s of millions!

Solution: Transfer Learning
● Supervised Learning: Data 1 → Model 1; Data 2 → Model 2 (trained independently)
● Transfer Learning: Data 1 → Model 1; Data 2 → Model 2, with knowledge transfer between the models

Transfer Learning
● Cross-domain transfer learning
○ Train a model on general-domain data and then transfer knowledge
● Cross-lingual transfer learning
○ Train a model in English and then transfer knowledge to other language

Cross-domain Transfer Learning: Pretrained Model
● Train a deep learning model with very large text corpus (Wikipedia and so on.)
○ Training is done without any labels
○ Model learns general patterns in natural language
● Update the model parameters using a small number of labels
○ Since the model knows natural language well, it needs smaller number of labels

BERT (Bidirectional Transformer)
[Devlin et al. 2018]
● Train a Transformer encoder for two prediction tasks
● (1) Masked language modeling (predict masked-out words from the surrounding words on both sides)
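A sketch of how masked language modeling training examples can be built: hide some tokens and keep the originals as prediction targets. This is simplified; the BERT paper masks 15% of tokens and, of those, replaces 80% with [MASK], 10% with a random token, and leaves 10% unchanged. The function name `mask_tokens` is illustrative.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the model must recover this token from both sides
        else:
            masked.append(tok)
    return masked, targets

tokens = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```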

BERT (Bidirectional Transformer)
● Train a Transformer encoder for two prediction tasks
● (2) Next Sentence Prediction
■ Example:
■ [CLS] The man went to the store [SEP] He bought a gallon of milk [SEP]
■ Label: IsNext
■ [CLS] The man went to the store [SEP] Penguins cannot fly [SEP]
■ Label: NotNext

BERT: Fine-tuning
● Fine-tuning: Making incremental updates on model parameters for a given task
○ Sentence pair classification:
■ (Sentence 1, Sentence 2) -> True / False
○ Single sentence classification
■ (Sentence 1) -> True / False
○ Sequence Tagging (Entity Recognition)
■ (Sentence 1) -> Tags for each token

BERT: Results
● Pre-trained on ~3B words (Wikipedia, Books)
● After fine-tuning, outperformed other methods in 11 benchmark data sets
○ Fine-tuning works with ~3000 examples
○ Without any task-specific feature engineering
● For entity recognition, BERT works even without fine tuning

BERT: Implication
● Why does it work?
○ Context comes from both directions
○ Provides different ways to fine-tune the model
○ The model seems to learn syntactic structures [Hewitt and Manning 2019]
○ Language models seem to be correlated with multiple application tasks
● We can train sophisticated deep learning model with thousands of samples!

Pre-trained Deep Learning Models
[Sanh et al. 2019]

BERT: Limitations
● Slow serving
○ Distillation [Sanh et al. 2019]
○ Code optimization (ONNX Runtime)
● Handling 2+ sentences together?
○ XLNet [Yang et al. 2019]
● Nearest neighbor search is hard (modular scoring is impossible)
○ SentenceBERT [Reimers and Gurevych 2019]
● Handling very long text
○ Transformer-XL [Dai et al. 2019]

Cross-lingual Transfer Learning
● Assume: We have training data in English and have developed an ML model (NER or relation extraction)
● Can we use the data (or the model) for other languages?

Multilingual Encoder
● English: Tesla is located in Palo Alto. → Encoding → [0.1, 0.3, -0.1, …] → Decoding → (Tesla, Located_In, Palo Alto)
● German: Tesla befindet sich in Palo Alto. → Encoding → [0.1, 0.3, -0.1, …] → Decoding → (Tesla, Located_In, Palo Alto)
● What if the encoder gives the same feature values for sentences with the same meaning in different languages?
○ We can reuse the decoder (classifier)
○ The decoder can be trained with English training data
Multilingual Word Embedding
[Mikolov et al. 2013]
● Words have similar embedding if they mean the same thing
● How to get this? Apply “Translation matrix W”
○ W can be learned from parallel word dictionary X, Y
○ W: orthogonal matrix (Procrustes alignment)
X Y
Four Cuatro
Five Cinco
Horse Caballo
Dog Perro
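Learning an orthogonal translation matrix from a parallel dictionary has a closed-form solution via SVD (orthogonal Procrustes). The sketch below uses synthetic embeddings, with rows as word vectors, so that the recovered W can be checked against a known rotation; real X and Y would be the source and target embeddings of dictionary pairs such as (Four, Cuatro).

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing sum_i ||W x_i - y_i||^2 (rows of X, Y are word vectors)."""
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

rng = np.random.default_rng(0)
dim, n_pairs = 4, 50
W_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # a random orthogonal matrix
X = rng.normal(size=(n_pairs, dim))                    # synthetic source embeddings
Y = X @ W_true.T                                       # targets: y_i = W_true x_i
W = procrustes(X, Y)
print(np.allclose(W, W_true))
```

Keeping W orthogonal preserves distances and angles within the source embedding space, which is one reason this constraint is popular for embedding alignment.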

Unsupervised Multilingual Embedding
● What if you do not have a parallel word dictionary (X, Y)? We do these two steps
○ Learn translation matrix W (without word pairs (X, Y)) by adversarial learning
○ Construct (X, Y) by best-matching words between Wx and y

Unsupervised Multilingual Embedding
● Adversarial learning: Train two models that compete with each other
○ Wx: “Translated” embedding in a source language, y: Embedding in a target language
○ Discriminator D: Detect which language the example comes from
■ e.g., Wx is from the source language
○ Translation W: Fool the discriminator into detecting the wrong language

Multilingual Word Embedding: Our Experience
● Public multilingual embedding may have low coverage on the data set
● Unsupervised word alignment (adversarial learning) works
○ But if you have good parallel word dictionary, Procrustes alignment is easier and better
● Our team’s approach:
○ Step 1: Train embedding on the target data set for each language separately
■ Why? to get good coverage
○ Step 2 (Optional): Run adversarial learning to get parallel word dictionary
○ Step 3: Align by solving Y ~ WX

Multilingual Pretrained Model
[Lample and Conneau 2019]
● Pretraining models that cover multiple languages
○ Multilingual BERT
○ XLM: Masked language modeling on parallel sentences

Adversarial Learning: Sentence Classification
[Chen et al. 2018]
● Word embedding / pretrained models: General encoder
● Can we train multilingual, task-specific encoder?
● Yes. Adversarial learning again!
○ Train two models that compete with each other
■ Generator (Encoder): Generate features
■ Discriminator: Detect language from the encoder output
■ Generator has two purposes
● Help classification (sentiment classification)
● Fool the discriminator

Other Approach: Training Data Augmentation
[Huang et al. 2019]
● We have (text in source language, labels)
● Can we convert this to (text in target language, labels)?
○ Use translation to convert text
○ Use heuristics to convert labels

Cross-lingual Transfer Learning: Our Experience
● Game changers:
○ Cross-lingual encoder (word embedding or pretrained models)
○ Adding some (~500) hand-labelled examples for the target language
● Things that help incrementally
○ Data augmentation by Machine translation
○ Task-specific encoder by Adversarial learning

Conclusion
● Transfer learning: Leverage other datasets to transfer the knowledge
● Cross-domain transfer learning
○ Train a model in one domain and fine tune for the target domain
○ Pretrained deep NLP models are the state-of-the-art (BERT and its variants)
● Cross-lingual transfer learning
○ Multilingual encoder works
○ Use adversarial learning to achieve language invariance

Recap before the break
Task: Key technologies to discuss
Named Entity Recognition and Disambiguation
● Natural Language Understanding Models (LSTM, Transformer)
Relation Extraction
● Semi-supervised method (Distant supervision, Data programming)
● Unsupervised method (Open information extraction)
Scalable Relation Extraction with Limited Data
● Pretrained deep learning models (BERT and friends)
● Cross-lingual transfer learning (Adversarial learning, Multilingual encoder)

Scalable Graph
Refinement via
Multi-channel Data
Ingestion
Baoxu Shi
LinkedIn

Scalable Graph
Refinement via
Multi-channel Data
Ingestion
● Ingest large-scale, noisy crowd data
● Social network feedback loop

Definition of Graph Refinement
Graph Refinement is a task that aims to infer and add missing knowledge to the graph, or
identify erroneous information.
[Figure: before refinement, the graph shows Staff Researcher with Accounting, Data Mining, and Algorithms; after Graph Refinement it shows Staff Researcher with Data Mining, Algorithms, and Machine Learning (Supervised Learning, Reinforcement Learning, ...)]

How to Refine the Graph?
          Experts   Crowd           Machine Learning
Quality   High      Low to Medium   Medium to High
Volume    Low       Medium          High
Cost      High      Medium          Low to Medium

Scalable Graph Refinement
Scalable Graph Refinement aims at refining graph at scale via ingesting large scale data.
Crowd Machine Learning

Data for Scalable Graph Refinement
Structured Data Crowd Labels Social Network User Activity

Challenges for Data Ingestion
Ingest large-scale, noisy crowdsourced data
● Q1: How to leverage existing, large-scale structured data?
● Q2: How to leverage large volume of noisy social network text data?
● Q3: How to aggregate accurate labels from crowd workers?
Social network user feedback loop
● Q4: How to validate knowledge via social signals?
● Q5: How to grow social network via constructed knowledge graph?
● Q6: How to improve social network and knowledge graph jointly?

Ingest Structured Data via Entity Alignment
Q1: How to leverage existing, large-scale structured data?
Entity Alignment between knowledge graphs aims to find entities in two graphs that represent the
same real-world entity.
ITransE-SA (Zhu et al. 2017)

Feature-based entity alignment
RDF-AI (Scharffe et al. 2009)
The alignment score is determined by the average string similarity between a node pair and their neighbor pairs connected via the same edge type.
Example: Sim(1, 9) = (StrSim(1, 9) + StrSim(2, 11) + StrSim(3, 12)) / 3
Requires preprocessing (translation) and schema alignment.
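The averaged score can be sketched with any string similarity. Here `StrSim` is approximated with Python's `difflib.SequenceMatcher`, which is an assumption and not necessarily the metric RDF-AI uses; the entity names are made-up examples, and neighbor pairs are assumed to be already matched by edge type.

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    """String similarity in [0, 1]; a stand-in for StrSim on the slide."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def alignment_score(node_pair, neighbor_pairs):
    """Average similarity of the node pair and its same-edge-type neighbor pairs."""
    pairs = [node_pair] + list(neighbor_pairs)
    return sum(str_sim(a, b) for a, b in pairs) / len(pairs)

match = alignment_score(
    ("Tesla Inc.", "Tesla, Inc."),
    [("Palo Alto", "Palo Alto"), ("Electric Vehicle", "Electric Vehicles")],
)
mismatch = alignment_score(
    ("Tesla Inc.", "Apple Inc."),
    [("Palo Alto", "Cupertino"), ("Electric Vehicle", "Consumer Electronics")],
)
print(round(match, 2), round(mismatch, 2))
```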

Embedding-based Entity Alignment
ITransE-SA (Zhu et al. 2017) Requires a set of aligned entity seed and schema alignment.
True edge
False edge
Triple loss
Optimize embedding of each graph individually.

ITransE-SA (Zhu et al. 2017) Requires a set of aligned entity seed and schema alignment.
Alignment loss

(Trisedya et al, 2019) Only requires schema alignment.
Located in (transitive rule)
Located in Located in
Predicates are unified.

(Trisedya et al, 2019) Only requires schema alignment.
Address inconsistent attribute representations:
● f_a(“Barack Obama”) ~ f_a(“Barack Hussein Obama”)
● f_a(“50.9989”) ~ f_a(“50.998888889”)
Minimize the distance between structure embedding and attribute embedding
Align entities by computing their cosine similarity

Recap on Ingesting Structured Data
● Use RDF-AI as a proof of concept if all nodes have textual features in the same language.
● If the schemas are aligned and nodes have textual features, use Trisedya 2019.
● If the schemas are aligned and existing aligned entity pairs exist, use ITransE-SA.

Ingest Crowdsourced Labels via Answer Aggregation
Q3: How to aggregate accurate labels from crowd workers?
Answer aggregation for crowdsourcing is a task that finds the hidden ground truth from a set of
answers given by the crowd workers.
Example question: Does the following job description sentence require data science skills? Yes or No?
“Work with a team of high-performing analytics, data science professionals, and cross-functional
teams to identify business opportunities and develop algorithms and methodologies to address them.”
Crowd answers → Aggregation

Answer Ingestion
Answer Filtering
(Remove workers who failed trapping questions)
● Trapping question (ground truth)
○ Social Honeypot (Lee et al. 2010)
○ reCAPTCHA (Von Ahn et al. 2008)
Answer Aggregation
(Hung et al. 2013)
● Majority vote (Kuncheva et al. 2003)
● Weight answers by worker expertise & question difficulty
○ Trapping question-based (Khattak and Salleb-Aouissi 2011)
○ Supervised EM (binary label only) (Raykar et al. 2009)
● Snorkel (Ratner et al. 2017)
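Filtering by trapping questions followed by a majority vote can be sketched as below. The 0.8 accuracy threshold, the worker and question names, and the tuple layout of `answers` are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def filter_workers(answers, gold, min_accuracy=0.8):
    """Keep workers whose accuracy on trapping questions is at least min_accuracy.
    Workers who answered no trapping question are not trusted."""
    correct, total = Counter(), Counter()
    for worker, question, label in answers:
        if question in gold:
            total[worker] += 1
            correct[worker] += int(label == gold[question])
    return {w for w in total if correct[w] / total[w] >= min_accuracy}

def aggregate(answers, trusted, gold):
    """Majority vote per non-trapping question over trusted workers only."""
    votes = defaultdict(Counter)
    for worker, question, label in answers:
        if worker in trusted and question not in gold:
            votes[question][label] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in votes.items()}

gold = {"trap1": "Yes"}  # trapping question with known ground truth
answers = [
    ("w1", "trap1", "Yes"), ("w2", "trap1", "Yes"),
    ("w3", "trap1", "No"), ("w4", "trap1", "No"),
    ("w1", "q1", "Yes"), ("w2", "q1", "Yes"),
    ("w3", "q1", "No"), ("w4", "q1", "No"),
]
trusted = filter_workers(answers, gold)
print(aggregate(answers, trusted, gold))
```

Without filtering, question q1 would be a 2-2 tie; after removing the two workers who failed the trapping question, the vote is unanimous.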

Generative Answer Aggregation -- Snorkel
● Label matrix Λ: m instances × n workers; each entry is a human-provided label y, or ø if no judgement
● Three factor types:
○ Instance i has a label from worker j
○ Instance i has label y_i from worker j
○ Workers j and k gave the same label
● The model uses the concatenation of the three factor vectors, with a normalizing constant
● The true label is unknown: use contrastive divergence to solve for w without ground truth labels
● Output: probabilistic training labels

Recap on Ingesting CrowdSourced Labels
● Always use trapping questions to filter out low quality answers/workers.
● Use majority vote as the baseline to aggregate answers.
● To further improve the answer quality, use Snorkel to aggregate the labels.

Knowledge Validation via Social Signals
Q4: How to validate knowledge via social signals?
Knowledge validation via social signals is a task that aims at validating factual knowledge graph
information by collecting signals from end-users directly.
Name             Quality       Cost          Scale          Setting
Crowd Workers    Mid           Mid to High   Small to Mid   Usually single task
Social Signals   Mid to High   Low           Large          Multi-task
Examples: LinkedIn’s skill validation, Google Map’s Venue Questions

Social Signal Knowledge Validation in Google Map
(Kobren et al. 2019)
Use user votes to construct a knowledge base for locations.
● Social signal collection for each (location, attribute) pair
○ Inputs: location l, attribute a, count of yes votes, count of votes
○ Derived: yes vote rate, expected yes rate, certainty of expected yes rate
● Model the member’s voting behavior as a beta distribution
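The "expected yes rate" and its certainty can be illustrated with the textbook Beta-Binomial update. This is a simplified stand-in, not the model from Kobren et al.: the uniform Beta(1, 1) prior and the use of posterior variance as the uncertainty measure are assumptions.

```python
def beta_posterior(yes_votes, total_votes, prior_alpha=1.0, prior_beta=1.0):
    """Posterior mean and variance of the yes rate under a Beta prior."""
    a = prior_alpha + yes_votes
    b = prior_beta + (total_votes - yes_votes)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

few_mean, few_var = beta_posterior(9, 10)      # 9 yes votes out of 10
many_mean, many_var = beta_posterior(90, 100)  # same rate, ten times the votes
print(few_mean, few_var)
print(many_mean, many_var)
```

Both cases have a 90% raw yes rate, but the posterior variance shrinks as votes accumulate, which is what lets a threshold on certainty control the false positive rate.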

Social Signal Knowledge Validation in Google Map
(Kobren et al. 2019)
● Inputs: location features (natural language text); aggregated votes on other attributes (raw count, majority vote, etc.)
● Intermediate: generated location embedding w.r.t. the attribute
● Output: expected yes rate; certainty of expected yes rate (determines false positive rate)

LinkedIn’s Social Skill Validation
(Yan et al. 2019)
Use member actions to learn the skill expertise of our members.
LinkedIn Skill Endorsement Product
Yes/yes question: Users can act without judgement.
No anonymity: Users use it as a social gesture.

(Yan et al. 2019)
Compare a connection’s skills within a certain category
Normalizing the score given by the
user to remedy social gesture.

(Yan et al. 2019)
Ask viewer to rank the skill level of candidates
candidate skill
viewer
Uses a ML model to provide candidates.

(Yan et al. 2019)
Multi-task Model (member, skill, expertise score)

Recap on knowledge validation via social signals
● Social signals still require answer aggregation.
● The design of social signal collection is crucial for data quality.

Knowledge Graph guided Social Link Prediction
Q5: How to grow social network via constructed knowledge graph?
Given a social network and a knowledge graph, Knowledge graph guided social link prediction
aims to predict member - member connections using the knowledge graph.
[Figure: Social Network + Knowledge Graph → Knowledge Graph guided Social Link Prediction]

Matrix Factorization for Social Link Prediction
(Menon and Elkan 2011)
● Input: n × n member-member matrix (entry (i, j): are i and j connected?)
● Model terms: node embedding, node bias, regularization
● Purely based on topological information.
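A sketch of the three ingredients named on this slide (node embedding, node bias, regularization) as a logistic matrix factorization trained by gradient descent. This is a generic formulation in the spirit of the slide, not the exact model of Menon and Elkan; the toy graph (two 4-member cliques) and all hyperparameters are made up.

```python
import numpy as np

def train_mf(pairs, labels, n_nodes, dim=2, lr=1.0, reg=0.01, epochs=500, seed=0):
    """Fit P(i-j connected) = sigmoid(u_i . u_j + b_i + b_j) by gradient descent."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(n_nodes, dim))  # node embeddings
    b = np.zeros(n_nodes)                      # node biases
    i, j = np.array(pairs).T
    y = np.array(labels, dtype=float)
    losses = []
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(np.sum(U[i] * U[j], axis=1) + b[i] + b[j])))
        losses.append(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))
        err = (p - y) / len(y)
        grad_U = reg * U                       # L2 regularization on embeddings
        np.add.at(grad_U, i, err[:, None] * U[j])
        np.add.at(grad_U, j, err[:, None] * U[i])
        grad_b = np.zeros(n_nodes)
        np.add.at(grad_b, i, err)
        np.add.at(grad_b, j, err)
        U -= lr * grad_U
        b -= lr * grad_b
    return U, b, losses

def predict(U, b, i, j):
    return 1.0 / (1.0 + np.exp(-(U[i] @ U[j] + b[i] + b[j])))

# Toy network: members 0-3 and 4-7 form two cliques with no links between them.
pairs = [(i, j) for i in range(8) for j in range(i + 1, 8)]
labels = [int((i < 4) == (j < 4)) for i, j in pairs]
U, b, losses = train_mf(pairs, labels, n_nodes=8)
print(predict(U, b, 0, 1), predict(U, b, 0, 4))
```

The same scoring function can rank member pairs that were not observed during training, which is how such a model would suggest new connections.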

Social Link Prediction using Member Attributes
(Zhang et al. 2018)
Reconstructs weighted
average of neighbors’
attributes
Predicts skip-gram neighbors’
structural embedding.
f(x)
v
one-hot attribute vector
Learns attribute embeddings implicitly.
How Skip-gram works?

Social Link Prediction using Member Attributes
(Meng et al, 2019)
attribute graph
adjacency matrix
one-hot attribute vector
one-hot node vector
Reconstructed
member-member graph
Reconstructed
member-attribute graph
Attribute embeddings are a function of members and hence are useful for social link prediction only.

Joint Representation Learning on Social and
Knowledge Graph
Q6: How to improve social network and knowledge graph jointly?
Given a social network and a knowledge graph, we want to learn node representations to refine
both graphs jointly.
An example of social network + knowledge graph.

Ambiguous Social Connections: Person-Person connections are ambiguous.
An illustration of LinkedIn’s Heterogeneous Social Network
Colleague
candidate-recruiter
Knowledge Graph
(Shi et al, 2019)

Corrupted Higher-order Proximity: Cannot learn meaningful entity embeddings.
An illustration of LinkedIn’s Heterogeneous Social Network
candidate-recruiter
Not similar because
candidate-recruiter
relationship does not
indicate occupation
similarity.
Knowledge Graph
(Shi et al, 2019)

Knowledge Graph
The learned embeddings can be used to predict connections between two arbitrary types.
(Shi et al, 2019)

Methods to refine a graph in a scalable way
● Use graph alignment to ingest large volume of knowledge from external knowledge graphs
● Use Snorkel to aggregate and denoise crowdsourced labeled data for graph refinement.
● Design social feedback loops to collect social signals, aggregate them and refine the graph.
● Use representation learning to refine knowledge graph and social network.

Automatic
Taxonomy Expansion
Baoxu Shi,
LinkedIn

Taxonomy Examples
“Taxonomy is the practice and science of classification of things or
concepts, including the principles that underlie such classification”
Ciaramita, Massimiliano, et al. "Hierarchical preferences in a broad-coverage lexical taxonomy." Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 27. No. 27. 2005.
Wheeler, David L., et al. "Database resources of the national center for biotechnology information." Nucleic acids research 36.suppl_1 (2007): D13-D21.
WordNet NCBI Taxonomy Common Tree

Taxonomy Examples
“Taxonomy is the practice and science of classification of things or
concepts, including the principles that underlie such classification”
SOC, URL: https://www.bls.gov/soc/
Library card catalog picture credit: https://www.smithsonianmag.com/smart-news/card-catalog-dead-180956823/
US Bureau of Labor Statistics - SOC Library Card Catalog

Is Taxonomy a Knowledge Graph?
Machine Learning
Data Mining
Algorithms
Staff Researcher
Knowledge Graph describes relationships
between real-world entities.
(No hierarchical information)
Taxonomy describes the classification of
real-world entities and concepts.
(Has hierarchical information)
Analytics Software Development Computer Science
Data Mining Machine Learning Algorithms
Specific<-General
Scikit-Learn

Many taxonomies are constructed manually
Name | Creator/Organizer | Domain | Scale | Method
O*NET | US Department of Labor | Occupation | 1,167 | Manual
SOC | US Bureau of Labor Statistics | Occupation | 867 | Manual
WordNet | Princeton University | Noun & Verbs | 155,327 | Manual
Global WordNet | VU University Amsterdam | various | various | Transfer & merge
NCBI Taxonomy | National Center for Biotechnology Information | Biology | 657,846 | Manual
LCC | Library of Congress | Library | 227 | Manual
Updating the O*NET-SOC Taxonomy URL: https://www.onetcenter.org/dl_files/UpdatingTaxonomy_Summary.pdf
Federal Register Notice, URL: https://www.bls.gov/soc/2018/soc2018final.pdf
Miller, George A. WordNet: An electronic lexical database. MIT press, 1998.
Fellbaum, Christiane. "A semantic network of english: the mother of all WordNets." EuroWordNet: A multilingual database with lexical semantic networks. Springer, Dordrecht, 1998. 137-148.
Bo Svensén. 2009. A Handbook of Lexicography. The Theory and Practice of Dictionary-Making. Cambridge University Press.
The NCBI Taxonomy database, URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245000/
Library of Congress Classification Outline, URL: http://www.loc.gov/aba/cataloging/classification/lcco/
The majority of these taxonomies are constructed manually by domain experts.

Updating taxonomy manually is time-consuming
On average, the update rate of O*NET is 1.6 occupations per day.
O*NET Occupation Update Summary, URL: https://www.onetcenter.org/dataUpdates.html

Automatic Taxonomy Construction (ATC)
Problem Definition: Given a text corpus and/or auxiliary data, construct a directed taxonomy
graph G=(V,E), where V is a set of taxonomy entities, and E is a set of directed edges (u -> v).
Text corpus and/or auxiliary data -> ATC model -> Induced taxonomy
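A minimal sketch of the output structure in the problem definition: a directed graph G=(V,E) kept as adjacency maps. The class name `Taxonomy`, the example entities, and the convention that an edge (u -> v) points from the more general entity to the more specific one are assumptions for illustration only; the problem definition does not fix the edge direction.

```python
from collections import defaultdict


class Taxonomy:
    """Directed taxonomy graph G = (V, E); edges u -> v are assumed parent -> child."""

    def __init__(self):
        self.entities = set()             # V
        self.children = defaultdict(set)  # u -> {v, ...}
        self.parents = defaultdict(set)   # v -> {u, ...}

    def ancestors(self, v):
        """Return every entity reachable by following edges backwards from v."""
        seen, stack = set(), [v]
        while stack:
            node = stack.pop()
            for parent in self.parents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def add_edge(self, u, v):
        """Add u -> v, rejecting edges that would turn the taxonomy into a cycle."""
        if u == v or v in self.ancestors(u):
            raise ValueError(f"edge {u!r} -> {v!r} would create a cycle")
        self.entities.update((u, v))
        self.children[u].add(v)
        self.parents[v].add(u)


# Hypothetical edges, chosen only to illustrate the structure.
taxonomy = Taxonomy()
taxonomy.add_edge("Computer Science", "Machine Learning")
taxonomy.add_edge("Machine Learning", "Scikit-Learn")
print(sorted(taxonomy.ancestors("Scikit-Learn")))
```

Rejecting cycles at insertion time is one possible design choice; an ATC model could equally emit a noisy edge list first and clean it up afterwards.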

Challenges of Automatic Taxonomy Construction
● Q1: How to ensure high precision of the constructed taxonomy?
● Q2: How to ensure high recall of the constructed taxonomy?
● Q3: How to reduce the need for large volumes of in-domain corpora?

Hearst Patterns model
Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.
Text Corpus
Patterns generated manually
Pattern
Matching
Hypernym(wound, injury)
Hypernym(broken bone, injury)
Hypernym(treasury, civic building)
Hypernym(England, common-law country)
...
Extracted X-isA-Y relationships
(Hearst, 1992&1998)
Q1: How to ensure high precision of the constructed taxonomy?
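A minimal sketch of the pattern-matching step, using two well-known Hearst patterns ("Y such as X1, X2 and X3" and "X1, X2 or other Y"). It assumes noun phrases were already chunked and joined with underscores (e.g. `broken_bones`); the regular expressions, the function name `extract_hypernyms`, and the example sentences are illustrative assumptions, not the original system. Plurals are not normalized, so the output reads (wounds, injuries) rather than Hypernym(wound, injury).

```python
import re

TERM = r"\w+"  # one pre-chunked noun phrase, e.g. "broken_bones"
TERM_LIST = rf"{TERM}(?:, {TERM})*(?:,? (?:and|or) {TERM})?"

HEARST_PATTERNS = [
    # "Y such as X1, X2 and X3"
    re.compile(rf"\b(?P<hyper>{TERM}),? such as (?P<hypos>{TERM_LIST})"),
    # "X1, X2 or other Y" / "X1, X2 and other Y"
    re.compile(rf"\b(?P<hypos>{TERM}(?:, {TERM})*),? (?:and|or) other (?P<hyper>{TERM})"),
]


def extract_hypernyms(sentence):
    """Return (hyponym, hypernym) pairs matched by any pattern in the sentence."""
    pairs = []
    text = sentence.lower()
    for pattern in HEARST_PATTERNS:
        for match in pattern.finditer(text):
            hypernym = match.group("hyper").replace("_", " ")
            # Split the matched list on commas and on "and" / "or".
            for hyponym in re.split(r",? (?:and|or) |, ", match.group("hypos")):
                pairs.append((hyponym.replace("_", " "), hypernym))
    return pairs


for sentence in [
    "Bruises, wounds, broken_bones or other injuries are common.",
    "Common_law_countries such as England and Canada follow precedent.",
]:
    print(extract_hypernyms(sentence))
```

A production system would replace the underscore convention with a real noun phrase chunker and add lemmatization.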

Discover Hearst Patterns
(Hearst, 1992&1998)
Q1: How to ensure high precision of the constructed taxonomy?
Four steps to automatically create Hearst patterns (Snow et al., 2004):
1. Collect noun pairs from corpora, identifying hypernym pairs using WordNet.
2. For each noun pair, collect sentences in which both nouns occur.
3. Parse the sentences and extract patterns from the parse tree.
4. Train a hypernym classifier based on these features.
Four steps to manually create Hearst patterns:
1. Decide on a lexical relation (e.g. is-A).
2. Gather a list of term pairs for which the relation is known to hold
(e.g. [Java, Programming Language]).
3. Collect sentences where both terms from a term pair appear
(e.g. "Java is a general-purpose programming language").
4. Find common patterns that indicate the relation of interest
(e.g. X is a (adj.) Y).
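Steps 2 to 4 of the manual procedure can be sketched as a simple counting exercise: for each seed pair, find sentences that contain both terms and tally the text that connects them. The function name `candidate_patterns`, the seed pairs, and the sentences are hypothetical. The sketch matches surface strings only, so it does not generalize "general-purpose" into an "(adj.)" slot; that would need part-of-speech tags or a parse tree.

```python
from collections import Counter


def candidate_patterns(sentences, seed_pairs):
    """Count the connecting text between X and Y for every seed pair found in a sentence."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        for hyponym, hypernym in seed_pairs:
            x, y = hyponym.lower(), hypernym.lower()
            start = text.find(x)
            end = text.find(y, start + len(x)) if start != -1 else -1
            if end == -1:
                continue  # both terms must occur, with X before Y
            between = text[start + len(x):end].strip()
            counts[f"X {between} Y"] += 1
    return counts


seed_pairs = [
    ("Java", "programming language"),
    ("Python", "programming language"),
    ("Rust", "programming language"),
    ("LinkedIn", "social network"),
]
sentences = [
    "Java is a general-purpose programming language.",
    "Python is a programming language.",
    "LinkedIn is a social network.",
    "Rust and other programming languages are compiled.",
]
patterns = candidate_patterns(sentences, seed_pairs)
print(patterns.most_common())
```

Patterns that recur across many seed pairs are the candidates a human would then review and keep.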

Limitation of Pattern-based models
● Rule-based methods have low recall.
● Performance relies on the completeness of the patterns.
● Creating patterns is time-consuming.
● Can only extract relationships between co-occurring entities.
Distributional models can identify relationships between unobserved entity pairs.

Distributional Methods
Q2: How to ensure high recall of the constructed taxonomy?
(Cederberg and Widdows, 2003)
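Score a pair (x, y) by cosine(h_x, h_y):
1. Build a co-occurrence count matrix: all phrases in the corpus (rows) × top-1000 non-stop words (columns).
2. Apply singular value decomposition (U Σ V*) to the matrix.
3. Take the reduced rows h_x and h_y as the word vectors of x and y.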
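The scoring on this slide can be sketched with NumPy: factor a phrase-by-context count matrix with SVD, keep a few dimensions, and compare phrases by cosine similarity. The toy counts, the phrase and context labels, the rank, and the choice to scale the reduced rows by the singular values are all assumptions for illustration.

```python
import numpy as np


def lsa_vectors(cooccurrence, rank):
    """Reduce each row of a phrase-by-context count matrix to `rank` dimensions via SVD."""
    u, s, _ = np.linalg.svd(cooccurrence, full_matrices=False)
    return u[:, :rank] * s[:rank]  # row i is the word vector h of phrase i


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy counts: rows are phrases, columns are context words
# ("code", "compile", "policy", "premium").
phrases = ["java", "python", "insurance"]
counts = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [5.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
])
vectors = lsa_vectors(counts, rank=2)
h = dict(zip(phrases, vectors))
print(cosine(h["java"], h["python"]), cosine(h["java"], h["insurance"]))
```

Phrases that share contexts end up close in the reduced space even if they never co-occur with each other, which is what lets distributional scores cover unobserved pairs.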

Hearst + Distributional Methods
Cederberg, Scott, and Dominic Widdows. "Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction." HLT-NAACL, 2003.
Roller, Stephen, Douwe Kiela, and Maximilian Nickel. "Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora." ACL. 2018.
Score hypernym(x, y) by
● P(x, y) = count of extracted hypernym(x, y) / total extractions
● Positive Pointwise Mutual Information, ppmi(x, y) (Roller et al., 2018)
● spmi(x, y) = $u_x^\top \Sigma_r v_y$: apply singular value decomposition (U Σ V*) to the score matrix (all phrases in the corpus × all phrases in the corpus), keep the top r singular values (Σ_r), and take row u_x of U and row v_y of V.
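A minimal NumPy sketch of these scores, assuming a count matrix whose entry (x, y) is the number of times hypernym(x, y) was extracted by the patterns. The function names, the toy counts, and the term list are hypothetical; the formulas follow the standard PPMI definition and a truncated SVD of the PPMI matrix.

```python
import numpy as np


def ppmi_matrix(counts):
    """PPMI of an extraction-count matrix: rows are hyponyms x, columns are hypernyms y."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0  # pairs that were never extracted score 0
    return np.maximum(pmi, 0.0)   # keep only positive associations


def spmi_matrix(ppmi, rank):
    """Entry (x, y) is u_x^T Sigma_r v_y from the SVD of the PPMI matrix, truncated to `rank`."""
    u, s, vt = np.linalg.svd(ppmi, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Toy extraction counts; the same term list indexes rows and columns.
terms = ["poodle", "terrier", "dog", "animal"]
counts = np.array([
    [0.0, 0.0, 3.0, 0.0],  # poodle  -> dog (3)
    [0.0, 0.0, 2.0, 1.0],  # terrier -> dog (2), animal (1)
    [0.0, 0.0, 0.0, 5.0],  # dog     -> animal (5)
    [0.0, 0.0, 0.0, 0.0],  # animal has no extracted hypernym
])
ppmi = ppmi_matrix(counts)
spmi = spmi_matrix(ppmi, rank=2)
print(np.round(ppmi, 3))
print(np.round(spmi, 3))
```

The rank r is a tuning parameter; a rank equal to the matrix size simply reproduces the PPMI scores.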

Hearst + Distributional Methods
LSTM + Pattern + Word Embedding (Shwartz et al., 2016)
+ TF-IDF features + Reinforcement Learning (Mao et al., 2018)

Limitation of Distributional methods
● Require a large amount of in-domain text corpora.
● Can only extract relationships between entities that exist in the text corpora.
● Lexical memorization (memorizing that certain words correlate with certain labels) (Levy et al., 2015)
Instead of requiring in-domain text corpora, one can induce a taxonomy from an existing taxonomy.

Keyword + General Purpose Taxonomy
Q3: How to reduce the need for large volumes of in-domain text corpora?
(Liu et al., 2011)
Domain-specific keywords (e.g. "Indiana cheap car insurance")
-> Parent concept probabilities (from a general-purpose taxonomy) + search keywords' context
-> Bag-of-words vector
-> Hierarchical clustering
-> Domain-specific taxonomy

Refine Taxonomy via Hyperbolic Embeddings
Q3: How to reduce the need for large volumes of in-domain text corpora?
(Nickel and Kiela, 2017; Le et al., 2019)
Existing taxonomy graph in the format of (u, v) edges -> Hyperbolic embedding model -> Learned taxonomy
Hyperbolic embedding models infer taxonomy from existing graphs instead of text corpora.
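For intuition, here is a minimal sketch of the distance function in the Poincaré ball model, which is the hyperbolic distance used by Nickel and Kiela (2017). Only the distance is shown; training the embeddings from (u, v) edges is out of scope. The function name and the example points are assumptions for illustration.

```python
import math


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball (Poincare ball model)."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Two pairs with the same angle between them, at different radii.
near_origin = poincare_distance((0.1, 0.0), (0.0, 0.1))
near_boundary = poincare_distance((0.9, 0.0), (0.0, 0.9))
print(near_origin, near_boundary)
```

Distances grow rapidly toward the boundary of the ball, so general entities can sit near the origin while the many specific entities spread out near the boundary, which suits tree-like taxonomies.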

Steps to create a taxonomy
● Collect a large amount of in-domain data
● Use Hearst rules to extract a high-precision taxonomy (Hearst 1998, Snow 2004)
● Use distributional methods to improve recall (Roller 2018, Mao 2018)
● Use hyperbolic embeddings to refine the taxonomy structure (Le 2019)
● Extend your taxonomy using new entities / keywords and a general-purpose taxonomy (Liu 2011)
