Constructing Knowledge Graph for Social Networks in a Deep and Holistic Way
1. Qi He, Director of Engineering, qhe@linkedin.com
Baoxu Shi, Senior Engineer, dashi@linkedin.com
Jaewon Yang, Senior Staff Engineer, jeyang@linkedin.com
2. Tutorial’s Agenda
09:00 Introduction
09:05 Overview of LinkedIn’s Knowledge Graph and Applications
09:15 Named Entity Recognition
09:35 Populate Relationships between Social Network Entities
10:00 Scalable Relationship Extraction with Limited Data
10:30 Coffee Break
11:00 Scalable Graph Refinement via Multi-channel Data Ingestion
11:45 Automated Taxonomy Expansion
12:00 Conclusion and Q&A
Part 1: Construct high-quality knowledge graph for social networks
Part 2: Weakly-supervised, scalable social network knowledge graph construction
5. Overview
This tutorial will be successful if:
- You learn the problem statement of constructing a knowledge graph for social networks and its technical challenges
- You learn the opportunities for tackling the technical challenges of the problem
- You learn the state of the art and our experiences with the solutions
Preliminary knowledge
Knowledge Graph Construction is a process of creating structured data: 1) a canonical
representation for every entity, 2) relationships between entities.
Methods: 1) Human curation, 2) AI modeling (ML/NLP), 3) Data ingestion
6. Problem Statement
- Knowledge Graph Construction for Social Networks
1) input data for each member in the social network is noisy, implicit, and multilingual
2) KG and the social network influence each other via multiple organic feedback loops
7. Opportunity
A deep and holistic way is the best strategy to tackle the technical challenges.
Deep: develop deep NLP models to deeply understand the input data
- noisy and implicit: train high precision language understanding models by adding small clean
data to the noisy data
- multilingual: expand a single-language KG to multilingual KGs by applying deep transfer
learning models
Holistic: grow social network and KG together via their model interactions
- refine KG by learning deep embeddings from the social network
- grow social network by learning deep embeddings from KG
- launch new products to get explicit feedback on KG from social network members
8. The Three Technical Questions inside Opportunity
Q1: How can we recognize existing entities and expand new entities from noisy and multilingual text?
- The encoder and decoder NLU approach
- Pattern + deep learning based auto-taxonomy expansion
Q2: How can we construct entity relationships with limited input data?
- Unsupervised learning
- Semi-supervised learning
- Pre-trained deep learning models (BERT family)
- Cross-lingual transfer learning (Adversarial learning, Multilingual encoder)
Q3: How can we refine KG by ingesting data from social network?
- Embedding-based entity alignment between social network and KG
- Joint representation learning on social network and KG
- Probabilistic member feedback (label/answer) aggregation from social network
15. To power all LinkedIn products
Knowledge
Graph
Jobs Search
Job Recommendation
Recruiter Search
Talent Insights
Jobs SEO
Profile Page
People Search
SEO
New-member onboarding
PYMK
Premium
ProFinder
EGR
Notifications
GSO
Ads
Pages
Sales Navigator
LSI
Merlin
Courses
17. Enable positive flywheel effect in LinkedIn ecosystem
[Figure: flywheel of Input signals -> Graph construction -> Deliver value -> Engagement]
20. Knowledge Graph
● Set of triples (Source entity, Relation, Target entity)
○ (Bill Gates, Founder, Microsoft)
○ (Microsoft, Located_In, Redmond)
○ ...
● Canonical representation of relations among entities
● Examples:
○ Google Knowledge Graph
○ Microsoft Satori
○ Freebase
○ LinkedIn Economic Graph
[Figure: Bill Gates -Founder-> Microsoft -Located_In-> Redmond]
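As a minimal sketch of the triple representation above, the two example triples can be stored as tuples and indexed by source entity. The helper names `build_index` and `targets` are hypothetical and only illustrate the data structure; they are not part of any LinkedIn system.

```python
from collections import defaultdict

# Triples copied from the slide: (source entity, relation, target entity).
TRIPLES = [
    ("Bill Gates", "Founder", "Microsoft"),
    ("Microsoft", "Located_In", "Redmond"),
]

def build_index(triples):
    """Index triples by source entity: source -> list of (relation, target)."""
    index = defaultdict(list)
    for source, relation, target in triples:
        index[source].append((relation, target))
    return index

def targets(index, source, relation):
    """Return every target linked to `source` by `relation`."""
    return [t for r, t in index.get(source, []) if r == relation]

index = build_index(TRIPLES)
print(targets(index, "Microsoft", "Located_In"))
```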
21. Knowledge Graph Construction
Tesla is an Electric vehicle company based in Palo Alto.
[Figure: Tesla Inc. -Specialized_In-> Electric Vehicle; Tesla Inc. -Located_In-> Palo Alto, CA]
22. Task 1: Named Entity Recognition
Tesla is an Electric vehicle company located in Palo Alto.
[Figure: recognized entities: Tesla Inc.; Electric Vehicle; Palo Alto, CA]
23. Task 2: Relation Extraction [Next Section]
Tesla is an Electric vehicle company located in Palo Alto.
[Figure: Tesla Inc. -Specialized_In-> Electric Vehicle; Tesla Inc. -Located_In-> Palo Alto, CA]
24. Named Entity Recognition: Challenges
1. Name Variation
a. Tesla, Tesla Motors, Tesla Inc. … -> Tesla Inc.
2. Ambiguity
a. “Apple” -> Apple Corps VS Apple Inc.
3. Incomplete Entity Dictionary [Later Section on Taxonomy Creation]
4. Multiple Languages [Later Section]
25. Web-Scale Entity Recognition
[Cucerzan 2007]
Preprocessing -> Entity Recognition (Entity Tagging) -> Entity Disambiguation
[Figure: input sentence "Tesla is an Electric vehicle company located in Palo Alto."; labels shown: Tesla, Forecast, Tesla Inc.]
26. Two Step Approach
Entity Recognition -> Entity Disambiguation
[Figure: input sentence "Tesla is an Electric vehicle company located in Palo Alto."; labels shown: Tesla, Forecast, Tesla Inc.]
27. Entity Recognition
1. Encoder: Generating features
2. Decoder: Doing classification
a. Classification results:
i. B-Com, B-Ind, B-Loc: Beginning of company, industry…
ii. I-Com, I-Ind, I-Loc: Inside of company, industry, …
iii. O: Outside (nothing important)
Encoding: "Tesla is an Electric vehicle company located in Palo Alto." -> [0.1, 0.3, -0.1, …]
Decoding: Tesla [B-Com] is an Electric [B-Ind] vehicle [I-Ind] company located in Palo [B-Loc] Alto [I-Loc].
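To make the B/I/O scheme concrete, here is a small sketch that turns per-token tags back into entity spans. The tags are the ones shown on the slide; the function name `bio_to_spans` is a hypothetical helper, and a production decoder would also carry character offsets.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, text) spans from per-token BIO tags."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes any open span and starts a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = "Tesla is an Electric vehicle company located in Palo Alto".split()
tags = ["B-Com", "O", "O", "B-Ind", "I-Ind", "O", "O", "O", "B-Loc", "I-Loc"]
spans = bio_to_spans(tokens, tags)
print(spans)
```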
28. Encoder (Feature Generation)
● Traditional features
○ Bag-of-words: TF-IDF, BM25, ...
● Recent methods: Deep learning embedding
○ Model learns how to generate feature vector (embedding)
○ Word-level embedding
○ Sequence-of-word embedding
○ Sequence-of-character embedding
● IMPORTANT: These encoders are used in later sections as well!
29. Encoder: Word-level feature
● Word embedding: Learn a latent vector (embedding) w_v for each word v
● For each word in the input text, use its latent vector as an input feature
● How to learn embedding?
○ Based on which words co-occur under the same context
○ context: k-gram window
The clouds are in the sky
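The "k-gram window" notion of context can be sketched as a co-occurrence counter over the example sentence. This only illustrates the statistics that embedding methods start from; it is not a training procedure, and the window size of 2 is an arbitrary assumption.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context_word) pairs where the context word is at most `window` positions away."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "the clouds are in the sky".split()
counts = cooccurrence_counts(tokens, window=2)
print(counts[("the", "are")], counts[("sky", "the")])
```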
30. Famous word embeddings
[Mikolov et al. 2013] [Pennington et al. 2014]
● GloVe: embeddings approximate the number of co-occurrences
● Word2vec: embeddings approximate the probability of co-occurring
● How to choose?
○ GloVe is a little bit simpler (e.g., no negative examples), but they are very similar
○ If you use public embedding, pick one with the best coverage
○ If you train on your own, either one would work
● Limitation of word embedding:
○ Does not consider ordering of words
○ Does not generalize to new words
31. Encoder: Sequence of Words
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
The clouds are in the sky
I grew up in France … I speak fluent French
RNN
LSTM
32. Encoder: Sequence of words
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
34. Encoder: Attention-based
Bahdanau et al. 2015
● LSTM: Hidden state (encoding) depends mainly on the previous token
● Attention: Hidden state is computed using all tokens
○ Attention: Weights for each token
○ Worked really well in Machine translation
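A minimal sketch of "hidden state is computed using all tokens": each token gets a weight, and the output is the weighted sum of token vectors. For brevity this uses dot-product scoring, which is swapped in here as an assumption; Bahdanau et al. use a learned additive score instead. All vectors are made-up toy values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention: weights = softmax(query . key_i), output = sum_i weights_i * value_i."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy 2-d vectors for three tokens (hypothetical values).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = attend([1.0, 0.0], keys, values)
print(weights, output)
```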
37. Encoder: Transformer
● Multi-head attention?
○ Use multiple keys, values, queries
○ Helps understanding different relations among tokens
■ I am taller than Jim (comparison)
■ He is none other than Bill Gates (distinguishing)
38. Classification: CRF
● Labels must form a valid sequence
○ Beginning must happen before inside
○ If each token is predicted independently, nonsense may occur, e.g.: Electric [I-Ind] vehicle [B-Ind]
● Use CRF to predict the entire sequence
http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/
Tesla is Electric vehicle company
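The sketch below shows the decoding side of this idea: a Viterbi search over the whole sequence with transition scores that forbid an I- tag unless it follows a matching B- or I- tag. The emission and transition scores are hand-set, hypothetical numbers; in a real CRF both would be learned.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-Ind", "I-Ind"]

def transition(prev, cur):
    """Hand-set transition score: I-X may only follow B-X or I-X (hypothetical bonus of 0.5)."""
    if cur.startswith("I-") and prev not in ("B-" + cur[2:], "I-" + cur[2:]):
        return NEG_INF
    return 0.5 if cur.startswith("I-") else 0.0

def viterbi(emissions):
    """emissions: one {tag: score} dict per token. Returns the best-scoring valid tag sequence."""
    # A sequence may not start with an I- tag.
    best = {t: (NEG_INF if t.startswith("I-") else emissions[0][t]) for t in TAGS}
    back = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            score, prev = max((best[p] + transition(p, cur), p) for p in TAGS)
            new_best[cur] = score + scores[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

# Scores chosen so that per-token argmax yields the invalid sequence from the slide.
emissions = [
    {"O": 0.1, "B-Ind": 1.0, "I-Ind": 1.2},  # "Electric"
    {"O": 0.1, "B-Ind": 1.1, "I-Ind": 1.0},  # "vehicle"
]
independent = [max(e, key=e.get) for e in emissions]
decoded = viterbi(emissions)
print(independent, decoded)
```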
39. Putting it All Together: LSTM + CRF
[Lample et al. 2016]
● Variations
○ Word-LSTM + Char-LSTM + CRF [Akbik et al. 2018]
○ Pretrained Transformer (BERT) + CRF
40. Named Entity Recognition: Our Experiences
● Having a good encoder (good features) is most important
○ Deep learning models are very powerful, but getting enough training data can be tricky
■ Later today, we discuss how to address this by pretraining
○ If available, domain-specific features are still very useful
■ e.g., if we have a list of famous companies, it can be used to generate features
● CRF is easy to add and boosts performance incrementally
41. Entity Disambiguation
● Problem Definition
○ Input: Text span
○ Output: Entity ID
[Figure: input sentence "Tesla is an Electric vehicle company located in Palo Alto."; labels shown: Tesla, Forecast, Tesla Inc.]
42. Entity Disambiguation
● Feature generation (Encoder)
○ Encoder for text span: Text encoders (LSTM, Transformer, …)
○ Encoder for entity-related features
■ Text features (entity description): Text encoders
■ Graph features: Graph encoders [Later]
■ Numerical features (frequency statistics): No encoder needed
● Making prediction (Decoder)
○ Multiclass classification
43. Conclusion
● Two Problems:
○ Entity Recognition: Identify text spans
○ Entity disambiguation: If there are multiple matching entities, find the best match
● Modeling architecture: Encoder and Decoder
○ Encoder: Generate features
○ Decoder: Make classification using the features
● We will use same encoders in later sections
46. Relation Extraction
Tesla is an Electric vehicle
company based in Palo Alto.
Tesla
Inc.
Electric
Vehicle
Palo
Alto,
CA
Specialized_In
Located_In
● Input: (Source entity, Target entity, Sentences)
● Output: Relation
48. Relation Extraction: Machine Learning Methods
● Supervised method:
○ Learn classifier from training data
○ Features
■ Text features using text encoders (Transformer, LSTM, …) [Previous section]
■ Graph features [Later section]
○ Similar to entity disambiguation from the previous section
● Semi-supervised method [This section]
● Unsupervised method [This section]
49. Distant Supervision
● Downside of supervised methods: labels are sparse and expensive to get
● Solution: Leverage another database (e.g., Freebase) to label the text corpus
○ Mintz et al. 2009: “If two entities belong to a certain relation, any sentence containing those two entities is likely to express that relation”
50. Distant Supervision
[Mintz et al. 2009]
Knowledge base: (Microsoft, Located_In, Redmond)
Sentences containing both entities:
Microsoft is based in Redmond
Microsoft is headquartered in Redmond
Microsoft has its main campus in Redmond
Positive Example
Source (A): Microsoft
Target (B): Redmond
Sentences: (A is based in B, A is headquartered in B, … )
Relation: Located_In
51. Distant Supervision
[Mintz et al. 2009]
Knowledge base: no relation between Larry Page and Microsoft
Sentences containing both entities:
Larry Page said about Microsoft
Larry Page commented on Microsoft
Negative Example
Source (A): Larry Page
Target (B): Microsoft
Sentences: (A said about B, A commented on B, … )
Relation: Nothing
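The positive and negative examples from the two slides above can be generated mechanically. The sketch below assumes a toy knowledge base with a single fact and uses plain substring matching to find entity mentions, which is a simplification; the function name `distant_supervision` and the rule "the first-mentioned entity is the source" are illustrative assumptions.

```python
from collections import defaultdict

# Toy knowledge base (stand-in for Freebase): (source, target) -> relation.
KB = {("Microsoft", "Redmond"): "Located_In"}
ENTITIES = ["Microsoft", "Redmond", "Larry Page"]

def distant_supervision(sentences, entities, kb):
    """Group sentences by co-occurring entity pair and label each pair with the KB relation, else 'Nothing'."""
    grouped = defaultdict(list)
    for sentence in sentences:
        # Entities found in the sentence, ordered by position of first mention.
        found = sorted((sentence.find(e), e) for e in entities if e in sentence)
        for i, (_, source) in enumerate(found):
            for _, target in found[i + 1:]:
                pattern = sentence.replace(source, "A").replace(target, "B")
                grouped[(source, target)].append(pattern)
    return [
        {"source": s, "target": t, "sentences": pats, "relation": kb.get((s, t), "Nothing")}
        for (s, t), pats in grouped.items()
    ]

sentences = [
    "Microsoft is based in Redmond",
    "Microsoft is headquartered in Redmond",
    "Larry Page said about Microsoft",
    "Larry Page commented on Microsoft",
]
examples = distant_supervision(sentences, ENTITIES, KB)
for example in examples:
    print(example)
```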
52. Data Programming
Snorkel [Ratner et al. 2018]
[Figure: labeling functions (Distant Supervision, Rule-based Annotation, Crowdsourcing) -> Aggregator -> Training Data]
● There are other ways to get weak labels
● Can we combine weak labels to get better labels?
53. Data Programming
● How to aggregate? If weak labels are 1, 0, 1:
○ Majority voting: label = 1
○ Generative Model (GM):
■ Label vector 𝛬: [1, 0, 1], True label Y: Unknown
■ Assume (𝛬, Y) is generated with p_w(𝛬, Y)
■ Learn w
■ Compute p_w(Y|𝛬) using p_w(𝛬, Y)
54. Data Programming: Takeaways
● The generative model is useful when there are ~10 labels per example
● In fact, we found that weighted majority voting works pretty well
● Key part is to find reasonably good weak labels
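As a rough sketch of the two voting schemes, the code below aggregates the weak labels 1, 0, 1 from the earlier slide. The per-source weights are hypothetical numbers (for instance, accuracy estimates from a small labeled set); the slides do not say how the weights were chosen.

```python
def majority_vote(labels):
    """labels: list of 0/1 weak labels. Ties resolve to 1."""
    return 1 if 2 * sum(labels) >= len(labels) else 0

def weighted_vote(labels, weights):
    """Each source votes with its weight; return 1 if 1-votes carry at least half the total weight."""
    total = sum(weights)
    positive = sum(w for label, w in zip(labels, weights) if label == 1)
    return 1 if positive >= 0.5 * total else 0

weak_labels = [1, 0, 1]
# Hypothetical weights: the second source is trusted far more than the other two.
weights = [0.2, 0.9, 0.2]
print(majority_vote(weak_labels), weighted_vote(weak_labels, weights))
```

With equal weights the two functions agree; they diverge only when one source is trusted much more than the others.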
55. Open Information Extraction
● Unsupervised Method: Machine learning without labels
● Input: Entities and sentences
● Output: Relation phrase
Tesla is based in Palo Alto -> (Tesla, Palo Alto, based in)
Tesla is headquartered in Palo Alto -> (Tesla, Palo Alto, headquartered in)
Tesla has its main campus in Palo Alto -> (Tesla, Palo Alto, has main campus in)
56. Open Information Extraction: ReVerb
[Fader et al. 2011]
1. From a sentence, take the longest phrase matching any of these three patterns:
a. a verb (e.g., invented)
b. a verb followed immediately by a preposition (e.g., located in)
c. a verb followed by nouns, adjectives, or adverbs + preposition (e.g., has atomic weight of)
2. If that phrase appears too few times, ignore
3. Apply a binary classifier to compute confidence score
a. Classification: Is the phrase a valid relation phrase?
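Step 1 can be sketched as a regular expression over part-of-speech tags. The one-letter tag alphabet, the hand-tagged inputs, and the placeholder arguments X and Y are assumptions made for illustration; this is a simplified form of the pattern, not the ReVerb implementation, and steps 2 and 3 (frequency filter and confidence classifier) are omitted.

```python
import re

# One character per token: V=verb, P=preposition, N=noun, J=adjective, R=adverb.
# Matches: a verb | verb + preposition | verb + nouns/adjectives/adverbs + preposition.
RELATION_PATTERN = re.compile(r"V(?:[NJR]*P)?")

def longest_relation_phrase(tagged_tokens):
    """tagged_tokens: list of (word, tag). Return the longest phrase matching the pattern, or None."""
    tag_string = "".join(tag for _, tag in tagged_tokens)
    matches = list(RELATION_PATTERN.finditer(tag_string))
    if not matches:
        return None
    best = max(matches, key=lambda m: m.end() - m.start())
    return " ".join(word for word, _ in tagged_tokens[best.start():best.end()])

# Hand-tagged toy inputs; X and Y stand for the argument entities.
print(longest_relation_phrase([("X", "N"), ("invented", "V"), ("Y", "N")]))
print(longest_relation_phrase([("X", "N"), ("located", "V"), ("in", "P"), ("Y", "N")]))
print(longest_relation_phrase(
    [("X", "N"), ("has", "V"), ("atomic", "J"), ("weight", "N"), ("of", "P"), ("Y", "N")]
))
```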
57. Open Information Extraction: Sequence Tagging
[Stanovsky et al. 2018]
● Use ML models to extract text spans for a relation and its arguments
○ Same methods as NER models (e.g., BiLSTM + CRF, Transformer + CRF, …)
● RnnOIE [Stanovsky et al. 2018]: BiLSTM tagger
Input: Tesla is located in Palo Alto
Output: Tesla [Arg 0] is located in [relation] Palo Alto [Arg 1].
58. Conclusion
● Key challenge: Get enough training examples to cover wide linguistic variations
● Semi-supervised methods: Come up with heuristics to get weak labels
● Unsupervised: Extract relation phrases
○ Drawback in industry: To map the phrases to the relations, we need another ML model
● Rule of thumb to choose methods
○ Lots of training examples + Complete relation dictionary: Supervised method
○ Few examples + Complete relation dictionary: Semi-supervised method
○ Very incomplete relation dictionary: Unsupervised method
61. Extending Knowledge Graph to New Datasets
● Task: Construct a knowledge graph (KG) for new data by leveraging existing KGs and data
● Examples
○ Domain adaptation: Build a KG from a domain-specific text corpus
■ Building a KG specialized for Healthcare industry
○ i18n: Build a knowledge graph for data in a new language
■ We have a KG for English users. Can we do the same thing for German users?
62. Challenges
● Domain specific annotation is very time consuming
○ Annotators need to have enough knowledge in the domain (or in the language)
○ Annotation tasks need to be clearly designed
○ If either is missing, data quality goes down significantly!
● Deep learning models require lots of data
○ Number of parameters in Transformer encoders: 100s of millions!
63. Solution: Transfer Learning
Data 1
Data 2
Model 1
Model 2
Data 1
Data 2
Model 1
Model 2
Knowledge
Transfer
Supervised Learning Transfer Learning
64. Transfer Learning
● Cross-domain transfer learning
○ Train a model on general-domain data and then transfer knowledge
● Cross-lingual transfer learning
○ Train a model in English and then transfer knowledge to other language
65. Cross-domain Transfer Learning: Pretrained Model
● Train a deep learning model with very large text corpus (Wikipedia and so on.)
○ Training is done without any labels
○ Model learns general patterns in natural language
● Update the model parameters using a small number of labels
○ Since the model knows natural language well, it needs smaller number of labels
66. BERT (Bidirectional Transformer)
[Devlin et al. 2018]
● Train a Transformer encoder for two prediction tasks
● (1) Masked language modeling (predict masked-out words from the surrounding words on both sides)
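A minimal sketch of how a masked language modeling example is built: hide a fraction of the tokens and keep the originals as prediction targets. This simplified version always substitutes [MASK]; BERT's actual recipe selects 15% of tokens and replaces 80% of those with [MASK], 10% with a random token, and leaves 10% unchanged. The function name and the fixed seed are illustrative assumptions.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return (masked tokens, {position: original token})."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "the man went to the store".split()
masked, targets = mask_tokens(tokens)
print(masked, targets)
```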
67. BERT (Bidirectional Transformer)
[Devlin et al. 2018]
● Train a Transformer encoder for two prediction tasks
● (2) Next Sentence Prediction
■ Example:
■ [CLS] The man went to the store [SEP] He bought a gallon of milk [SEP]
■ Label: IsNext
■ [CLS] The man went to the store [SEP] Penguins cannot fly [SEP]
■ Label: NotNext
68. BERT: Fine-tuning
[Devlin et al. 2018]
● Fine-tuning: Making incremental updates
on model parameters for a given task
○ Sentence pair classification:
■ (Sentence 1, Sentence 2) -> True / False
○ Single sentence classification
■ (Sentence 1) -> True / False
○ Sequence Tagging (Entity Recognition)
■ (Sentence 1) -> Tags for each token
69. BERT: Results
[Devlin et al. 2018]
● Pre-trained on ~3B words (Wikipedia, Books)
● After fine-tuning, outperformed other methods in 11 benchmark data sets
○ Fine-tuning works with ~3000 examples
○ Without any task-specific feature engineering
● For entity recognition, BERT works even without fine tuning
70. BERT: Implication
● Why does it work?
○ Context comes from both direction
○ Provides different ways to fine-tune the model
○ The model seems to learn syntactic structures [Hewitt and Manning 2019]
○ Language models seem to be correlated with multiple application tasks
● We can train sophisticated deep learning model with thousands of samples!
72. BERT: Limitations
● Slow serving
○ Distillation [Sanh et al. 2019]
○ Code optimization (ONNX Runtime)
● Handling 2+ sentences together?
○ XLNet [Yang et al. 2019]
● Nearest neighbor search is hard (modular scoring is impossible)
○ SentenceBERT [Reimers and Gurevych 2019]
● Handling very long text
○ Transformer-XL [Dai et al. 2019]
73. Cross-lingual Transfer Learning
● Assume: We have training data in English and have developed an ML model (NER or Relation)
● Can we use the data (or the model) for other languages?
74. Multilingual Encoder
English: "Tesla is located in Palo Alto." -> Encoding -> [0.1, 0.3, -0.1, …] -> Decoding -> (Tesla, Located_In, Palo Alto)
German: "Tesla befindet sich in Palo Alto." -> Encoding -> [0.1, 0.3, -0.1, …] -> Decoding -> (Tesla, Located_In, Palo Alto)
● What if the encoder gives the same feature values for sentences with the same meaning in different languages?
○ We can reuse the decoder (classifier)
○ Decoder can be trained with English training data
75. Multilingual Word Embedding
[Mikolov et al. 2013]
● Words have similar embedding if they mean the same thing
● How to get this? Apply “Translation matrix W”
○ W can be learned from parallel word dictionary X, Y
○ W: orthogonal matrix (Procrustes alignment)
X Y
Four Cuatro
Five Cinco
Horse Caballo
Dog Perro
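To illustrate learning an orthogonal W from a parallel dictionary, here is a sketch restricted to the 2-D, rotation-only special case, where the best rotation has a closed form. Real embeddings are high-dimensional and the general orthogonal Procrustes solution is obtained from a singular value decomposition, which is not shown. The embedding values are made up: the "Spanish" vectors are simply the "English" ones rotated by 90 degrees.

```python
import math

def fit_rotation(X, Y):
    """Best 2-D rotation W minimizing sum ||W x - y||^2 over paired rows of X and Y."""
    a = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(X, Y))  # sum of dot products
    b = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(X, Y))  # sum of 2-D cross products
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply(W, x):
    """Matrix-vector product W x for a 2x2 matrix."""
    return [W[0][0] * x[0] + W[0][1] * x[1], W[1][0] * x[0] + W[1][1] * x[1]]

# Hypothetical embeddings for the dictionary pairs on the slide.
X = {"Four": [1.0, 0.0], "Five": [0.9, 0.1], "Horse": [0.0, 1.0], "Dog": [0.1, 0.9]}
Y = {"Cuatro": [0.0, 1.0], "Cinco": [-0.1, 0.9], "Caballo": [-1.0, 0.0], "Perro": [-0.9, 0.1]}
pairs = [("Four", "Cuatro"), ("Five", "Cinco"), ("Horse", "Caballo"), ("Dog", "Perro")]

W = fit_rotation([X[e] for e, _ in pairs], [Y[s] for _, s in pairs])
print(apply(W, X["Dog"]), Y["Perro"])
```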
76. Unsupervised Multilingual Embedding
● What if you do not have a parallel word dictionary (X, Y)? We do these two steps
○ Learn translation matrix W (without word pairs (X, Y)) by adversarial learning
○ Construct (X, Y) by best-matching words between Wx and y
77. Unsupervised Multilingual Embedding
● Adversarial learning: Train two models that compete with each other
○ Wx: “Translated” embedding in a source language, y: Embedding in a target language
○ Discriminator D: Detect which language the example comes from
■ e.g., Wx is from the source language
○ Translation W: Fool the discriminator into detecting the wrong language
78. Multilingual Word Embedding: Our Experience
● Public multilingual embedding may have low coverage on the data set
● Unsupervised word alignment (adversarial learning) works
○ But if you have good parallel word dictionary, Procrustes alignment is easier and better
● Our team’s approach:
○ Step 1: Train embedding on the target data set for each language separately
■ Why? to get good coverage
○ Step 2 (Optional): Run adversarial learning to get parallel word dictionary
○ Step 3: Align by solving Y ~ WX
79. Multilingual Pretrained Model
[Lample and Conneau 2019]
● Pretraining models that cover multiple languages
○ Multilingual BERT
○ XLM: Masked language modeling on parallel sentences
80. Adversarial Learning: Sentence Classification
[Chen et al. 2018]
● Word embedding / pretrained models: General encoder
● Can we train multilingual, task-specific encoder?
● Yes. Adversarial learning again!
○ Train these two models that compete with each other
■ Generator (Encoder): Generate features
■ Discriminator: Detect language from the encoder output
■ Generator has two purposes
● Help classification (sentiment classification)
● Fool the discriminator
81. Other Approach: Training Data Augmentation
[Huang et al. 2019]
● We have (text in source language, labels)
● Can we convert this to (text in target language, labels)?
○ Use translation to convert text
○ Use heuristics to convert labels
82. Cross-lingual Transfer Learning: Our Experience
● Game changers:
○ Cross-lingual encoder (word embedding or pretrained models)
○ Adding some (~500) hand-labelled examples for the target language
● Things that help incrementally
○ Data augmentation by Machine translation
○ Task-specific encoder by Adversarial learning
83. Conclusion
● Transfer learning: Leverage other datasets to transfer the knowledge
● Cross-domain transfer learning
○ Train a model in one domain and fine tune for the target domain
○ Pretrained deep NLP models are the state-of-the-art (BERT and its variants)
● Cross-lingual transfer learning
○ Multilingual encoder works
○ Use adversarial learning to achieve language invariance
84. Recap before the break
Task Key technologies to discuss
Named Entity Recognition and
Disambiguation
● Natural Language Understanding Models (LSTM, Transformer)
Relation Extraction ● Semi-supervised method (Distant supervision, Data programming)
● Unsupervised method (Open information extraction)
Scalable Relation Extraction with
Limited Data
● Pretrained deep learning models (BERT and friends)
● Cross-lingual transfer learning (Adversarial learning, Multilingual encoder)
89. Definition of Graph Refinement
Graph Refinement is a task that aims to infer and add missing knowledge to the graph, or
identify erroneous information.
[Figure: before refinement the graph contains Accounting, Data Mining, Algorithms, Director of Engineering, Staff Researcher; after refinement it contains Data Mining, Algorithms, Director of Engineering, Staff Researcher, plus Machine Learning, Supervised Learning, Reinforcement Learning, ...]
90. How to Refine the Graph?
Method | Experts | Crowd | Machine Learning
Quality | High | Low to Medium | Medium to High
Volume | Low | Medium | High
Cost | High | Medium | Low to Medium
91. Scalable Graph Refinement
Scalable Graph Refinement aims at refining graph at scale via ingesting large scale data.
Crowd Machine Learning
92. Data for Scalable Graph Refinement
Structured Data Crowd Labels Social Network User Activity
93. Challenges for Data Ingestion
Ingest large-scale, noisy crowdsourced data
● Q1: How to leverage existing, large-scale structured data?
● Q2: How to leverage large volume of noisy social network text data?
● Q3: How to aggregate accurate labels from crowd workers?
Social network user feedback loop
● Q4: How to validate knowledge via social signals?
● Q5: How to grow social network via constructed knowledge graph?
● Q6: How to improve social network and knowledge graph jointly?
94. Ingest Structured Data via Entity Alignment
Q1: How to leverage existing, large-scale structured data?
Entity Alignment between knowledge graphs aims to find entities in two graphs that represent the
same real-world entity.
ITransE-SA (Zhu et al. 2017)
95. Feature-based entity alignment
Q1: How to leverage existing, large-scale structured data?
The alignment score is determined by the average
string similarity between a node pair and their
neighbor pairs connected via the same edge type.
RDF-AI (Scharffe et al. 2009)
Sim(1, 9)=(StrSim(1,9) + StrSim(2,11) + StrSim(3,12))/3
Requires preprocessing (translation) and schema alignment.
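The averaging rule above can be sketched as follows. The string similarity here is `difflib.SequenceMatcher.ratio()` from the Python standard library, used as a stand-in because the slide does not say which string similarity RDF-AI applies; the node labels and the `alignment_score` helper are hypothetical.

```python
from difflib import SequenceMatcher

def str_sim(a, b):
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def alignment_score(node_a, node_b, neighbors_a, neighbors_b):
    """Average similarity of the node pair and of neighbor pairs reached via the same edge type.

    neighbors_*: dict mapping edge type -> neighbor label.
    """
    sims = [str_sim(node_a, node_b)]
    for edge_type, label_a in neighbors_a.items():
        if edge_type in neighbors_b:
            sims.append(str_sim(label_a, neighbors_b[edge_type]))
    return sum(sims) / len(sims)

same = alignment_score(
    "Tesla Inc.", "Tesla Inc.",
    {"Located_In": "Palo Alto", "Specialized_In": "Electric Vehicle"},
    {"Located_In": "Palo Alto", "Specialized_In": "Electric Vehicle"},
)
different = alignment_score(
    "Tesla Inc.", "Apple Inc.",
    {"Located_In": "Palo Alto"},
    {"Located_In": "Cupertino"},
)
print(same, different)
```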
96. Embedding-based Entity Alignment
Q1: How to leverage existing, large-scale structured data?
ITransE-SA (Zhu et al. 2017) Requires a seed set of aligned entities and schema alignment.
True edge
False edge
Triple loss
Optimize embedding of each graph individually.
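The slide's formula did not survive extraction, so the following is only a sketch under the assumption of a TransE-style score, where a true edge (h, r, t) should satisfy h + r ≈ t and a corrupted (false) edge should be at least a margin further away. The vectors and the margin value are made up.

```python
import math

def distance(h, r, t):
    """Euclidean distance between h + r and t (TransE-style score, lower is better)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(true_triple, false_triple, margin=1.0):
    """Hinge loss: zero once the false edge is at least `margin` further away than the true edge."""
    return max(0.0, margin + distance(*true_triple) - distance(*false_triple))

h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]   # true edge: h + r equals t
t_far, t_near = [-2.0, 0.0], [1.0, 0.5]        # two corrupted targets

print(margin_loss((h, r, t), (h, r, t_far)))   # false edge already far away
print(margin_loss((h, r, t), (h, r, t_near)))  # false edge too close, so loss is positive
```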
97. Embedding-based Entity Alignment
Q1: How to leverage existing, large-scale structured data?
ITransE-SA (Zhu et al. 2017) Requires a seed set of aligned entities and schema alignment.
Alignment loss
98. Embedding-based Entity Alignment
Q1: How to leverage existing, large-scale structured data?
(Trisedya et al, 2019) Only requires schema alignment.
Located in (transitive rule)
Located in Located in
Predicates are unified.
99. Embedding-based Entity Alignment
Q1: How to leverage existing, large-scale structured data?
(Trisedya et al, 2019) Only requires schema alignment.
Address inconsistent attribute representations:
f_a(“Barack Obama”) ~ f_a(“Barack Hussein Obama”)
f_a(“50.9989”) ~ f_a(“50.998888889”)
Minimize the distance between structure embedding and attribute embedding
Align entities by computing their cosine similarity
100. Recap on Ingesting Structured Data
● Use RDF-AI as a proof of concept if all nodes have textual features in the same language.
● If the schemas are aligned and nodes have textual features, use Trisedya 2019.
● If the schemas are aligned and existing aligned entity pairs exist, use ITransE-SA.
101. Ingest Crowdsourced Labels via Answer Aggregation
Q3: How to aggregate accurate labels from crowd workers?
Answer aggregation for crowdsourcing is a task that finds the hidden ground truth from a set of
answers given by the crowd workers.
Example question: Does the following job description sentence require data science skill? Yes or No?
"Work with a team of high-performing analytics, data science professionals, and cross-functional teams to identify business opportunities and develop algorithms and methodologies to address them."
[Figure: workers' Yes/No answers -> Aggregation]
102. Q3: How to aggregate accurate labels from crowd workers?
Answer Ingestion
Social Honeypot (Lee et al, 2010)
reCAPTCHA (Von Ahn et al. 2008)
Answer Filtering (remove workers who failed trapping questions)
Trapping question (ground truth)
Answer Aggregation (Hung et al. 2013)
● Majority vote (Kuncheva et al. 2003)
● Weight answers by worker expertise & question difficulty
○ Trapping question-based (Khattak and Salleb-Aouissi 2011)
○ Supervised EM (binary label only) (Raykar et al, 2009)
● Snorkel (Ratner et al., 2017)
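The filtering and majority-vote stages can be sketched together. The worker names, questions, and the 0.8 accuracy threshold are hypothetical; the weighting schemes and Snorkel listed above are not shown.

```python
from collections import Counter, defaultdict

def filter_workers(answers, gold, min_accuracy=0.8):
    """Keep workers whose accuracy on trapping questions (those present in `gold`) is high enough.

    answers: {worker: {question: label}}. Workers who answered no trapping question are dropped.
    """
    kept = {}
    for worker, given in answers.items():
        traps = [q for q in given if q in gold]
        if not traps:
            continue
        accuracy = sum(given[q] == gold[q] for q in traps) / len(traps)
        if accuracy >= min_accuracy:
            kept[worker] = given
    return kept

def majority_vote(answers, gold):
    """Most frequent label for each non-trapping question."""
    votes = defaultdict(Counter)
    for given in answers.values():
        for question, label in given.items():
            if question not in gold:
                votes[question][label] += 1
    return {q: c.most_common(1)[0][0] for q, c in votes.items()}

gold = {"t1": "Yes", "t2": "No"}  # trapping questions with known answers
answers = {
    "w1": {"t1": "Yes", "t2": "No", "q1": "Yes"},
    "w2": {"t1": "Yes", "t2": "No", "q1": "Yes"},
    "w3": {"t1": "No", "t2": "Yes", "q1": "No"},
    "w4": {"t1": "No", "t2": "No", "q1": "No"},
    "w5": {"t1": "No", "t2": "Yes", "q1": "No"},
}
kept = filter_workers(answers, gold)
print(majority_vote(answers, gold), majority_vote(kept, gold))
```

In this toy data the unfiltered majority is driven by three workers who fail the trapping questions, so filtering flips the aggregated answer.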
103. Generative Answer Aggregation -- Snorkel
Q3: How to aggregate accurate labels from crowd workers?
Label matrix Λ: m instances × n workers; each entry is a human-provided label y, or ø if no judgement.
Factors in the model (formulas not recoverable from the slide):
- Instance i has a label from worker j
- Instance i has label y_i from worker j
- Workers j and k have the same label
The feature vector is the concatenation of the three vectors; the model includes a normalizing constant.
The true label is unknown: use contrastive divergence to solve for w without ground truth labels.
Output: probabilistic training label
104. Recap on Ingesting CrowdSourced Labels
● Always use trapping questions to filter out low quality answers/workers.
● Use majority vote as the baseline to aggregate answers.
● To further improve the answer quality, use Snorkel to aggregate the labels.
105. Knowledge Validation via Social Signals
Q4: How to validate knowledge via social signals?
Knowledge validation via social signals is a task that aims at validating factual knowledge graph
information by collecting signals from end-users directly.
Name | Quality | Cost | Scale | Setting
Crowd Workers | Mid | Mid to High | Small to Mid | Usually single task
Social Signals | Mid to High | Low | Large | Multi-task
Examples: LinkedIn’s skill validation, Google Maps’ venue questions
106. Social Signal Knowledge Validation in Google Maps
Q4: How to validate knowledge via social signals?
(Kobren et al. 2019)
Social signal collection for each (location, attribute) pair
Notation: location l; attribute a; count of yes votes; count of votes; yes vote rate; expected yes rate; certainty of expected yes rate
Model the member’s voting behavior as a beta distribution
Use users’ votes to construct a knowledge base for locations.
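One simple way to read "expected yes rate" and "certainty of expected yes rate" is the mean and variance of a Beta posterior over the yes rate. The sketch below uses the textbook Beta-Binomial update with a uniform Beta(1, 1) prior; that prior and the function name are assumptions, and this is not necessarily the model used by Kobren et al.

```python
def beta_posterior(yes_votes, total_votes, prior_a=1.0, prior_b=1.0):
    """Posterior mean and variance of the yes rate for one (location, attribute) pair."""
    a = prior_a + yes_votes
    b = prior_b + (total_votes - yes_votes)
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance

few_mean, few_var = beta_posterior(8, 10)      # 8 yes out of 10 votes
many_mean, many_var = beta_posterior(80, 100)  # same rate, ten times the votes
print(few_mean, few_var)
print(many_mean, many_var)
```

More votes at the same yes rate leave the mean roughly unchanged but shrink the variance, which is the sense in which certainty grows with evidence.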
107. Social Signal Knowledge Validation in Google Maps
Q4: How to validate knowledge via social signals?
(Kobren et al. 2019)
location-features
(natural language text)
aggregated vote of other attributes
(raw count, majority vote, etc)
generated location
embedding w.r.t. attribute0
expected yes rate
certainty of expected yes rate
(determines false positive rate)
output
108. LinkedIn’s Social Skill Validation
Q4: How to validate knowledge via social signals?
(Yan et al. 2019)
Use member actions to learn the skill expertise of our members.
LinkedIn Skill Endorsement Product
Yes/yes question: Users can act without judgement.
No anonymity: Users use endorsements as a social gesture.
109. LinkedIn’s Social Skill Validation
Q4: How to validate knowledge via social signals?
(Yan et al. 2019)
Use member actions to learn the skill expertise of our members.
Compare a connection’s skills within a certain category
Normalizing the score given by the
user to remedy social gesture.
110. LinkedIn’s Social Skill Validation
Q4: How to validate knowledge via social signals?
(Yan et al. 2019)
Use member actions to learn the skill expertise of our members.
Ask viewer to rank the skill level of candidates
candidate skill
viewer
Uses a ML model to provide candidates.
111. LinkedIn’s Social Skill Validation
Q4: How to validate knowledge via social signals?
(Yan et al. 2019)
Use member actions to learn the skill expertise of our members.
Multi-task Model (member, skill, expertise score)
112. Recap on knowledge validation via social signals
● Social signals still require answer aggregation.
● The design of social signal collection is crucial for data quality.
113. Knowledge Graph guided Social Link Prediction
Q5: How to grow social network via constructed knowledge graph?
Given a social network and a knowledge graph, Knowledge graph guided social link prediction
aims to predict member - member connections using the knowledge graph.
Social Network
Knowledge Graph
Knowledge
Graph guided
Social Link
Prediction
114. Matrix Factorization for Social Link Prediction
n members × n members
(Menon and Elkan. 2011)
i-j connected?
node embedding
node bias regularization
Purely based on topological information.
Q5: How to grow social network via constructed knowledge graph?
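The figure labels on this slide (node embedding, node bias, regularization) can be read as a factorization of the n × n member adjacency matrix. The following is a simplified sketch of that idea, not the full model of Menon and Elkan; the logistic link, the log loss, and all function names are assumptions made for illustration.

```python
import numpy as np


def link_score(emb, bias, i, j):
    """Score in (0, 1) that members i and j are connected."""
    logit = emb[i] @ emb[j] + bias[i] + bias[j]
    return 1.0 / (1.0 + np.exp(-logit))


def regularized_loss(emb, bias, pairs, labels, reg=0.01):
    """Mean log loss over member pairs plus L2 penalties on embeddings and biases."""
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for (i, j), y in zip(pairs, labels):
        p = link_score(emb, bias, i, j)
        loss -= y * np.log(p + eps) + (1 - y) * np.log(1.0 - p + eps)
    loss /= len(pairs)
    return loss + reg * (np.sum(emb ** 2) + np.sum(bias ** 2))


# Untrained model: three members, two latent dimensions, all parameters zero.
emb = np.zeros((3, 2))
bias = np.zeros(3)
untrained_score = link_score(emb, bias, 0, 1)
untrained_loss = regularized_loss(emb, bias, [(0, 1), (0, 2)], [1, 0])
```

Because the score depends only on the graph structure seen during training, a member with no connections gets no useful embedding, which is the limitation the attribute-based methods on the next slides address.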
115. Social Link Prediction using Member Attributes
(Zhang et al. 2018)
Reconstructs weighted
average of neighbors’
attributes
Predicts skip-gram neighbors’
structural embedding.
f(x)
v
one-hot attribute vector
Learns attribute embeddings implicitly.
How does skip-gram work?
Q5: How to grow social network via constructed knowledge graph?
116. Social Link Prediction using Member Attributes
(Meng et al, 2019)
attribute graph
adjacency matrix
one-hot attribute vector
one-hot node vector
Reconstructed
member-member graph
Reconstructed
member-attribute graph
Attribute embeddings are a function of members and hence are useful for social link prediction only.
Q5: How to grow social network via constructed knowledge graph?
117. Joint Representation Learning on Social and
Knowledge Graph
Q6: How to improve social network and knowledge graph jointly?
Given a social network and a knowledge graph, we want to learn node representations to refine
both graphs jointly.
An example of social network + knowledge graph.
118. Ambiguous Social Connections: Person-Person connections are ambiguous.
An illustration of LinkedIn’s Heterogeneous Social Network
Colleague
candidate-recruiter
Q6: How to improve social network and knowledge graph jointly?
Joint Representation Learning on Social and
Knowledge Graph
(Shi et al, 2019)
119. Corrupted Higher-order Proximity: Cannot learn meaningful entity embeddings.
An illustration of LinkedIn’s Heterogeneous Social Network
candidate-recruiter
Not similar because
candidate-recruiter
relationship does not
indicate occupation
similarity.
Q6: How to improve social network and knowledge graph jointly?
Joint Representation Learning on Social and
Knowledge Graph
(Shi et al, 2019)
120. Joint Representation Learning on Social and
Knowledge Graph
The learned embeddings can be used to predict connections between two arbitrary types.
(Shi et al, 2019)
Q6: How to improve social network and knowledge graph jointly?
121. Methods to refine a graph in a scalable way
Roller, Stephen, Douwe Kiela, and Maximilian Nickel. "Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.
● Use graph alignment to ingest large volume of knowledge from external knowledge graphs
● Use Snorkel to aggregate and denoise crowdsourced labeled data for graph refinement.
● Design social feedback loops to collect social signals, aggregate them and refine the graph.
● Use representation learning to refine knowledge graph and social network.
124. Taxonomy Examples
“Taxonomy is the practice and science of classification of things or
concepts, including the principles that underlie such classification”
Ciaramita, Massimiliano, et al. "Hierarchical preferences in a broad-coverage lexical taxonomy." Proceedings of the Annual Meeting of the Cognitive Science Society. Vol. 27. No. 27. 2005.
Wheeler, David L., et al. "Database resources of the national center for biotechnology information." Nucleic acids research 36.suppl_1 (2007): D13-D21.
WordNet NCBI Taxonomy Common Tree
125. Taxonomy Examples
“Taxonomy is the practice and science of classification of things or
concepts, including the principles that underlie such classification”
SOC, URL: https://www.bls.gov/soc/
Library card catalog picture credit: https://www.smithsonianmag.com/smart-news/card-catalog-dead-180956823/
US Bureau of Labor Statistics - SOC Library Card Catalog
126. Is Taxonomy a Knowledge Graph?
Machine Learning
Data Mining
Algorithms
Director of Engineering
Staff Researcher
Knowledge Graph describes relationships
between real-world entities.
(No hierarchical information)
Taxonomy describes the classification of
real-world entities and concepts.
(Has hierarchical information)
Analytics Software Development Computer Science
Data Mining Machine Learning Algorithms
Specific<-General
Scikit-Learn
127. Many taxonomies are constructed manually
Name | Creator/Organizer | Domain | Scale | Method
O*NET | US Department of Labor | Occupation | 1167 | Manual
SOC | US Bureau of Labor Statistics | Occupation | 867 | Manual
WordNet | Princeton University | Noun & Verbs | 155,327 | Manual
Global WordNet | VU University Amsterdam | various | various | Transfer & merge
NCBI Taxonomy | National Center for Biotechnology Information | Biology | 657,846 | Manual
LCC | Library of Congress | Library | 227 | Manual
Updating the O*NET-SOC Taxonomy URL: https://www.onetcenter.org/dl_files/UpdatingTaxonomy_Summary.pdf
Federal Register Notice, URL: https://www.bls.gov/soc/2018/soc2018final.pdf
Miller, George A. WordNet: An electronic lexical database. MIT press, 1998.
Fellbaum, Christiane. "A semantic network of english: the mother of all WordNets." EuroWordNet: A multilingual database with lexical semantic networks. Springer, Dordrecht, 1998. 137-148.
Bo Svensén. 2009. A Handbook of Lexicography. The Theory and Practice of Dictionary-Making. Cambridge University Press.
The NCBI Taxonomy database, URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245000/
Library of Congress Classification Outline, URL: http://www.loc.gov/aba/cataloging/classification/lcco/
The majority of taxonomies are constructed manually by domain experts.
128. Updating taxonomy manually is time-consuming
On average, O*NET is updated at a rate of 1.6 occupations per day.
O*NET Occupation Update Summary, URL: https://www.onetcenter.org/dataUpdates.html
129. Automatic Taxonomy Construction (ATC)
Problem Definition: Given a text corpus and/or auxiliary data, construct a directed taxonomy
graph G=(V,E), where V is a set of taxonomy entities and E is a set of directed edges (u -> v).
Text Corpus and/or Auxiliary Data -> ATC Model -> Induced Taxonomy
130. Challenges of Automatic Taxonomy Construction
● Q1: How to ensure high precision of the constructed taxonomy?
● Q2: How to ensure high recall of the constructed taxonomy?
● Q3: How to reduce the need for a large volume of in-domain text corpora?
131. Hearst Patterns model
Roller, Stephen, Douwe Kiela, and Maximilian Nickel. "Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2018.
Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.
Text Corpus
Patterns generated manually
Pattern
Matching
Hypernym(wound, injury)
Hypernym(broken bone, injury)
Hypernym(treasury, civic building)
Hypernym(England, common-law country)
...
Extracted X-isA-Y relationships
(Hearst, 1992 & 1998)
Q1: How to ensure high precision of the constructed taxonomy?
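Pattern matching of this kind can be sketched with regular expressions. The two patterns below ("Y such as X" and "X and other Y") are classic Hearst patterns; restricting terms to single words, skipping lemmatization, and the function name `extract_hypernyms` are simplifying assumptions of this sketch, so it will not recover multi-word terms such as "broken bone".

```python
import re

# A term is a single word here; real systems match noun phrases with a parser.
WORD = r"[A-Za-z-]+"

PATTERNS = [
    # "Y such as X" -> hypernym(X, Y)
    (re.compile(rf"\b({WORD}) such as ({WORD})\b"), "hypernym_first"),
    # "X and other Y" -> hypernym(X, Y)
    (re.compile(rf"\b({WORD}) and other ({WORD})\b"), "hyponym_first"),
]


def extract_hypernyms(text):
    """Return (hyponym, hypernym) pairs found by the patterns above."""
    pairs = []
    for pattern, order in PATTERNS:
        for match in pattern.finditer(text):
            first = match.group(1).lower()
            second = match.group(2).lower()
            if order == "hypernym_first":
                pairs.append((second, first))
            else:
                pairs.append((first, second))
    return pairs


found = extract_hypernyms("He treated injuries such as wounds.")
also_found = extract_hypernyms("Wounds and other injuries heal slowly.")
```

The extracted pairs keep the surface forms ("injuries" rather than "injury"); mapping them to canonical entities is a separate step.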
132. Discover Hearst Patterns
Q1: How to ensure high precision of the constructed taxonomy?
Four steps to manually create Hearst patterns (Hearst, 1992 & 1998):
1. Decide on a lexical relation (e.g. is-A).
2. Gather a list of term pairs for which the relation is known to hold
(e.g. [Java, Programming Language]).
3. Collect sentences where both terms from a term pair appear
(e.g. "Java is a general-purpose programming language").
4. Find common patterns that indicate the relation of interest
(e.g. "X is a (adj.) Y").
Four steps to automatically create Hearst patterns (Snow et al., 2004):
1. Collect noun pairs from corpora, identifying hypernym pairs using
WordNet.
2. For each noun pair, collect sentences in which both nouns occur.
3. Parse the sentences and extract patterns from the parse tree.
4. Train a hypernym classifier using the extracted patterns as features.
133. Limitation of Pattern-based models
● Rule-based methods have low recall.
● Performance relies on the completeness of the patterns.
● Creating patterns is time-consuming.
● Can only extract relationships between co-occurring entities.
Distributional models can identify relationships between unobserved entity pairs.
134. Distributional Methods
Q2: How to ensure high recall of the constructed taxonomy?
(Cederberg and Widdows, 2003)
Score a pair (x, y) by $\cos(h_x, h_y)$
Co-occurrence count matrix: rows are all phrases in the corpus, columns are the top-1000 non-stop words
Singular value decomposition: $U \Sigma V^{*}$
$h_x$, $h_y$: word vectors
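The pipeline on this slide (co-occurrence counts, singular value decomposition, cosine scoring) can be sketched with NumPy. This is an illustration only: taking the word vectors as rows of U scaled by the singular values, the chosen rank, and the toy counts are assumptions, not details confirmed by the slide.

```python
import numpy as np


def svd_word_vectors(cooc, rank):
    """Reduce a (phrases x context words) count matrix with a truncated SVD.

    Returns one dense vector per phrase (row of U scaled by singular values).
    """
    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :rank] * s[:rank]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy counts: phrases 0 and 1 share the same contexts, phrase 2 does not.
cooc = np.array([
    [4.0, 0.0, 1.0],
    [8.0, 0.0, 2.0],
    [0.0, 5.0, 0.0],
])
vectors = svd_word_vectors(cooc, rank=2)
similar = cosine(vectors[0], vectors[1])
dissimilar = cosine(vectors[0], vectors[2])
```

Two phrases never need to co-occur with each other to receive a high score; they only need to appear in similar contexts, which is why this family of methods helps recall.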
135. Hearst + Distributional Methods
Cederberg, Scott, and Dominic Widdows. "Using LSA and noun coordination information to improve the precision and recall of automatic hyponymy extraction." HLT-NAACL, 2003.
Roller, Stephen, Douwe Kiela, and Maximilian Nickel. "Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora." ACL. 2018.
Q2: How to ensure high recall of the constructed taxonomy?
Score a pair (x, y) by $\mathrm{spmi}(x, y) = u_x^{\top} \Sigma_r v_y$
Score matrix: rows and columns are all phrases in the corpus
Singular value decomposition: $U \Sigma_r V^{*}$, with $u_x$ a row of $U$ and $v_y$ a row of $V$
Score hypernym(x, y) by
● $p(x, y)$ = count of extracted hypernym(x, y) / total extractions
● Positive Pointwise Mutual Information (Roller et al., 2018)
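The scores above can be sketched as follows: build a PPMI matrix from the counts of pattern extractions, then reconstruct it from a rank-r SVD so that entry (x, y) equals $u_x^{\top} \Sigma_r v_y$. The function names and the toy counts are assumptions of this sketch, and the treatment of zero counts (PPMI set to 0) is a common convention rather than something stated on the slide.

```python
import numpy as np


def ppmi_matrix(counts):
    """counts[x, y] = number of times hypernym(x, y) was extracted."""
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # x observed as a hyponym
    p_y = p_xy.sum(axis=0, keepdims=True)  # y observed as a hypernym
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs get a score of 0
    return np.maximum(pmi, 0.0)


def spmi_matrix(ppmi, rank):
    """Rank-r reconstruction; entry (x, y) is u_x^T Sigma_r v_y."""
    u, s, vt = np.linalg.svd(ppmi, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


counts = np.array([
    [2.0, 0.0],
    [0.0, 2.0],
])
ppmi = ppmi_matrix(counts)
spmi_full_rank = spmi_matrix(ppmi, rank=2)
```

With the full rank the reconstruction reproduces the PPMI matrix exactly; choosing a smaller rank is what allows non-zero scores for pairs that were never extracted together.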
137. Limitation of Distributional methods
● Require large amounts of in-domain text corpora.
● Can only extract relationships between entities that exist in the text corpora.
● Lexical memorization (memorizing that certain words correlate with certain labels) (Levy et al. 2015)
Instead of requiring in-domain text corpora, one can induce a taxonomy from an existing taxonomy.
Q2: How to ensure high recall of the constructed taxonomy?
138. Keyword + General Purpose Taxonomy
Q3: How to reduce the need for a large volume of in-domain text corpora?
(Liu et al., 2011)
Domain-specific keywords (e.g. "Indiana cheap car insurance")
-> parent concept probabilities (from the general-purpose taxonomy) + search keywords’ context
-> bag-of-words vector -> hierarchical clustering -> domain-specific taxonomy
139. Refine Taxonomy via Hyperbolic Embeddings
Q3: How to reduce the need for a large volume of in-domain text corpora?
(Nickel and Kiela, 2017; Le et al., 2019)
Existing taxonomy graph in the format of (u,v) edges -> Hyperbolic Embedding Model -> Learned taxonomy
Hyperbolic embedding models infer taxonomy from existing graphs instead of text corpora.
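As a small illustration of why hyperbolic space suits hierarchies, the sketch below computes the distance in the Poincaré ball, the geometry used by Nickel and Kiela (2017). Only the distance function is shown; the training procedure over (u,v) edges is omitted, and the example points are made up for illustration.

```python
import numpy as np


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))


origin = np.array([0.0, 0.0])
inner = np.array([0.5, 0.0])

# Two pairs separated by the same angle, one near the center, one near the boundary.
near_center = poincare_distance(np.array([0.1, 0.0]), np.array([0.0, 0.1]))
near_boundary = poincare_distance(np.array([0.9, 0.0]), np.array([0.0, 0.9]))
```

Distances grow rapidly toward the boundary of the ball, so general concepts can sit near the origin while many specific concepts spread out near the boundary without crowding each other.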
140. Steps to create a taxonomy
● Collect a large amount of in-domain data
● Use Hearst rules to extract a high-precision taxonomy (Hearst 1998, Snow 2004)
● Use distributional methods to improve recall (Roller 2018, Mao 2018)
● Use hyperbolic embeddings to refine the taxonomy structure (Le 2019)
● Extend your taxonomy using new entities / keywords and a general-purpose taxonomy (Liu 2011)