SlideShare ist ein Scribd-Unternehmen logo
1 von 78
Downloaden Sie, um offline zu lesen
Generating Pseudo-ground Truth for
Predicting New Concepts in Social Streams
David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
What is "anema"?
2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
For content interpretation and complex filtering
tasks we want to know who/what people talk about.
7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Entity Linking
8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
TANK (VEHICLE)
Knowledge
Base (KB)
Document r
TANK
query q
?
?
TANK JOHNSON
9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB
14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
!
15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has
16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74/140
Challenge 2:
Noisy data & changing KB
Unsupervised method for generating 

pseudo-ground truth (for training NER)
20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Assumption
A named-entity recognizer trained only on 

KB entities will learn to recognize KB entities
21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Hahaha! Are we sure Jillert Anema isn't
Canadian? RT @rzbh: Dutch Coach's Anti-
America Rant http://on.cc.com/1htk9Wo
32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
product
organization
organization
37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
Future
KB
43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
full KB
small KB
44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
Evaluate
Evaluation
• Mention level (NER style)
• Entity level
!
47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
• Mention level (NER style)
• Entity level
!
• Mention level (NER style)
• Entity level
!
49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
• Mention level (NER style)
• Entity level
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
!
Components:
EL: Semanticizer
NERC: Custom
53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP",
"linkProbability": 0.17446043165467626,
"id": "30864663",
"senseProbability": 0.11690647482014388,
"title": "ASAP (variety show)",
"url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29",
"label": "ASAP",
"priorProbability": 0.631578947368421
},
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
B I O B I
O O O O O O
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
person person
59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
Future
KB
NERC
Tweet
NERC
Model
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Predictions
m1
, c1
m2, c2
…
Entity
Linker
Sample
Corpus
Training data
m1
, c1
m2
, c2
From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
!
2. Pick "good" samples
• entity linker's confidence score
• textual quality
60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?
62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling Methods
1. Entity linker confidence score
2. Textual quality
63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Extract anchor text (a) to Wikipedia page (W)-mappings
• Confidence score combines two signals:
1. How common is it that a is used as a link
2. How commonly is a used as a link to W
64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = fewer entities, more noise
65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Highest scoring tweets
1. Watching the History channel, Hitler’”⁹s
Family. Hitler hid his true family
heritage, while others had to measure
up to Aryan purity.
2. When you sense yourself becoming
negative, stop and consider what it
would mean to apply that negative
energy in the opposite direction.
3. So. After school tomorrow, french
revision class. Tuesday, Drama
rehearsal and then at 8, cricket
training. Wednesday, Drama. Thursday
… (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man
Enough for Me _HASHTAG_
_HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore
Girls, Part One _URL_
3. this girl better not go off on me rt
Sampling 2: Textual quality
• Compare different sampling strategies;
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)
68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Results
69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
NERC
NERC
Model
Unlabeled Tweet
?Entity
Linker
Future
KB
Tweet
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus Sample
Corpus
Predictions
m1
, c1
m2, c2
…
RQ1: 	
 What is the impact of our sampling methods for 	
 

	
	
 	
 	
 	
 generating	
pseudo-ground truth?
Training data
m1
, c1
m2
, c2
Findings: EL confidence score threshold
1. Higher threshold, higher accuracy
!
!
!
!
!
!
!
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Solid: Precision

	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Dotted: Recall
71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
Findings: EL confidence score threshold
2. Higher threshold, more predictions
72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Findings: Textual Quality Sampling
73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Tweet
Unlabeled Tweet
?
NERC
Model
NERC
Training data
m1
, c1
m2
, c2
Sample
Corpus
m1
, e1
m2
, e2
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
RQ2: 	
What is the impact of the size of prior knowledge
	
	
 	
 	
 	
 on detecting unknown entities?
Entity
Linker
Today's
KB
Predictions
m1
, c1
m2, c2
…
Results: RQ2
Sampling 2: KB size (mentions)
!
!
!
!
!
!
	
	
 	
 	
 	
 	
 	
 	
 	
 	
	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 blue: Our method

	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 red: Baseline

75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
50"
20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows
76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling
77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Fin
Questions?
!
!
!
!
!
!
!
78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

Weitere ähnliche Inhalte

Ähnlich wie Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Ähnlich wie Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams (20)

Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentGovernment ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
 
Adaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingAdaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & Living
 
Partner-Driven Innovation
Partner-Driven InnovationPartner-Driven Innovation
Partner-Driven Innovation
 
Everything that happens, happens somewhere
Everything that happens, happens somewhereEverything that happens, happens somewhere
Everything that happens, happens somewhere
 
Taking your learning further SEND
Taking your learning further SENDTaking your learning further SEND
Taking your learning further SEND
 
Technology Driven Diferentiation
Technology Driven DiferentiationTechnology Driven Diferentiation
Technology Driven Diferentiation
 
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
 
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraph
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital Humanities
 
Now What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsNow What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and Apps
 
Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019
 
2013.05.13 tom de nies - snow 2013
2013.05.13   tom de nies - snow 20132013.05.13   tom de nies - snow 2013
2013.05.13 tom de nies - snow 2013
 
Possibilities to Practices II
Possibilities to Practices IIPossibilities to Practices II
Possibilities to Practices II
 
Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015
 
Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma
 
MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)
 
Tim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data SchemeTim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data Scheme
 
Lean at ITP info session 2015
Lean at ITP info session 2015Lean at ITP info session 2015
Lean at ITP info session 2015
 
Digital Storytelling
Digital Storytelling Digital Storytelling
Digital Storytelling
 

Mehr von David Graus

David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
David Graus
 

Mehr von David Graus (15)

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
 
Understanding Email Traffic
Understanding Email TrafficUnderstanding Email Traffic
Understanding Email Traffic
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
 

Kürzlich hochgeladen

biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 

Kürzlich hochgeladen (20)

Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIACURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
CURRENT SCENARIO OF POULTRY PRODUCTION IN INDIA
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

  • 1. Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
  • 2. What is "anema"? 2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 3. 3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 4. 4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 5. 5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 6. 6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 7. For content interpretation and complex filtering tasks we want to know who/what people talk about. 7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 8. Entity Linking 8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media TANK (VEHICLE) Knowledge Base (KB) Document r TANK query q ? ? TANK JOHNSON
  • 9. 9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 10. 10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 11. 11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 12. Named Entity Recognition 12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 13. Named Entity Recognition 13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 14. Challenges 1. Entity "importance" 2. Noisy & short text (Twitter), updates in the KB 14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 15. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! ! 15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 16. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! Q: How do you know an entity is important or has impact? A: If it is in Wikipedia, it is/has 16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 17. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 18. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74
  • 19. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74/140
  • 20. Challenge 2: Noisy data & changing KB Unsupervised method for generating 
 pseudo-ground truth (for training NER) 20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 21. Assumption A named-entity recognizer trained only on 
 KB entities will learn to recognize KB entities 21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 22. 22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus
  • 23. 23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet
  • 24. 24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker
  • 25. 25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 26. 26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 27. 27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 28. d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 29. 29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 30. 30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 31. 31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti- America Rant http://on.cc.com/1htk9Wo
  • 32. 32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 33. 33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus
  • 34. 34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2
  • 35. 35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 36. 36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. product organization organization
  • 37. 37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC
  • 38. 38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 39. 39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 40. 40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 …
  • 41. 41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 42. 42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 43. Future KB 43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB full KB small KB
  • 44. 44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB
  • 45. 45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 …
  • 46. 46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 … Evaluate
  • 47. Evaluation • Mention level (NER style) • Entity level ! 47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 48. 48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction • Mention level (NER style) • Entity level !
  • 49. • Mention level (NER style) • Entity level ! 49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 50. 50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation • Mention level (NER style) • Entity level This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 51. 51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 52. 52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 53. Experimental setup Data: Corpus: Twitter (TREC'11 MB: 4,832,838 tweets) KB: Wikipedia (Jan 4th, 2012) ! Components: EL: Semanticizer NERC: Custom 53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 54. 54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 }, { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 55. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
  • 56. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 57. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" B I O B I O O O O O O
  • 58. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" person person
  • 59. 59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 Future KB NERC Tweet NERC Model Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Predictions m1 , c1 m2, c2 … Entity Linker Sample Corpus Training data m1 , c1 m2 , c2
  • 60. From tweet to training sample 1. Convert EL output (Wikipedia concepts) to NERC labels; • Label entity span (B-I-O) & class (PER/LOC/ORG) ! 2. Pick "good" samples • entity linker's confidence score • textual quality 60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 61. 61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 62. Entity Class 1. Map Wikipedia entity to DBpedia entity 2. Retrieve entity class (ontology); • if Person: PER • if Organisation, Company, or Non-ProfitOrganisation: ORG • if Place, PopulatedPlace, City, Country: LOC • …? 62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 63. Sampling Methods 1. Entity linker confidence score 2. Textual quality 63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 64. Sampling 1: Confidence Score • Extract anchor text (a) to Wikipedia page (W)-mappings • Confidence score combines two signals: 1. How common is it that a is used as a link 2. How commonly is a used as a link to W 64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 65. Sampling 1: Confidence Score • Higher threshold = fewer entities, less noise • Lower threshold = fewer entities, more noise 65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 66. Sampling 2: Textual quality 66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 67. Sampling 2: Textual quality 67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Highest scoring tweets 1. Watching the History channel, Hitler’”⁹s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity. 2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction. 3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c) Lowest scoring tweets 1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_ 2. tell me what u think The GetMore Girls, Part One _URL_ 3. this girl better not go off on me rt
  • 68. Sampling 2: Textual quality • Compare different sampling strategies; • top tweets • medium tweets • medium+top tweets • low+medium+top tweets (no sampling) 68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 69. Results 69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 70. 70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 NERC NERC Model Unlabeled Tweet ?Entity Linker Future KB Tweet Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Sample Corpus Predictions m1 , c1 m2, c2 … RQ1: What is the impact of our sampling methods for 
 generating pseudo-ground truth? Training data m1 , c1 m2 , c2
  • 71. Findings: EL confidence score threshold 1. Higher threshold, higher accuracy ! ! ! ! ! ! ! Solid: Precision
 Dotted: Recall 71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
  • 72. Findings: EL confidence score threshold 2. Higher threshold, more predictions 72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 73. Findings: Textual Quality Sampling 73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 74. 74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Tweet Unlabeled Tweet ? NERC Model NERC Training data m1 , c1 m2 , c2 Sample Corpus m1 , e1 m2 , e2 Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus RQ2: What is the impact of the size of prior knowledge on detecting unknown entities? Entity Linker Today's KB Predictions m1 , c1 m2, c2 …
  • 75. Results: RQ2 Sampling 2: KB size (mentions) ! ! ! ! ! ! blue: Our method
 red: Baseline
 75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" 20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
  • 76. Conclusions Recall increases as amount of prior knowledge grows: 1. Able to deal with missing labels, justifying approach 2. Rate of unknown entity detection increases as KB grows 76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 77. Future Work • Next step: Closing the loop • Feed back to KB (entity normalization) • From PER/LOC/ORG entities to other classes: • Books, buildings, drugs, artists, …? • Apply to other domains, languages • From random sampling to time-based sampling 77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 78. Fin Questions? ! ! ! ! ! ! ! 78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media