Keynote Talk given at the 2nd International Workshop on Natural Language Processing for Informal Text (NLPIT 2016).
In conjunction with the 25th International World Wide Web Conference (WWW 2016), April 11-15, 2016, Montreal, Canada
3. Extracting and Linking Entities (NER/NEL)
“ Tampa Bay Lightning vs Canadiens in
Montreal tonight with @erikmannens
#hockey #NHL ”
12/04/2016 NLPIT Workshop @ WWW 2016 - 3
https://www.youtube.com/watch?v=Rmug-PUyIzI
4. Part of Speech (GATE Twitter POS)
Tampa NNP
Bay NNP
Lightning NNP
vs CC
Canadiens NNP
in IN
Montreal NNP
tonight NN
with IN
@erikmannens USR
#hockey HT
#NHL HT
NER: What is NHL?
https://gate.ac.uk/wiki/twitter-postagger.html
NEL: Which Montreal are we talking about?
5. What is #NHL? Type Ambiguity
Sports League
Organization
Place
Railway Line
6. What is #NHL? Type Ambiguity
http://schema.org/SportsEvent
http://dbpedia.org/ontology/Event
http://schema.org/Organization
http://dbpedia.org/ontology/IceHockeyLeague
Different infobox templates
7. Named Entity Recognition (NER)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC
tonight NN O
with IN O
@erikmannens USR PER
#hockey HT THG
#NHL HT ORG
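The per-token labels above can be turned into entity mentions by grouping adjacent tokens. A minimal sketch, assuming that a run of adjacent tokens sharing the same non-O label forms one mention (the slide shows no BIO scheme, so this grouping rule and the name `group_mentions` are illustrative assumptions, not the tagger's actual behavior):

```python
# (token, POS tag, NER label) triples copied from the slide.
TAGGED = [
    ("Tampa", "NNP", "ORG"), ("Bay", "NNP", "ORG"), ("Lightning", "NNP", "ORG"),
    ("vs", "CC", "O"), ("Canadiens", "NNP", "ORG"), ("in", "IN", "O"),
    ("Montreal", "NNP", "LOC"), ("tonight", "NN", "O"), ("with", "IN", "O"),
    ("@erikmannens", "USR", "PER"), ("#hockey", "HT", "THG"), ("#NHL", "HT", "ORG"),
]


def group_mentions(tagged):
    """Merge runs of adjacent tokens with the same non-O label into mentions."""
    mentions = []
    current_tokens, current_label = [], "O"
    for token, _pos, label in tagged:
        if label != "O" and label == current_label:
            # Same entity type as the previous token: extend the open mention.
            current_tokens.append(token)
            continue
        if current_tokens:
            # The label changed: close the mention collected so far.
            mentions.append((" ".join(current_tokens), current_label))
        current_tokens = [token] if label != "O" else []
        current_label = label
    if current_tokens:
        mentions.append((" ".join(current_tokens), current_label))
    return mentions


mentions = group_mentions(TAGGED)
for text, label in mentions:
    print(label, text)
```

One limitation of this rule: two distinct entities of the same type that happen to be adjacent would be merged into a single mention, which is why boundary-aware schemes such as BIO are often preferred.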
8. What is Montreal? Name Ambiguity
Montréal, Ardèche
Montréal, Aude
Montréal, Gers
Montreal, Wisconsin
Mont-ral, Catalonia
9. Named Entity Linking (NEL)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC http://dbpedia.org/resource/Montreal
tonight NN O
with IN O
@erikmannens USR PER NIL
#hockey HT THG
#NHL HT ORG
10. NERD: a framework for comparing NER APIs
NER
Stanford CoreNLP
Web APIs
http://nerd.eurecom.fr/
14. Research Questions
How to adapt an entity linking system
depending on different criteria?
How to design an entity linking system in
order to be able to process a large amount of
data in near real time?
15. ADEL: Adaptive Framework for NER
POS Tagger:
use a bidirectional dependency network
combine left-to-right and right-to-left CMMs
NER:
use a CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next ones
16. ADEL: Overlap Resolution
Detect overlaps among extractors using the boundaries of the entities
Different heuristics can be applied:
Merge: (“United States” and “States of America” => “United States of America”), the default behavior
Simple Substring: (“Florence” and “Florence May Harding” => “Florence” and “May Harding”)
Smart Substring: (“Giants of New York” and “New York” => “Giants” and “New York”)
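The Merge heuristic can be illustrated on character offsets. This is a minimal sketch, assuming each extractor reports mentions as (start, end) offsets into the same text. The function name `merge_overlaps` and the sample sentence are hypothetical, and ADEL's actual implementation may differ:

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def merge_overlaps(spans: List[Span]) -> List[Span]:
    """Merge heuristic: replace overlapping spans by their union."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "He moved to the United States of America in 2010."
a = text.index("United States")
b = text.index("States of America")
spans = [(a, a + len("United States")), (b, b + len("States of America"))]
merged_text = [text[s:e] for s, e in merge_overlaps(spans)]
print(merged_text)
```

The two substring heuristics need more than offsets (they split one mention around another), so they are not covered by this sketch.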
17. ADEL: KB Indexing
Create an index from DBpedia and Wikipedia
Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
18. ADEL: Adaptive Framework for NEL
Generate candidate links for all extracted mentions:
If any are found, they go to the linking method
If not, the mention is linked to NIL
Linking method:
ADEL linear formula:
r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated with the candidate l
D: the set of disambiguation pages associated with the candidate l
PR: PageRank associated with the candidate l
a, b and c are weights with the following properties: a > b > c and a + b + c = 1
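The formula itself is not reproduced in this transcript, so the following is only a sketch of one linear combination that is consistent with the symbols defined above. It assumes a weighted sum of string-match terms for the title, the redirects R and the disambiguation pages D, scaled by PR. It also assumes the Levenshtein distance is turned into a normalized similarity so that higher scores are better. The weight values, PageRank values and candidate data are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(m: str, s: str) -> float:
    """Assumption: normalized similarity in [0, 1] derived from the distance."""
    longest = max(len(m), len(s))
    return 1.0 if longest == 0 else 1.0 - levenshtein(m.lower(), s.lower()) / longest


def best(m, names):
    """Best match of the mention against a set of page names (0 if empty)."""
    return max((similarity(m, n) for n in names), default=0.0)


def score(m, title, redirects, disambiguations, pagerank, a=0.5, b=0.3, c=0.2):
    """Assumed form of r(l): weighted string match scaled by PageRank."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    match = a * similarity(m, title) + b * best(m, redirects) + c * best(m, disambiguations)
    return match * pagerank


# Hypothetical candidates: (title, redirects, disambiguation pages, PageRank).
candidates = [
    ("Montreal", ["Montréal", "Montreal, Quebec"], ["Montreal (disambiguation)"], 0.9),
    ("Montreal, Wisconsin", ["Montreal, WI"], ["Montreal (disambiguation)"], 0.1),
]
ranked = sorted(candidates, key=lambda cand: score("Montreal", *cand), reverse=True)
print([cand[0] for cand in ranked])
```

A mention with an empty candidate list never reaches `score`; following the slide, it would be linked to NIL.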
19. ADEL: Pruning for NER/NEL
k-NN machine learning algorithm
Why a pruning module?
Useful to correct errors from the extractor by removing wrong annotations. Example:
France played against Russia for a friendly match
Yesterday, I went to see Against in concert
Useful to adapt the annotations in order to follow a given guideline
Example: suppose we are participating in two different challenges; the first one counts dates as entities, and the second one does not
NEEL challenge: Jimmy Page was born the January 9th, 1944.
OKE challenge: Jimmy Page was born the January 9th, 1944.
22. Social Media: some definitions
Media Item: a photo or a video that is shared on a social network
Micropost: a text status message that can optionally accompany a media item
Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
Media Sharing Platforms: emphasis on sharing media, but with blurred boundaries with social networks, since users are encouraged to react to media content (like, comment, favorite, etc.)
23. Media Server
Composition of media item extractors (12 SNs)
Rely on search APIs + a fixed 30 s timeout window to provide results
Fallback on screen scraping when necessary (Twitter ecosystem)
Implemented as a Node.js server
Serialize results in a common schema (JSON)
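The fan-out with a fixed timeout window can be sketched as follows. The real Media Server is a Node.js application; this Python version only illustrates the pattern. `fake_extractor`, `search_all` and the result fields are hypothetical, and the timeout is shortened from 30 s so the example finishes quickly.

```python
import asyncio


async def fake_extractor(network, delay, term):
    """Hypothetical stand-in for one social network's search API call."""
    await asyncio.sleep(delay)
    # Assumed common schema: one JSON-like dict per media item.
    return [{"network": network, "term": term,
             "mediaUrl": f"https://example.org/{network}/1"}]


async def search_all(term, extractors, timeout_s=30.0):
    """Query every extractor concurrently; keep what arrived before the timeout."""
    tasks = {name: asyncio.ensure_future(fn(term)) for name, fn in extractors.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout_s)
    for task in pending:
        task.cancel()  # drop extractors that missed the window
    results = {}
    for name, task in tasks.items():
        if task in done and task.exception() is None:
            results[name] = task.result()
    return results


extractors = {
    "fast": lambda term: fake_extractor("fast", 0.01, term),
    "slow": lambda term: fake_extractor("slow", 1.0, term),
}
results = asyncio.run(search_all("hockey", extractors, timeout_s=0.2))
print(sorted(results))
```

Returning partial results, instead of failing when one network is slow, is what lets a fixed window bound the overall response time.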
https://github.com/tomayac/media-server
24. 12/04/2016 NLPIT Workshop @ WWW 2016
Deep link
Permalink
Clean text for NLP processing
Aggregate view of ALL social interactions
12 Social Networks
28. Media Finder Architecture
Media items harvesting using the Media Server
http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
https://github.com/vuknje/media-server (@tomayac fork)
Image near de-duplication
DCT signature on image and video frame,
Hamming distance between image pairs
Clustering and disambiguation
Named Entity Extraction using NERD
Topic Generation using LDA
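The near de-duplication step (DCT signature plus Hamming distance) can be sketched in pure Python. The slide does not specify how the signature is built, so this follows a common perceptual-hash recipe as an assumption: compute a 2-D DCT of a small grayscale image, keep the low-frequency coefficients, and threshold them at their median. The 8x8 test image is synthetic.

```python
import math


def dct_2d(pixels):
    """Naive, unnormalized 2-D DCT-II of a square grayscale matrix."""
    n = len(pixels)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [
        [sum(pixels[x][y] * cos[u][x] * cos[v][y]
             for x in range(n) for y in range(n))
         for v in range(n)]
        for u in range(n)
    ]


def dct_signature(pixels, keep=4):
    """Bit signature: low-frequency coefficients thresholded at their median."""
    coeffs = dct_2d(pixels)
    # Skip the DC term (index 0), which only encodes overall brightness.
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    return [1 if c > median else 0 for c in low]


def hamming(sig_a, sig_b):
    """Number of positions where the two bit signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Synthetic 8x8 image and a uniformly brightened copy of it.
image = [[(37 * x + 11 * y * y + 5 * x * y) % 256 for y in range(8)] for x in range(8)]
brighter = [[value + 10 for value in row] for row in image]
sig_a, sig_b = dct_signature(image), dct_signature(brighter)
print(hamming(sig_a, sig_b))
```

A uniform brightness change only affects the DC term, which is why the two signatures should be nearly identical. In practice a pair of images would be flagged as near-duplicates when the distance falls below a chosen threshold.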
29. Media Finder (named entities clustering)
31. Media Finder
Live Topic Generation from Event Streams
Published at WWW 2013 Demo Track
http://www.youtube.com/watch?v=8iRiwz7cDYY
32. Tracking an event: Italian Election
Repeated queries over a period of time
We tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
Cron job: every 30 minutes over the 6 days
Slice the data into 24-hour slots
Research questions:
Can we re-create the news headlines?
Storyboarding:
http://mediafinder.eurecom.fr/story/elezioni2013
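Slicing the harvested posts into 24-hour slots can be sketched as below. The post texts and timestamps are hypothetical, and `slice_into_slots` is an illustrative name rather than part of Media Finder.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def slice_into_slots(posts, start, slot=timedelta(hours=24)):
    """Group (timestamp, text) posts into consecutive fixed-size slots."""
    slots = defaultdict(list)
    for timestamp, text in posts:
        index = (timestamp - start) // slot  # integer slot number
        slots[index].append(text)
    return dict(slots)


start = datetime(2013, 2, 26)
posts = [
    (datetime(2013, 2, 26, 10, 0), "post A #elezioni2013"),
    (datetime(2013, 2, 26, 23, 30), "post B #elezioni2013"),
    (datetime(2013, 2, 27, 0, 0), "post C #elezioni2013"),
    (datetime(2013, 3, 3, 12, 0), "post D #elezioni2013"),
]
slots = slice_into_slots(posts, start)
for index in sorted(slots):
    print(index, slots[index])
```

With a crawl every 30 minutes, each 24-hour slot aggregates up to 48 query results, which can then be clustered per day.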
33. Tracking an event: Italian Election
Dataset:
~16501 microposts containing (duplicate) media items
~21087 Named Entities extracted
Clustering
NER and LDA
Generate a Bag of Entities (BOE), each disambiguated with a DBpedia URI
Examples:
Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
34. Tracking an event: Italian Election
Tracking and Analyzing The 2013 Italian Election
Published at ESWC 2013 Demo Track
http://www.youtube.com/watch?v=jIMdnwMoWnk
37. “This is Nikita, a security guard from one of the bars in St. Petersburg.”
NER
Example taken from the transcript of
https://www.ted.com/talks/2089
PERSON
FUNCTION
LOCATION
Category: the entity type assigned in the NER task.
Natural Language Processing (NLP) task: disambiguating the entity to a URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg.
Annotations: Named Entities
38. MF: Hot Spots
1. Clustering of consecutive chapters which talk about similar topics and entities
2. Ordering of those fragments based on annotation relevance (TF-IDF)
3. Filtering: Hot Spots are fragments whose relative relevance falls under the first quarter of the final score distribution
[Figure: chapters grouped into Hot Spot 1 and Hot Spot 2]
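The ordering and filtering steps for Hot Spots can be sketched as follows. This sketch assumes that each fragment is represented by its list of annotation labels and that its relevance is the sum of the TF-IDF weights of those labels. It also assumes "first quarter" means the top 25% of fragments by score. The fragment contents are hypothetical, and the clustering of consecutive chapters is taken as already done.

```python
import math
from collections import Counter


def tfidf_scores(fragments):
    """Score each fragment by the summed TF-IDF of its annotations."""
    n = len(fragments)
    doc_freq = Counter(label for frag in fragments for label in set(frag))
    scores = []
    for frag in fragments:
        counts = Counter(frag)
        scores.append(sum(
            (count / len(frag)) * math.log(n / doc_freq[label])
            for label, count in counts.items()
        ))
    return scores


def hot_spots(fragments):
    """Return indices of fragments in the top quarter of the score ranking."""
    scores = tfidf_scores(fragments)
    ranked = sorted(range(len(fragments)), key=lambda i: scores[i], reverse=True)
    keep = max(1, len(ranked) // 4)  # assumption: keep at least one fragment
    return ranked[:keep]


fragments = [
    ["Nikita", "Saint_Petersburg", "bar"],
    ["bar", "bar"],
    ["bar", "security"],
    ["bar"],
]
selected = hot_spots(fragments)
print(selected)
```

Annotations that occur in every fragment get an IDF of zero, so fragments made only of ubiquitous entities sink to the bottom of the ranking.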