Linking Entities for Enriching and Structuring Social Media Content

Raphaël Troncy <raphael.troncy@eurecom.fr>
@rtroncy

12/04/2016 NLPIT Workshop @ WWW 2016 - 2

Extracting and Linking Entities (NER/NEL)
 “Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL”
https://www.youtube.com/watch?v=Rmug-PUyIzI

Part of Speech (GATE Twitter POS)
Tampa NNP
Bay NNP
Lightning NNP
vs CC
Canadiens NNP
in IN
Montreal NNP
tonight NN
with IN
@erikmannens USR
#hockey HT
#NHL HT
https://gate.ac.uk/wiki/twitter-postagger.html
NER: What is NHL?
NEL: Which Montreal are we talking about?

What is #NHL? Type Ambiguity
Sports League
Organization
Place
Railway Line

What is #NHL? Type Ambiguity
http://schema.org/SportsEvent
http://dbpedia.org/ontology/Event
http://schema.org/Organization
http://dbpedia.org/ontology/IceHockeyLeague
Different infobox templates

Named Entity Recognition (NER)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC
tonight NN O
with IN O
@erikmannens USR PER
#hockey HT THG
#NHL HT ORG

What is Montreal? Name Ambiguity
Montréal, Ardèche
Montréal, Aude
Montréal, Gers
Montreal, Wisconsin
Mont-ral, Catalonia

Named Entity Linking (NEL)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC http://dbpedia.org/resource/Montreal
tonight NN O
with IN O
@erikmannens USR PER NIL
#hockey HT THG
#NHL HT ORG
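
The token table above can be turned into entity mentions by grouping consecutive tokens that share the same NER label. The following is a minimal sketch, not the pipeline used in the talk: the `Token` class and the `mentions` helper are hypothetical names, and grouping purely on identical adjacent labels (without a BIO scheme) is a simplifying assumption.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One row of the table: token text, POS tag, NER label, optional link."""
    text: str
    pos: str
    ner: str
    link: Optional[str] = None


def mentions(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Group adjacent tokens with the same NER label; skip the 'O' (outside) label."""
    result = []
    for label, group in groupby(tokens, key=lambda t: t.ner):
        if label != "O":
            result.append((" ".join(t.text for t in group), label))
    return result


# The tweet from the slides, with the tags shown in the table.
TWEET = [
    Token("Tampa", "NNP", "ORG"), Token("Bay", "NNP", "ORG"),
    Token("Lightning", "NNP", "ORG"), Token("vs", "CC", "O"),
    Token("Canadiens", "NNP", "ORG"), Token("in", "IN", "O"),
    Token("Montreal", "NNP", "LOC", "http://dbpedia.org/resource/Montreal"),
    Token("tonight", "NN", "O"), Token("with", "IN", "O"),
    Token("@erikmannens", "USR", "PER", "NIL"),
    Token("#hockey", "HT", "THG"), Token("#NHL", "HT", "ORG"),
]

if __name__ == "__main__":
    for text, label in mentions(TWEET):
        print(label, text)
```

Two different entities of the same type that sit next to each other would be merged by this grouping, which is why real taggers usually emit begin/inside markers.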

NERD: a framework for comparing NER APIs
 NER
Stanford CoreNLP
 Web APIs
http://nerd.eurecom.fr/

NERD: AlchemyAPI
Incorrect boundaries
No disambiguation
No dereferencing for @mention

NERD: Dandelion
Everything is a Thing

NERDML

Research Questions
 How to adapt an entity linking system
depending on different criteria?
 How to design an entity linking system in
order to be able to process a large amount of
data in near real time?

ADEL: Adaptive Framework for NER
 POS Tagger:
 use bidirectional dependency network
 combine CMM left to right and right to left
 NER:
 use CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next one

ADEL: Overlap Resolution
 Detect overlaps among extractors with the boundaries of the entities
 Different heuristics can be applied:
 Merge: (“United States” and “States of America” => “United States of America”) default behavior
 Simple Substring: (“Florence” and “Florence May Harding” => “Florence” and “May Harding”)
 Smart Substring: (“Giants of New York” and “New York” => “Giants” and “New York”)
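
The three heuristics can be sketched over character offsets. This is an illustrative reading of the examples above, not ADEL's actual code: the function names, the half-open `(start, end)` span representation, and the `CONNECTORS` stop-word list used for the "smart" variant are all assumptions.

```python
import re
from typing import FrozenSet, List, Optional, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)

# Hypothetical list of connecting words dropped by the "smart" variant.
CONNECTORS: FrozenSet[str] = frozenset({"of", "the"})


def merge(a: Span, b: Span) -> Span:
    """Merge: replace two overlapping spans by the span covering both."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def _trim(text: str, start: int, end: int, stop: FrozenSet[str]) -> Optional[Span]:
    """Shrink [start, end) so it has no whitespace or stop words at its edges."""
    tokens = [(start + m.start(), start + m.end())
              for m in re.finditer(r"\S+", text[start:end])]
    while tokens and text[tokens[0][0]:tokens[0][1]].lower() in stop:
        tokens.pop(0)
    while tokens and text[tokens[-1][0]:tokens[-1][1]].lower() in stop:
        tokens.pop()
    return (tokens[0][0], tokens[-1][1]) if tokens else None


def split_substring(text: str, a: Span, b: Span,
                    stop: FrozenSet[str] = frozenset()) -> List[Span]:
    """Keep the contained span and whatever is left of the containing one.

    With an empty `stop` set this is the simple variant; passing CONNECTORS
    gives the smart variant, which also drops dangling connecting words.
    """
    inner, outer = sorted((a, b), key=lambda s: s[1] - s[0])
    if not (outer[0] <= inner[0] and inner[1] <= outer[1]):
        raise ValueError("one span must contain the other")
    pieces = [_trim(text, outer[0], inner[0], stop), inner,
              _trim(text, inner[1], outer[1], stop)]
    return [p for p in pieces if p is not None]


if __name__ == "__main__":
    text = "Giants of New York"
    for start, end in split_substring(text, (0, 18), (10, 18), CONNECTORS):
        print(text[start:end])
```

Working on offsets rather than strings keeps the result aligned with the original text, so the resolved mentions can still be compared against extractor boundaries.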

ADEL: KB Indexing
 Create index from DBpedia and Wikipedia
 Integrate external data such as PageRank and HITS scores from Hasso Plattner Institute

ADEL: Adaptive Framework for NEL
 Generate candidate links for all extracted mentions:
 If any, they go to the linking method
 If not, they are linked to NIL
 Linking method:
 ADEL linear formula:
r(l): the score of the candidate l
L: the Levenshtein distance
m: the extracted mention
title: the title of the candidate l
R: the set of redirect pages associated with the candidate l
D: the set of disambiguation pages associated with the candidate l
PR: PageRank associated with the candidate l
a, b and c are weights following the properties: a > b > c and a + b + c = 1
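
The formula itself is not preserved in this text. One plausible reading of the definitions above is r(l) = (a · L(m, title) + b · max L(m, R) + c · max L(m, D)) · PR(l); the sketch below implements that reading and should be treated as an assumption, not as the exact ADEL formula. It also assumes the Levenshtein distance is normalized into a similarity in [0, 1] so that higher scores are better, and the candidate dictionaries and PageRank values are made up for illustration.

```python
from typing import Dict, List


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Assumption: turn the distance into a case-insensitive similarity in [0, 1]."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s.lower(), t.lower()) / longest


def score(mention: str, candidate: Dict, a: float = 0.5, b: float = 0.3,
          c: float = 0.2) -> float:
    """Assumed reading of r(l): weighted string similarity times PageRank."""
    if not (a > b > c and abs(a + b + c - 1.0) < 1e-9):
        raise ValueError("weights must satisfy a > b > c and a + b + c = 1")
    best_r = max((similarity(mention, r) for r in candidate["redirects"]), default=0.0)
    best_d = max((similarity(mention, d) for d in candidate["disambiguations"]), default=0.0)
    title = similarity(mention, candidate["title"])
    return (a * title + b * best_r + c * best_d) * candidate["pagerank"]


def link(mention: str, candidates: List[Dict]) -> str:
    """Mentions without candidates are linked to NIL; otherwise pick the best score."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda cand: score(mention, cand))["uri"]


# Illustrative candidates; the PageRank values are invented.
CANDIDATES = [
    {"uri": "http://dbpedia.org/resource/Montreal", "title": "Montreal",
     "redirects": ["Montréal"], "disambiguations": [], "pagerank": 0.9},
    {"uri": "http://dbpedia.org/resource/Montreal,_Wisconsin",
     "title": "Montreal, Wisconsin", "redirects": [], "disambiguations": [],
     "pagerank": 0.1},
]

if __name__ == "__main__":
    print(link("Montreal", CANDIDATES))
    print(link("@erikmannens", []))
```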

ADEL: Pruning for NER/NEL
 k-NN machine learning algorithm
 Why a pruning module?
 Useful to correct the errors from the extractor by removing wrong annotations. Example:
 France played against Russia for a friendly match
 Yesterday, I went to see Against in concert
 Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges; the first one counts dates as entities, and the second one does not
 NEEL challenge: Jimmy Page was born the January 9th, 1944.
 OKE challenge: Jimmy Page was born the January 9th, 1944.
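
A k-NN pruner can be sketched as a keep/drop classifier over annotation features. The slides do not say which features ADEL uses, so the three binary features below (capitalized, tagged as proper noun, typed as a date) and the tiny training set are purely hypothetical; only the k-NN voting itself follows the technique named above.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

# (capitalized, POS is proper noun, typed as date) -> keep the annotation?
Example = Tuple[Tuple[int, int, int], bool]

# Hypothetical training data for a guideline that does not count dates.
TRAIN: List[Example] = [
    ((1, 1, 0), True),   # e.g. "Against" as a band name
    ((1, 1, 0), True),
    ((0, 0, 0), False),  # e.g. "against" as a preposition
    ((0, 0, 0), False),
    ((0, 0, 1), False),  # dates are pruned under this guideline
    ((1, 0, 1), False),
]


def knn_keep(train: List[Example], features: Sequence[int], k: int = 3) -> bool:
    """Majority vote among the k training examples closest in Euclidean distance."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_keep(TRAIN, (0, 0, 0)))  # "against" in the first sentence
    print(knn_keep(TRAIN, (1, 1, 0)))  # "Against" in the second sentence
```

Swapping the training set is what adapts the same extractor output to a different guideline, for instance one that keeps dates.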

ADEL Evaluation
 #Micropost2014 NEEL Challenge – ADEL v1
 OKE2015 Challenge – ADEL v1
 OKE2016 Challenge – ADEL v2

System      E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
F-measure   70.06  54.93    49.9     46.29  45.37  45.23      39.02

System      ADEL   FOX    FRED
F-measure   60.75  49.88  34.73

System      ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure   76.2   52.3      47.9  46.4   41.5      31.6  0

System      ADEL
F-measure   78.8

System      ADEL
F-measure   56.5

ADEL Live Demo

Social Media: some definitions
 Media Item: a photo or a video that is shared on a social network
 Micropost: a text status message that can optionally accompany a media item
 Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
 Media Sharing Platforms: emphasis on sharing media but blurred boundaries with social networks since users are encouraged to react to media content (like, comment, favorite, etc.)

Media Server
 Composition of media item extractors (12 SNs)
 Rely on search APIs + a fixed 30s timeout window to provide results
 Fallback on screen scraping when necessary (Twitter ecosystem)
 Implemented as a NodeJS server
 Serialize results in a common schema (JSON)
https://github.com/tomayac/media-server
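
The "query all extractors, keep what arrives within a fixed window" idea can be sketched as follows. The actual Media Server is a NodeJS application; this Python version only illustrates the pattern, and the extractor functions, the result fields (`permalink`, `text`) and the short window used in the demo are assumptions.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, List

Extractor = Callable[[str], Awaitable[List[dict]]]


async def search_all(extractors: Dict[str, Extractor], term: str,
                     window: float = 30.0) -> Dict[str, List[dict]]:
    """Query every extractor concurrently; keep what finishes within `window` seconds."""
    tasks = {asyncio.create_task(fn(term)): name for name, fn in extractors.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=window)
    for task in pending:
        task.cancel()  # extractors that are too slow are simply dropped
    results = {}
    for task in done:
        if task.exception() is None:  # ignore extractors that failed
            results[tasks[task]] = task.result()
    return results


# Hypothetical extractors standing in for per-network search API wrappers.
async def fast_network(term: str) -> List[dict]:
    await asyncio.sleep(0.01)
    return [{"permalink": "https://example.org/post/1", "text": f"photo about {term}"}]


async def slow_network(term: str) -> List[dict]:
    await asyncio.sleep(5)
    return []


if __name__ == "__main__":
    found = asyncio.run(search_all(
        {"fast": fast_network, "slow": slow_network}, "hockey", window=0.1))
    print(json.dumps(found, indent=2))  # one common JSON shape for all networks
```

A fixed window trades completeness for predictable latency: the caller always gets an answer after at most `window` seconds, even if some networks are slow.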

Deep link
Permalink
Clean text for NLP processing
Aggregate view of ALL social interactions
12 Social Networks

Media Finder (www2013)

Media Finder (zooming on media items)

Media Finder (timeline view)

Media Finder Architecture
 Media items harvesting using the Media Server
http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
https://github.com/vuknje/media-server (@tomayac fork)
 Image near de-duplication
DCT signature on image and video frame,
Hamming distance between image pairs
 Clustering and disambiguation
Named Entity Extraction using NERD
Topic Generation using LDA
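
Near de-duplication with a DCT signature and Hamming distance can be sketched in plain Python. This is a generic perceptual-hash style sketch, not the Media Finder implementation: it assumes the image has already been reduced to a small square grayscale matrix, and the 8x8 low-frequency block and median threshold are common choices rather than values taken from the slides.

```python
import math
from statistics import median
from typing import List

Image = List[List[float]]  # small square grayscale matrix, e.g. 16x16


def dct_1d(values: List[float]) -> List[float]:
    """Unnormalized DCT-II of a vector."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (x + 0.5) * k / n) for x, v in enumerate(values))
            for k in range(n)]


def dct_2d(img: Image) -> Image:
    """Separable 2D DCT: transform rows, then columns."""
    rows = [dct_1d(row) for row in img]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def signature(img: Image, size: int = 8) -> int:
    """Bit signature: low-frequency coefficients compared with their median."""
    coeffs = dct_2d(img)
    low = [coeffs[u][v] for u in range(size) for v in range(size)][1:]  # skip DC term
    threshold = median(low)
    bits = 0
    for value in low:
        bits = (bits << 1) | int(value > threshold)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Synthetic test pattern and a uniformly brightened copy of it.
    image = [[float((3 * x * x + 7 * y + x * y) % 256) for x in range(16)]
             for y in range(16)]
    brighter = [[pixel + 3.0 for pixel in row] for row in image]
    print(hamming(signature(image), signature(brighter)))
```

Two media items would be treated as near-duplicates when the Hamming distance between their signatures falls below a chosen threshold; for video, the same signature can be computed per sampled frame.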

Media Finder (named entities clustering)

Media Finder (zooming in a cluster)

Media Finder
 Live Topic Generation from Event Streams
Published at WWW 2013 Demo Track
http://www.youtube.com/watch?v=8iRiwz7cDYY

Tracking an event: Italian Election
 Repeated queries over a period of time
We have tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
Cron job: every 30 minutes over the 6 days
Slice the data into 24-hour slots
 Research questions:
Can we re-create the news headlines?
 Storyboarding:
http://mediafinder.eurecom.fr/story/elezioni2013

 Dataset:
~16501 microposts containing (duplicate) media items
~21087 Named Entities extracted
 Clustering
NER and LDA
Generate Bag of Entities (BOE) disambiguated with a DBpedia URI
 Examples:
Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
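
Slicing the collected microposts into 24-hour slots and building a bag of entities per slot can be sketched as below. The function name, the `(timestamp, entities)` input shape and the sample posts are assumptions; entities are plain labels here, whereas the slides use DBpedia URIs.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Post = Tuple[datetime, List[str]]  # (publication time, extracted entities)


def bag_of_entities_per_slot(posts: Iterable[Post], start: datetime,
                             slot: timedelta = timedelta(hours=24),
                             top: int = 3) -> Dict[int, List[str]]:
    """Count entities per time slot and return the most frequent ones per slot."""
    slots: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, entities in posts:
        index = (timestamp - start) // slot  # 0 for the first day, 1 for the second...
        slots[index].update(entities)
    return {index: [entity for entity, _ in counts.most_common(top)]
            for index, counts in sorted(slots.items())}


# Made-up posts for illustration.
POSTS: List[Post] = [
    (datetime(2013, 2, 26, 9, 0), ["Bersani", "Grillo"]),
    (datetime(2013, 2, 26, 18, 30), ["Grillo"]),
    (datetime(2013, 2, 27, 8, 0), ["Monti"]),
]

if __name__ == "__main__":
    print(bag_of_entities_per_slot(POSTS, start=datetime(2013, 2, 26)))
```

The top entities of each slot give a rough, headline-like summary of what was discussed that day.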

 Tracking and Analyzing The 2013 Italian Election
Published at ESWC 2013 Demo Track
http://www.youtube.com/watch?v=jIMdnwMoWnk

Searching and browsing TED Talks

Annotations: Named Entities
“This is Nikita, a security guard from one of the bars in St. Petersburg.”
Example taken from the transcript of https://www.ted.com/talks/2089
NER: PERSON, FUNCTION, LOCATION
Category: type in the NER task.
Natural Language Processing (NLP) task: disambiguating URL in a knowledge base.
E.g. http://dbpedia.org/resource/Saint_Petersburg.

MF: Hot Spots
1. Clustering of consecutive chapters which talk about similar topics and entities
2. Ordering of those fragments based on annotation relevance (TF-IDF)
3. Filtering: Hot Spots are fragments whose relative relevance falls under the first quarter of the final score distribution
[Figure: chapters grouped into Hot Spot 1 and Hot Spot 2]
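
The three steps can be sketched as follows. Several details are assumptions, not taken from the slides: chapters are represented as non-empty lists of annotation labels, similarity between consecutive chapters is measured with Jaccard overlap and a 0.5 threshold, relevance is a raw-count TF-IDF sum, and "first quarter" is read as the top 25% of fragments by score.

```python
import math
from collections import Counter
from typing import List


def cluster_consecutive(chapters: List[List[str]], threshold: float = 0.5) -> List[List[str]]:
    """Step 1: merge a chapter into the previous fragment when annotations overlap enough."""
    fragments: List[List[str]] = []
    for annotations in chapters:
        if fragments:
            current, new = set(fragments[-1]), set(annotations)
            union = current | new
            if union and len(current & new) / len(union) >= threshold:
                fragments[-1].extend(annotations)
                continue
        fragments.append(list(annotations))
    return fragments


def tfidf_scores(fragments: List[List[str]]) -> List[float]:
    """Step 2: relevance of a fragment = sum of tf * idf over its annotations."""
    n = len(fragments)
    df = Counter(term for fragment in fragments for term in set(fragment))
    return [sum(count * math.log(n / df[term])
                for term, count in Counter(fragment).items())
            for fragment in fragments]


def hot_spots(fragments: List[List[str]]) -> List[int]:
    """Step 3: indices of the fragments in the top quarter of the score ranking."""
    scores = tfidf_scores(fragments)
    keep = max(1, math.ceil(len(fragments) / 4))
    ranked = sorted(range(len(fragments)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Made-up chapter annotations for illustration.
CHAPTERS = [
    ["Nikita", "St. Petersburg"],
    ["Nikita", "St. Petersburg", "bar"],
    ["Moscow"],
    ["Moscow", "Kremlin"],
    ["photography"],
    ["art"],
]

if __name__ == "__main__":
    fragments = cluster_consecutive(CHAPTERS)
    print(hot_spots(fragments))
```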

http://www.slideshare.net/troncy
