Linking Entities for Enriching and Structuring Social Media Content
Raphaël Troncy <raphael.troncy@eurecom.fr>
@rtroncy
12/04/2016 NLPIT Workshop @ WWW 2016 - 2
Extracting and Linking Entities (NER/NEL)
• “Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL”
https://www.youtube.com/watch?v=Rmug-PUyIzI
Part of Speech (GATE Twitter POS)
Tampa NNP
Bay NNP
Lightning NNP
vs CC
Canadiens NNP
in IN
Montreal NNP
tonight NN
with IN
@erikmannens USR
#hockey HT
#NHL HT
NER: What is NHL?
NEL: Which Montreal are we talking about?
https://gate.ac.uk/wiki/twitter-postagger.html
What is #NHL? Type Ambiguity
Sports League
Organization
Place
Railway Line
What is #NHL? Type Ambiguity
http://schema.org/SportsEvent
http://dbpedia.org/ontology/Event
http://schema.org/Organization
http://dbpedia.org/ontology/IceHockeyLeague
Different infobox templates
Named Entity Recognition (NER)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC
tonight NN O
with IN O
@erikmannens USR PER
#hockey HT THG
#NHL HT ORG
What is Montreal? Name Ambiguity
Montréal, Ardèche
Montréal, Aude
Montréal, Gers
Montreal, Wisconsin
Mont-ral, Catalonia
Named Entity Linking (NEL)
Tampa NNP ORG
Bay NNP ORG
Lightning NNP ORG
vs CC O
Canadiens NNP ORG
in IN O
Montreal NNP LOC http://dbpedia.org/resource/Montreal
tonight NN O
with IN O
@erikmannens USR PER NIL
#hockey HT THG
#NHL HT ORG
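The token table above can be turned into entity mentions with a few lines of code. The following is a minimal sketch, assuming that consecutive tokens carrying the same non-O type form one mention and that a mention without a known link is reported as NIL; the names `TOKENS`, `LINKS`, `mentions` and `link` are hypothetical and not part of any tool named in the slides.

```python
from itertools import groupby

# (token, POS tag, NER type) triples copied from the table above.
TOKENS = [
    ("Tampa", "NNP", "ORG"), ("Bay", "NNP", "ORG"), ("Lightning", "NNP", "ORG"),
    ("vs", "CC", "O"), ("Canadiens", "NNP", "ORG"), ("in", "IN", "O"),
    ("Montreal", "NNP", "LOC"), ("tonight", "NN", "O"), ("with", "IN", "O"),
    ("@erikmannens", "USR", "PER"), ("#hockey", "HT", "THG"), ("#NHL", "HT", "ORG"),
]

# Hypothetical lookup table standing in for a real linker.
LINKS = {"Montreal": "http://dbpedia.org/resource/Montreal"}


def mentions(tokens):
    """Group consecutive tokens with the same non-O type into one mention."""
    found = []
    for label, run in groupby(tokens, key=lambda t: t[2]):
        if label != "O":
            found.append((" ".join(tok for tok, _, _ in run), label))
    return found


def link(found, links):
    """Attach a URI to each mention, or NIL when no link is known (assumption)."""
    return [(text, label, links.get(text, "NIL")) for text, label in found]


if __name__ == "__main__":
    for row in link(mentions(TOKENS), LINKS):
        print(row)
```

Grouping by identical adjacent types is a simplification; a BIO-style scheme would be needed to separate two adjacent entities of the same type.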
NERD: a framework for comparing NER APIs
• NER
  Stanford CoreNLP
• Web APIs
http://nerd.eurecom.fr/
NERD: AlchemyAPI
Incorrect boundaries
No disambiguation
No dereferencing for @mention
NERD: Dandelion
Everything is a Thing
No dereferencing for @mention
NERDML
No dereferencing for @mention
Research Questions
• How can an entity linking system be adapted to different criteria?
• How can an entity linking system be designed to process a large amount of data in near real time?
ADEL: Adaptive Framework for NER
• POS tagger:
  - uses a bidirectional dependency network
  - combines left-to-right and right-to-left CMMs (conditional Markov models)
• NER:
  - uses a CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next ones
ADEL: Overlap Resolution
• Detect overlaps among extractors using the boundaries of the entities
• Different heuristics can be applied:
  - Merge (“United States” and “States of America” => “United States of America”); default behavior
  - Simple Substring (“Florence” and “Florence May Harding” => “Florence” and “May Harding”)
  - Smart Substring (“Giants of New York” and “New York” => “Giants” and “New York”)
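The three heuristics above can be sketched over token spans. This is only an illustration under stated assumptions: spans are half-open `(start, end)` token indices, and "smart" is read as "drop connecting words such as 'of' from the leftover part". The connector list and all function names are hypothetical, not ADEL's actual implementation.

```python
# Hypothetical list of connecting words trimmed by the smart heuristic.
CONNECTORS = {"of", "the", "in"}


def overlaps(a, b):
    """True when two half-open token spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge(a, b):
    """Merge: replace two overlapping spans with their union."""
    return [(min(a[0], b[0]), max(a[1], b[1]))]


def simple_substring(outer, inner):
    """Simple Substring: keep the inner span plus what is left of the outer one."""
    pieces = [(outer[0], inner[0]), inner, (inner[1], outer[1])]
    return [p for p in pieces if p[0] < p[1]]


def smart_substring(outer, inner, tokens):
    """Smart Substring: as above, but trim connectors from the leftover parts."""
    result = []
    for start, end in simple_substring(outer, inner):
        if (start, end) != inner:
            while start < end and tokens[start].lower() in CONNECTORS:
                start += 1
            while start < end and tokens[end - 1].lower() in CONNECTORS:
                end -= 1
        if start < end:
            result.append((start, end))
    return result


def text(span, tokens):
    """Render a span back to its surface string."""
    return " ".join(tokens[span[0]:span[1]])


if __name__ == "__main__":
    toks = "Giants of New York".split()
    print([text(s, toks) for s in smart_substring((0, 4), (2, 4), toks)])
```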
ADEL: KB Indexing
• Create an index from DBpedia and Wikipedia
• Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
ADEL: Adaptive Framework for NEL
• Generate candidate links for all extracted mentions:
  - If there are any candidates, they go to the linking method
  - If not, the mention is linked to NIL
• Linking method:
  - ADEL linear formula, where:
    r(l): the score of the candidate l
    L: the Levenshtein distance
    m: the extracted mention
    title: the title of the candidate l
    R: the set of redirect pages associated with the candidate l
    D: the set of disambiguation pages associated with the candidate l
    PR: PageRank associated with the candidate l
    a, b and c are weights satisfying a > b > c and a + b + c = 1
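The formula itself is not reproduced in the text above, so the following is only a sketch of how a linear score over these symbols could be computed. It assumes that three Levenshtein terms (title, best redirect, best disambiguation page) are weighted by a, b and c and that the weighted sum is multiplied by the PageRank. It also swaps the raw distance for a normalised similarity so that closer strings score higher. The weight values, the candidate dictionary layout and the function names are all hypothetical.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def similarity(x, y):
    """Assumption: turn the distance into a [0, 1] similarity."""
    x, y = x.lower(), y.lower()
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest


def score(mention, cand, a=0.5, b=0.3, c=0.2):
    """Assumed form of r(l): weighted string similarities times PageRank."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9

    def best(names):
        return max((similarity(mention, n) for n in names), default=0.0)

    lexical = (a * similarity(mention, cand["title"])
               + b * best(cand["redirects"])
               + c * best(cand["disambiguations"]))
    return lexical * cand["pagerank"]


def link(mention, candidates):
    """Return the best-scoring candidate URI, or NIL when there is none."""
    if not candidates:
        return "NIL"
    return max(candidates, key=lambda cand: score(mention, cand))["uri"]
```

With this shape, a mention such as "Montreal" would favour the candidate whose title matches closely and whose PageRank is high, while a mention with no candidates at all falls through to NIL.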
ADEL: Pruning for NER/NEL
• k-NN machine learning algorithm
• Why a pruning module?
  - Useful to correct errors from the extractor by removing wrong annotations. Example:
    France played against Russia for a friendly match
    Yesterday, I went to see Against in concert
  - Useful to adapt the annotations to follow a given guideline. Example: suppose we are participating in two different challenges; the first one counts dates as entities and the second one does not
    NEEL challenge: Jimmy Page was born on January 9th, 1944.
    OKE challenge: Jimmy Page was born on January 9th, 1444.
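A pruning step of this kind can be sketched as a small k-NN classifier that decides, per annotation, whether to keep or drop it. The feature vector used here (capitalised, date-like, linker score) and the training examples are hypothetical; the slides only state that k-NN is used, not which features feed it.

```python
import math
from collections import Counter


def knn_predict(train, features, k=3):
    """Majority vote among the k training examples closest to `features`."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical training data for a guideline that does NOT count dates as
# entities: (is_capitalised, is_date_like, linker_score) -> keep / drop.
TRAIN = [
    ((1.0, 0.0, 0.9), "keep"),
    ((1.0, 0.0, 0.7), "keep"),
    ((1.0, 0.0, 0.8), "keep"),
    ((0.0, 0.0, 0.1), "drop"),  # e.g. lowercase "against" wrongly extracted
    ((0.0, 1.0, 0.2), "drop"),  # date mention
    ((1.0, 1.0, 0.3), "drop"),  # date mention
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, (1.0, 0.0, 0.85)))
    print(knn_predict(TRAIN, (0.0, 1.0, 0.1)))
```

Adapting to another guideline would then amount to retraining on examples labelled under that guideline, for instance with date mentions labelled "keep".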
ADEL Evaluation
• #Micropost2014 NEEL Challenge – ADEL v1
• #Micropost2015 NEEL Challenge – ADEL v1
• #Micropost2016 NEEL Challenge – ADEL v2
• OKE2015 Challenge – ADEL v1
• OKE2016 Challenge – ADEL v2

Result tables (in slide order):

System     E2E    UTwente  DataTXT  ADEL   AIDA   Hyberabad  SAP
F-measure  70.06  54.93    49.9     46.29  45.37  45.23      39.02

System     ADEL   FOX    FRED
F-measure  60.75  49.88  34.73

System     ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
F-measure  76.2   52.3      47.9  46.4   41.5      31.6  0

System     ADEL
F-measure  78.8

System     ADEL
F-measure  56.5
ADEL Live Demo
Social Media: some definitions
• Media Item: a photo or a video that is shared on a social network
• Micropost: a text status message that can optionally accompany a media item
• Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities
• Media Sharing Platforms: emphasis on sharing media, but with blurred boundaries with social networks since users are encouraged to react to media content (like, comment, favorite, etc.)
Media Server
• Composition of media item extractors (12 social networks)
• Relies on search APIs plus a fixed 30 s timeout window to provide results
• Falls back on screen scraping when necessary (Twitter ecosystem)
• Implemented as a Node.js server
• Serializes results in a common schema (JSON)
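The pattern described above (several extractors queried in parallel, a fixed timeout window, results normalised to one schema) can be sketched as follows. This is not the Media Server's code, which is a Node.js application; it is a Python illustration, and the extractor callables, the raw field names (`url`, `text`) and the output keys (`network`, `mediaUrl`, `micropost`) are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(extractors, term, timeout=30.0):
    """Query every extractor in parallel; keep what arrives within `timeout`."""
    pool = ThreadPoolExecutor(max_workers=len(extractors))
    futures = {pool.submit(fn, term): name for name, fn in extractors.items()}
    done, _pending = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on extractors that are too slow

    items = []
    for fut in done:
        if fut.exception() is not None:
            continue  # a failing network must not break the whole search
        for raw in fut.result():
            # Normalise each raw result to one common (hypothetical) schema.
            items.append({
                "network": futures[fut],
                "mediaUrl": raw["url"],
                "micropost": raw.get("text", ""),
            })
    return sorted(items, key=lambda it: (it["network"], it["mediaUrl"]))


if __name__ == "__main__":
    demo = {"example": lambda term: [{"url": "http://example.org/1.jpg", "text": term}]}
    print(json.dumps(search_all(demo, "elezioni2013", timeout=1.0), indent=2))
```

A fixed window trades completeness for predictable latency: slow networks are simply left out of the answer rather than delaying it.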
https://github.com/tomayac/media-server
Deep link
Permalink
Clean text for NLP processing
Aggregate view of ALL social interactions
12 Social Networks
Media Finder (www2013)
Media Finder (zooming on media items)
Media Finder (timeline view)
Media Finder Architecture
• Media item harvesting using the Media Server
  http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}
  https://github.com/vuknje/media-server (@tomayac fork)
• Near-duplicate image detection
  DCT signature on image and video frames, Hamming distance between image pairs
• Clustering and disambiguation
  Named entity extraction using NERD
  Topic generation using LDA
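The near-duplicate step above (a DCT signature per image, compared by Hamming distance) can be illustrated with a small perceptual-hash style sketch. The details are assumptions rather than Media Finder's actual parameters: the input is a small square grayscale matrix, the signature keeps the low-frequency 4x4 corner of the DCT without the DC term, and bits are set by comparing each coefficient with the median.

```python
import math


def dct_2d(block):
    """Unnormalised 2D DCT-II of a square matrix of pixel intensities."""
    n = len(block)

    def coeff(u, v):
        return sum(
            block[x][y]
            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
            * math.cos((2 * y + 1) * v * math.pi / (2 * n))
            for x in range(n) for y in range(n)
        )

    return [[coeff(u, v) for v in range(n)] for u in range(n)]


def dct_signature(gray, keep=4):
    """Bit signature from the low-frequency coefficients (DC term skipped)."""
    freq = dct_2d(gray)
    low = [freq[u][v] for u in range(keep) for v in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    return [1 if c > median else 0 for c in low]


def hamming(a, b):
    """Number of positions where two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    # Synthetic 8x8 "image" and a uniformly brighter copy of it.
    img = [[(x * 37 + y * 91 + x * y * 13) % 256 for y in range(8)] for x in range(8)]
    brighter = [[p + 10 for p in row] for row in img]
    print(hamming(dct_signature(img), dct_signature(brighter)))
```

Two media items would then be treated as near-duplicates when the distance between their signatures stays below a chosen threshold; the threshold is application-specific and is not given in the slides.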
Media Finder (named entities clustering)
Media Finder (zooming in a cluster)
Media Finder
• Live Topic Generation from Event Streams
Published at WWW 2013 Demo Track
http://www.youtube.com/watch?v=8iRiwz7cDYY
Tracking an event: Italian Election
• Repeated queries over a period of time
  We tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03
  Cron job: every 30 minutes over the 6 days
  Data sliced into 24-hour slots
• Research question:
  Can we re-create the news headlines?
• Storyboarding:
  http://mediafinder.eurecom.fr/story/elezioni2013
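Slicing the harvested posts into 24-hour slots can be sketched as below. The start date comes from the tracking period stated above; the `(timestamp, text)` tuple layout and the function names are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

START = datetime(2013, 2, 26)   # first tracked day
SLOT = timedelta(hours=24)      # slot length


def slot_index(timestamp):
    """0 for the first 24-hour slot, 1 for the second, and so on."""
    return (timestamp - START) // SLOT


def slice_posts(posts):
    """Group (timestamp, text) pairs by 24-hour slot."""
    slots = defaultdict(list)
    for timestamp, text in posts:
        slots[slot_index(timestamp)].append(text)
    return dict(slots)


if __name__ == "__main__":
    posts = [
        (datetime(2013, 2, 26, 0, 30), "first harvest"),
        (datetime(2013, 2, 26, 23, 30), "same slot"),
        (datetime(2013, 3, 3, 23, 30), "last slot"),
    ]
    print(slice_posts(posts))
```

Each slot can then be clustered on its own, which gives one set of topics per day of the event.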
Tracking an event: Italian Election
• Dataset:
  ~16501 microposts containing (duplicate) media items
  ~21087 named entities extracted
• Clustering:
  NER and LDA
  Generate a Bag of Entities (BOE), each disambiguated with a DBpedia URI
• Examples:
  Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
Tracking an event: Italian Election
• Tracking and Analyzing The 2013 Italian Election
Published at ESWC 2013 Demo Track
http://www.youtube.com/watch?v=jIMdnwMoWnk
Searching and browsing TED Talks
GO!
MF: Chapters
“This is Nikita, a security guard from one of the bars in St. Petersburg.”
Example taken from the transcript of https://www.ted.com/talks/2089
Annotations: Named Entities
• NER: a Natural Language Processing (NLP) task
• Category: type in the NER task (PERSON, FUNCTION, LOCATION)
• Disambiguating URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg
MF: Hot Spots
1. Clustering of consecutive chapters that talk about similar topics and entities
2. Ordering of those fragments based on annotation relevance (TF-IDF)
3. Filtering: Hot Spots are fragments whose relative relevance falls within the first quarter of the final score distribution
[Figure: chapters grouped into Hot Spot 1 and Hot Spot 2]
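The three Hot Spot steps (cluster, rank, filter) can be sketched as follows. Several choices are assumptions made for illustration: chapters are represented as lists of entity labels, neighbouring chapters are merged when they share at least one entity, a fragment's relevance is the sum of the TF-IDF weights of its entities, and "first quarter" is read as the top 25% of fragments by score. None of the function names come from the original system.

```python
import math
from collections import Counter


def cluster_consecutive(chapters):
    """Step 1: merge neighbouring chapters that share at least one entity."""
    fragments = [list(chapters[0])]
    for prev, cur in zip(chapters, chapters[1:]):
        if set(prev) & set(cur):
            fragments[-1].extend(cur)
        else:
            fragments.append(list(cur))
    return fragments


def tfidf_scores(fragments):
    """Step 2: relevance of each fragment as the sum of its TF-IDF weights."""
    n = len(fragments)
    df = Counter(entity for frag in fragments for entity in set(frag))
    scores = []
    for frag in fragments:
        tf = Counter(frag)
        scores.append(sum(
            (count / len(frag)) * math.log(n / df[entity])
            for entity, count in tf.items()
        ) if frag else 0.0)
    return scores


def hot_spots(chapters):
    """Step 3: keep the best-scoring quarter of the fragments (at least one)."""
    fragments = cluster_consecutive(chapters)
    scores = tfidf_scores(fragments)
    ranked = sorted(range(len(fragments)), key=lambda i: scores[i], reverse=True)
    keep = max(1, len(ranked) // 4)
    return [fragments[i] for i in ranked[:keep]]


if __name__ == "__main__":
    chapters = [
        ["Nikita", "Saint_Petersburg"], ["Saint_Petersburg", "Russia"],
        ["Music"], ["Art"], ["Art", "Design"], ["Russia"],
    ]
    print(hot_spots(chapters))
```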
Hyperlink: Indexing TED Talks
http://www.slideshare.net/troncy
Linking Entities for Enriching and Structuring Social Media Content

  • 1. Linking Entities for Enriching and Structuring Social Media Content Raphaël Troncy <raphael.troncy@eurecom.fr> @rtroncy
  • 2. 12/04/2016 NLPIT Workshop @ WWW 2016 - 2
  • 3. Extracting and Linking Entities (NER/NEL): “Tampa Bay Lightning vs Canadiens in Montreal tonight with @erikmannens #hockey #NHL” https://www.youtube.com/watch?v=Rmug-PUyIzI
  • 4. Part of Speech (GATE Twitter POS): Tampa/NNP Bay/NNP Lightning/NNP vs/CC Canadiens/NNP in/IN Montreal/NNP tonight/NN with/IN @erikmannens/USR #hockey/HT #NHL/HT. NER: What is NHL? NEL: Which Montreal are we talking about? https://gate.ac.uk/wiki/twitter-postagger.html
  • 5. What is #NHL? Type Ambiguity: Sports League; Organization; Place; Railway Line
  • 6. What is #NHL? Type Ambiguity: http://schema.org/SportsEvent; http://dbpedia.org/ontology/Event; http://schema.org/Organization; http://dbpedia.org/ontology/IceHockeyLeague. Different infobox templates
  • 7. Named Entity Recognition (NER): Tampa/NNP/ORG Bay/NNP/ORG Lightning/NNP/ORG vs/CC/O Canadiens/NNP/ORG in/IN/O Montreal/NNP/LOC tonight/NN/O with/IN/O @erikmannens/USR/PER #hockey/HT/THG #NHL/HT/ORG
  • 8. What is Montreal? Name Ambiguity: Montréal, Ardèche; Montréal, Aude; Montréal, Gers; Montreal, Wisconsin; Mont-ral, Catalonia
  • 9. Named Entity Linking (NEL): Tampa/NNP/ORG Bay/NNP/ORG Lightning/NNP/ORG vs/CC/O Canadiens/NNP/ORG in/IN/O Montreal/NNP/LOC (http://dbpedia.org/resource/Montreal) tonight/NN/O with/IN/O @erikmannens/USR/PER (NIL) #hockey/HT/THG #NHL/HT/ORG
  • 10. NERD: a framework for comparing NER APIs: NER (Stanford CoreNLP); Web APIs. http://nerd.eurecom.fr/
  • 11. NERD: AlchemyAPI: Incorrect boundaries; No disambiguation; No dereferencing for @mention
  • 12. NERD: Dandelion: Everything is a Thing; No dereferencing for @mention
  • 13. NERDML: No dereferencing for @mention
  • 14. Research Questions: (1) How to adapt an entity linking system depending on different criteria? (2) How to design an entity linking system in order to be able to process a large amount of data in near real time?
  • 15. ADEL: Adaptive Framework for NER. POS Tagger: use a bidirectional dependency network; combine CMM left to right and right to left. NER: use CRF with Gibbs sampling (Monte Carlo for approximate inference) to take n words into account instead of only the previous and next one
  • 16. ADEL: Overlap Resolution. Detect overlaps among extractors with the boundaries of the entities. Different heuristics can be applied: Merge (“United States” and “States of America” => “United States of America”), the default behavior; Simple Substring (“Florence” and “Florence May Harding” => “Florence” and “May Harding”); Smart Substring (“Giants of New York” and “New York” => “Giants” and “New York”)
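
The three overlap heuristics can be sketched on character offsets. This is a minimal illustration, not ADEL's code: the span representation, the function names and the `CONNECTORS` word list used for the "smart" variant are all assumptions made for the example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end) into the text

# Hypothetical connector words dropped by the "smart" variant (an assumption).
CONNECTORS = {"of", "the", "in", "de"}


def merge(a: Span, b: Span) -> Span:
    """Merge heuristic: the union of two overlapping spans."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def _trim(text: str, span: Span) -> Span:
    """Shrink a span so that it neither starts nor ends with whitespace."""
    start, end = span
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return (start, end)


def _strip_connectors(text: str, span: Span) -> Span:
    """Trim whitespace, then drop connector words at either edge of the span."""
    start, end = _trim(text, span)
    while start < end:
        words = text[start:end].split()
        if words[-1].lower() in CONNECTORS:
            end = _trim(text, (start, end - len(words[-1])))[1]
        elif words[0].lower() in CONNECTORS:
            start = _trim(text, (start + len(words[0]), end))[0]
        else:
            break
    return (start, end)


def split_contained(text: str, a: Span, b: Span, smart: bool = False) -> List[Span]:
    """Substring heuristics: keep the inner span and what is left of the outer one.

    smart=False is the Simple Substring behavior; smart=True also removes
    connector words from the leftover pieces.
    """
    outer, inner = (a, b) if a[1] - a[0] >= b[1] - b[0] else (b, a)
    if not (outer[0] <= inner[0] and inner[1] <= outer[1]):
        raise ValueError("one span must contain the other")
    clean = _strip_connectors if smart else _trim
    pieces = [inner]
    for rest in ((outer[0], inner[0]), (inner[1], outer[1])):
        rest = clean(text, rest)
        if rest[0] < rest[1]:  # skip empty leftovers
            pieces.append(rest)
    return sorted(pieces)
```

With the text "Giants of New York", `split_contained(text, (0, 18), (10, 18), smart=True)` yields the spans for "Giants" and "New York", while the simple variant keeps "Giants of".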
  • 17. ADEL: KB Indexing. Create index from DBpedia and Wikipedia. Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
  • 18. ADEL: Adaptive Framework for NEL. Generate candidate links for all extracted mentions: if any, they go to the linking method; if not, they are linked to NIL. Linking method: ADEL linear formula (shown as an image on the slide), with: r(l): the score of the candidate l; L: the Levenshtein distance; m: the extracted mention; title: the title of the candidate l; R: the set of redirect pages associated with the candidate l; D: the set of disambiguation pages associated with the candidate l; PR: PageRank associated with the candidate l; a, b and c are weights following the properties a > b > c and a + b + c = 1
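
The formula itself is not part of this transcript, so the sketch below is only one possible reading of the symbols listed on the slide. It assumes a weighted sum of string similarities between the mention and the title, the best redirect and the best disambiguation label, scaled by PageRank. Turning the Levenshtein distance into a normalized similarity, the default weights and the example PageRank values are further assumptions.

```python
from typing import Iterable


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Assumption: normalize the distance to [0, 1] so that higher is better."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s.lower(), t.lower()) / longest


def score(mention: str, title: str, redirects: Iterable[str],
          disambiguations: Iterable[str], pagerank: float,
          a: float = 0.5, b: float = 0.3, c: float = 0.2) -> float:
    """r(l) under the assumed form (a*L(m,title) + b*max L(m,R) + c*max L(m,D)) * PR(l)."""
    if not (a > b > c and abs(a + b + c - 1.0) < 1e-9):
        raise ValueError("weights must satisfy a > b > c and a + b + c = 1")
    best_redirect = max((similarity(mention, r) for r in redirects), default=0.0)
    best_disamb = max((similarity(mention, d) for d in disambiguations), default=0.0)
    return (a * similarity(mention, title) + b * best_redirect + c * best_disamb) * pagerank


# Hypothetical candidates for the mention "Montreal" (PageRank values are made up).
city = score("Montreal", "Montreal", ["Montréal"], ["Montreal (disambiguation)"], 0.9)
town = score("Montreal", "Montreal, Wisconsin", [], ["Montreal (disambiguation)"], 0.1)
```

Under these assumptions the candidate with the exact title match and the higher PageRank receives the higher score, and a mention with no candidates never reaches this function because it is linked to NIL first.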
  • 19. ADEL: Pruning for NER/NEL. k-NN machine learning algorithm. Why a pruning module? (1) Useful to correct the errors from the extractor by removing wrong annotations. Example: “France played against Russia for a friendly match” versus “Yesterday, I went to see Against in concert”. (2) Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges; the first one counts dates as entities, and the second one does not. NEEL challenge: “Jimmy Page was born on January 9th, 1944.” OKE challenge: “Jimmy Page was born on January 9th, 1944.”
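
The slide names k-NN but not the features, so the following is a generic k-nearest-neighbours vote over hypothetical features. The two features (whether the mention is capitalized, and its linking score) and the tiny training set are invented for illustration only.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]  # (feature vector, "keep" or "drop")


def knn_predict(train: List[Example], query: Sequence[float], k: int = 3) -> str:
    """Label a query by majority vote among its k nearest training examples."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: [is_capitalized, linking_score].
TRAIN: List[Example] = [
    ([1.0, 0.9], "keep"), ([1.0, 0.8], "keep"), ([1.0, 0.7], "keep"),
    ([0.0, 0.2], "drop"), ([0.0, 0.1], "drop"), ([0.0, 0.0], "drop"),
]

# "against" in "France played against Russia": lower-case, weak link -> pruned.
against_lower = knn_predict(TRAIN, [0.0, 0.15])
# "Against" in "I went to see Against in concert": capitalized, strong link -> kept.
against_band = knn_predict(TRAIN, [1.0, 0.85])
```

Swapping the training set is what would let the same module follow a different annotation guideline, for instance one that keeps dates and one that drops them.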
  • 20. ADEL Evaluation: #Micropost2014 NEEL Challenge (ADEL v1); #Micropost2015 NEEL Challenge (ADEL v1); #Micropost2016 NEEL Challenge (ADEL v2); OKE2015 Challenge (ADEL v1); OKE2016 Challenge (ADEL v2). F-measure tables, in slide order: [E2E 70.06, UTwente 54.93, DataTXT 49.9, ADEL 46.29, AIDA 45.37, Hyberabad 45.23, SAP 39.02]; [ADEL 60.75, FOX 49.88, FRED 34.73]; [ousia 76.2, acubelab 52.3, ADEL 47.9, uniba 46.4, ualberta 41.5, uva 31.6, cen_neel 0]; [ADEL 78.8]; [ADEL 56.5]
  • 21. ADEL Live Demo
  • 22. Social Media: some definitions. Media Item: a photo or a video that is shared on a social network. Micropost: a text status message that can optionally accompany a media item. Social Network: an online service that focuses on building and reflecting social relationships among people sharing interests or activities. Media Sharing Platforms: emphasis on sharing media, but with blurred boundaries with social networks since users are encouraged to react to media content (like, comment, favorite, etc.)
  • 23. Media Server. Composition of media item extractors (12 SNs). Rely on search APIs + a fixed 30 s timeout window to provide results. Fall back on screen scraping when necessary (Twitter ecosystem). Implemented as a NodeJS server. Serialize results in a common schema (JSON). https://github.com/tomayac/media-server
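
The fan-out with a fixed timeout window can be illustrated as follows. This is a Python sketch of the general pattern, not the NodeJS implementation in the repository: the extractor functions, the result fields and the `example.org` URLs are placeholders.

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict, List

Extractor = Callable[[str], Awaitable[List[dict]]]


async def search_all(extractors: Dict[str, Extractor], term: str,
                     window: float = 30.0) -> Dict[str, List[dict]]:
    """Query every extractor concurrently and keep whatever returns within the window."""
    if not extractors:
        return {}
    tasks = {asyncio.ensure_future(fn(term)): name for name, fn in extractors.items()}
    done, pending = await asyncio.wait(list(tasks), timeout=window)
    for task in pending:  # too slow: give up on these networks
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    results: Dict[str, List[dict]] = {}
    for task in done:
        if task.exception() is None:  # ignore extractors that failed
            results[tasks[task]] = task.result()
    return results


# Hypothetical extractors standing in for real social network search APIs.
async def fast_network(term: str) -> List[dict]:
    return [{"url": f"https://example.org/{term}/1", "type": "photo"}]


async def slow_network(term: str) -> List[dict]:
    await asyncio.sleep(5)  # slower than the window used in the example below
    return []


def search_json(term: str, window: float) -> str:
    """Serialize the combined results as JSON, one key per network."""
    found = asyncio.run(search_all({"fast": fast_network, "slow": slow_network},
                                   term, window))
    return json.dumps(found)
```

Calling `search_json("hockey", 0.2)` returns only the results of the fast extractor, since the slow one is cancelled when the window closes.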
  • 24. Deep link; Permalink; Clean text for NLP processing; Aggregate view of ALL social interactions; 12 Social Networks
  • 25. Media Finder (www2013)
  • 26. Media Finder (zooming on media items)
  • 27. Media Finder (timeline view)
  • 28. Media Finder Architecture. Media items harvesting using the Media Server: http://eventmedia.eurecom.fr/media-server/search/{combined}/{term}, https://github.com/vuknje/media-server (@tomayac fork). Image near de-duplication: DCT signature on image and video frame, Hamming distance between image pairs. Clustering and disambiguation: Named Entity Extraction using NERD, Topic Generation using LDA
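
Near de-duplication with a DCT signature and a Hamming distance can be sketched in the style of a perceptual hash. The details below are assumptions rather than Media Finder's actual parameters: the input is a small square grayscale matrix, the signature keeps the 8x8 low-frequency coefficients minus the DC term, and the distance threshold is arbitrary.

```python
import math
from typing import List

Gray = List[List[float]]  # square grayscale image, already downscaled


def dct_2d(block: Gray) -> Gray:
    """Unnormalized 2D DCT-II of an n x n block (direct O(n^4) evaluation)."""
    n = len(block)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    return [[sum(block[x][y] * cos[u][x] * cos[v][y]
                 for x in range(n) for y in range(n))
             for v in range(n)]
            for u in range(n)]


def signature(gray: Gray, keep: int = 8) -> int:
    """Bit signature: low-frequency coefficients thresholded at their median."""
    coeffs = dct_2d(gray)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)][1:]  # drop DC
    median = sorted(low)[len(low) // 2]
    bits = 0
    for value in low:
        bits = (bits << 1) | int(value > median)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def near_duplicates(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Assumed decision rule: signatures within max_distance bits of each other."""
    return hamming(signature(a), signature(b)) <= max_distance


# Synthetic 16x16 image and a uniformly brightened copy of it.
image = [[float((7 * x + 13 * y + x * y) % 256) for y in range(16)] for x in range(16)]
brighter = [[pixel + 2.0 for pixel in row] for row in image]
```

A uniform brightness change only affects the DC coefficient, which the signature ignores, so `near_duplicates(image, brighter)` is expected to hold for the pair above.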
  • 29. Media Finder (named entities clustering)
  • 30. Media Finder (zooming in a cluster)
  • 31. Media Finder: Live Topic Generation from Event Streams. Published at WWW 2013 Demo Track. http://www.youtube.com/watch?v=8iRiwz7cDYY
  • 32. Tracking an event: Italian Election. Repeated queries over a period of time: we have tracked and analyzed media posts tagged as elezioni2013 from 2013-02-26 to 2013-03-03, with a cron job every 30 minutes over the 6 days, and sliced the data into 24-hour slots. Research question: can we re-create the news headlines? Storyboarding: http://mediafinder.eurecom.fr/story/elezioni2013
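
Slicing the harvested posts into 24-hour slots amounts to integer division of each timestamp's offset from the start of the tracking period. A minimal sketch, assuming each post is a `(timestamp, payload)` pair; the sample posts are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def slice_into_slots(posts: Iterable[Tuple[datetime, str]], start: datetime,
                     hours: int = 24) -> Dict[int, List[str]]:
    """Group posts by slot index, where slot 0 begins at `start`."""
    slot = timedelta(hours=hours)
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in posts:
        if timestamp < start:
            continue  # outside the tracking period
        buckets[(timestamp - start) // slot].append(payload)
    return dict(buckets)


# Hypothetical posts within the tracking period 2013-02-26 to 2013-03-03.
start = datetime(2013, 2, 26)
posts = [
    (datetime(2013, 2, 26, 10, 0), "post A"),
    (datetime(2013, 2, 27, 0, 30), "post B"),
    (datetime(2013, 3, 3, 23, 0), "post C"),
]
slots = slice_into_slots(posts, start)
```

The six tracked days map to slot indices 0 through 5, so each slot can be clustered and summarized on its own.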
  • 33. Tracking an event: Italian Election. Dataset: ~16501 microposts containing (duplicate) media items; ~21087 Named Entities extracted. Clustering: NER and LDA; generate Bag of Entities (BOE) disambiguated with a DBpedia URI. Examples: Monti, Bersani, Italia, Berlusconi, Grillo, Stelle
  • 34. Tracking an event: Italian Election. Tracking and Analyzing The 2013 Italian Election. Published at ESWC 2013 Demo Track. http://www.youtube.com/watch?v=jIMdnwMoWnk
  • 37. Annotations: Named Entities. “This is Nikita, a security guard from one of the bars in St. Petersburg.” (example taken from the transcript of https://www.ted.com/talks/2089). Natural Language Processing (NLP) task. NER labels in the example: PERSON, FUNCTION, LOCATION. Category: type in the NER task. Disambiguating URL in a knowledge base, e.g. http://dbpedia.org/resource/Saint_Petersburg.
  • 38. MF: Hot Spots. (1) Clustering of consecutive chapters which talk about similar topics and entities. (2) Ordering of those fragments based on annotation relevance (TF-IDF). (3) Filtering: Hot Spots are fragments whose relative relevance falls under the first quarter of the final score distribution