Talk given at the VU University Amsterdam, NL - February 6, 2013
Abstract: Since the advent of Linked Data, we have observed a dramatic increase in structured data sources published on the Web. They mainly provide entity-to-entity interconnections, resulting in a Web of Linked Entities, disambiguated through URIs, spanning structured and unstructured data. Several efforts have been made to exploit this mine of information to enhance text understanding by connecting pieces of text to real-world objects, i.e. entities, that are easily discoverable by intelligent agents. The result is a proliferation of different systems for annotating text with "Web" entities.
From this perspective, we have developed a framework for harmonizing access to such systems and their output. The NERD ontology [1] aligns the differences in the annotations and provides a definition for a set of axioms taken from the long-tail distribution of classes common to the extractors used. On top of the NERD ontology, we have developed NERD [2], which implements a combined logic that aims to minimize the annotation error by taking the best, when possible, from these extractors. We have observed that well-known entity classes, such as Person, Location and Organization, are well covered by these extractors, while Event is less so, mainly due to the lack of a definition of, and knowledge about, what events are. As a follow-up to the Eventmedia project [3], we are defining an event spotter which takes advantage of the large event graph described in the Eventmedia dataset [4].
Social platforms are also sources of structured and unstructured data. They constantly record streams of heterogeneous data about people's activities, feelings, emotions and conversations, opening a window on the world in real time. Making sense of these streams is extremely challenging. We are currently investigating the role of named entities as centroids for micropost topic generation, presenting them through visual galleries.
[1] - http://nerd.eurecom.fr/ontology
[2] - http://nerd.eurecom.fr
[3] - http://eventmedia.eurecom.fr
[4] - http://eventmedia.eurecom.fr/sparql
2. Spotting entities while reading a document
➢ Names of People, Locations, Organizations, etc.
➢ Named entities are fundamental keys for topic understanding
➢ But the same location name can refer to different places
source: http://goo.gl/kVzlK
February 6, 2013 VU University Amsterdam, NL 2/22
3. A Web of Linked Entities
➢ GGG (Giant Global Graph) http://goo.gl/fH3h
➢ Nodes are Web entities
➢ Entities provide disambiguation pointers
➢ Entities can be unambiguously referred to (disambiguated)
➢ Entities as centroids for topic generation and understanding
source: http://wole2013.eurecom.fr
source: http://wole2012.eurecom.fr
4. Entity extractors
[Figure: entity extractors, with the labels "Web API" and "URI Disambiguation"]
5. Diversity
Extractor          Language                        Granularity  Entity position  Classification schema          Number of classes  Response format              Quota (calls/day)
AlchemyAPI         EN, FR, DE, IT, PT, RU, SP, SW  OEN          N/A              Alchemy                        324                JSON, MicroFormat, XML, RDF  30000
DBpedia Spotlight  EN                              OEN          char offset      DBpedia, FreeBase, Schema.org  320                HTML, JSON, RDF, XML         unlimited
Extractiv          EN                              OEN          word offset      Extractiv                      34                 HTML, JSON, RDF, XML         3000
Lupedia            EN, FR, IT                      OEN          range of chars   DBpedia, LinkedMDB             319                HTML, JSON, RDFa, XML        unlimited
OpenCalais         EN, FR, SP                      OEN          char offset      OpenCalais                     95                 JSON, MicroFormat            50000
Saplo              EN, SW                          OED          N/A              Saplo                          5                  JSON                         1333
SemiTags           DE, NL                          OED          char offset      CoNLL-3                        4                  XML                          unlimited
Wikimeta           EN, FR, SP                      OEN          POS offset       ESTER                          7                  JSON, XML                    unlimited
Yahoo!             EN                              OEN          range of chars   Yahoo                          13                 JSON, XML                    5000
Zemanta            EN                              OED          N/A              FreeBase                       81                 XML, JSON, RDF               10000
7. NERD Ontology
NERD type       Occurrence
Person          10
Organization    10
Country         6
Company         6
Location        6
Continent       5
City            5
RadioStation    5
Album           5
Product         5
...             ...
The NERD ontology has been integrated into the NIF project, an EU FP7 project in the context of LOD2: Creating Knowledge out of Interlinked Data
9. We can do better: combined (ETAPE 2012)
Pipeline stages: extraction, cleaning, fusion
Output tuples per extractor:
(eA1,tA1,URIA1,siA1,eiA1)
(eA2,tA2,URIA2,siA2,eiA2)
(eA3,tA3,URIA3,siA3,eiA3)
...
(eN1,tN1,URIN1,siN1,eiN1)
(eN2,tN2,URIN2,siN2,eiN2)
➢ When at least 2 extractors classify the same entity with different types, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
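The preferred selection order can be illustrated with a minimal Python sketch. It is not the NERD implementation: the `Annotation` record, the reading of the tuple fields as (entity, type, URI, start index, end index), and the choice to group annotations by identical character span are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    extractor: str  # name of the extractor that produced the tuple
    entity: str     # e: surface form (assumed meaning)
    etype: str      # t: entity type (assumed meaning)
    uri: str        # URI: disambiguation URI
    start: int      # si: start index (assumed meaning)
    end: int        # ei: end index (assumed meaning)


# Preferred selection order from the slide, highest priority first.
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]
RANK = {name: i for i, name in enumerate(PRIORITY)}


def fuse(annotations):
    """Keep one annotation per (start, end) span.

    When several extractors annotate the same span, the one that comes
    first in PRIORITY wins, which also settles any conflict on the type.
    """
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann.start, ann.end)].append(ann)
    fused = []
    for span in sorted(by_span):
        # Extractors missing from PRIORITY rank last.
        best = min(by_span[span], key=lambda a: RANK.get(a.extractor, len(PRIORITY)))
        fused.append(best)
    return fused
```

With two extractors disagreeing on the same span, for example Lupedia saying Location and Wikimeta saying Person, `fuse` keeps the Wikimeta annotation.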
10. … but it introduced systematic errors (ETAPE 2012)
                  SLR (Slot Error Rate)  prec     recall   F1       %correct
alchemyapi        37.71%                 47.95%   5.45%    9.68%    5.45%
lupedia           39.49%                 22.87%   1.56%    2.91%    1.56%
opencalais        37.47%                 41.69%   3.53%    6.49%    3.53%
wikimeta          36.67%                 19.40%   4.25%    6.95%    4.25%
combined (nerd)   86.85%                 35.31%   17.69%   23.44%   17.69%
11. Gazetteers: combined+ (ETAPE 2012)
[Diagram: learned model, POS tagger, created static rules, apply rules, fusion]
(eA1,tA1,URIA1,siA1,eiA1)
(eA2,tA2,URIA2,siA2,eiA2)
...
(eN1,tN1,URIN1,siN1,eiN1)
(e1,t1,URI1,si1,ei1)
➢ Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
12. Over-estimated training model (ETAPE 2012)
            SLR (Slot Error Rate)  prec     recall   F1       %correct
combined    86.85%                 35.31%   17.69%   23.44%   17.69%
combined+   188.81%                15.13%   28.40%   19.45%   28.40%
13. General NER limitations
➢ Performance drops
– with common settings using off-the-shelf models, when annotating corpora which differ from the one used to train the model (empirically, recall drops by ~20%)
– with noisy texts such as transcripts and microposts
➢ Lack of knowledge for particular categories, in particular Event
14. Participation at the #MSM2013 challenge (ongoing)
➢ English Twitter posts
– Train: 2815 posts
– Eval: 1526 posts
➢ Annotation schema: 4 classes
➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings
        prec     recall   F1
LOC     80.12%   57.76%   67.63%
MISC    68.18%   31.51%   43.10%
ORG     83.28%   50.71%   63.04%
PER     79.93%   70.72%   75.04%
4-fold cross validation over the training set - provisional results of the Stanford CRF
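As an illustration of the 4-fold protocol, the Python sketch below splits the 2815 training posts into four held-out folds. The helper `k_fold_indices` is hypothetical, and the use of contiguous, unshuffled folds is an assumption; the slides do not say how the folds were drawn.

```python
def k_fold_indices(n_items, k):
    """Return k (train_indices, held_out_indices) pairs covering range(n_items).

    Folds are contiguous and differ in size by at most one item.
    """
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        held_out = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        folds.append((train, held_out))
        start = stop
    return folds


# 2815 is the size of the training set given on the slide.
folds = k_fold_indices(2815, 4)
```

Each post is held out exactly once, and the model is trained on the remaining three folds before being scored on the held-out one.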
15. Poor performance in spotting events
➢ Exploit the large domain knowledge represented by the Eventmedia dataset (http://eventmedia.eurecom.fr/sparql)
➢ EventSpotter
– Entities classified according to the LODE ontology
– Spotting according to the event name, agents, temporal and geospatial information
– Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
– Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
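The confidence step can be sketched as follows. The slides do not name the similarity measure, so this example assumes a bag-of-words cosine similarity; the function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokens(text_a)), Counter(tokens(text_b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def spotting_confidence(surrounding_text, event_description):
    """Confidence of a spotted event: similarity of its context to the description."""
    return cosine_similarity(surrounding_text, event_description)


# Hypothetical context and candidate event descriptions.
context = "great jazz concert tonight at the festival in Amsterdam"
jazz_event = "Jazz festival concert held in Amsterdam"
football_event = "Football match played in Madrid"
best = max([jazz_event, football_event], key=lambda d: spotting_confidence(context, d))
```

Here `best` is the jazz event, since its description shares more terms with the surrounding text than the football one.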
16. Entities for concept mining
➢ Used to annotate textual data
– news articles, and ...
➢ Video transcripts:
– video segmentation (MediaFragment)
– MediaFragment annotation
– indexing
– topic model generation
➢ Microposts:
– text understanding
– topic model generation
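For the video transcript case, a segment can be addressed with a temporal fragment in the style of the W3C Media Fragments URI syntax (`#t=start,end`, in seconds). The Python sketch below builds such a URI and attaches entity URIs to it; the helper names, the record layout and the example URIs are assumptions, not the format used by the authors.

```python
def _fmt_seconds(t):
    """Format seconds without trailing zeros, e.g. 12.5 -> '12.5', 30 -> '30'."""
    return f"{t:.3f}".rstrip("0").rstrip(".")


def media_fragment_uri(video_uri, start_s, end_s):
    """Return video_uri with a temporal fragment covering [start_s, end_s)."""
    if start_s < 0 or end_s <= start_s:
        raise ValueError("expected 0 <= start_s < end_s")
    return f"{video_uri}#t={_fmt_seconds(start_s)},{_fmt_seconds(end_s)}"


def annotate_fragment(video_uri, start_s, end_s, entity_uris):
    """Pair a media fragment with the entities spotted in its transcript."""
    return {
        "fragment": media_fragment_uri(video_uri, start_s, end_s),
        "entities": sorted(set(entity_uris)),
    }
```

A transcript segment running from 12.5 s to 30 s of `http://example.org/video.mp4` is then addressed as `http://example.org/video.mp4#t=12.5,30`.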
17. Media Fragment Enricher
joint work between University of Southampton and EURECOM
source: http://goo.gl/BMZK3
18. Annotating social streams
➢ Live and fresh breaking news: microposts
➢ Media items, such as pictures and videos, are usually attached to microposts
➢ Grouping microposts:
– Entity labels
– Entity classes
– Latent Dirichlet allocation (LDA)
– Density-based micropost proximity (text similarity, entity similarity, temporal distance)
➢ Create textual storyboards from vox populi
➢ Visually describe the created storyboards
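The micropost proximity used for grouping combines three signals: text similarity, entity similarity and temporal distance. The Python sketch below shows one way to combine them into a single score that a density-based clustering step could consume. The Jaccard measures, the exponential time decay, the weights and the one-hour half-life are all assumptions; the slides do not specify them.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Micropost:
    text: str
    entities: frozenset  # entity URIs or labels attached to the post
    posted_at: datetime


def jaccard(a, b):
    """Jaccard overlap of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximity(p, q, half_life_s=3600.0, weights=(0.4, 0.4, 0.2)):
    """Weighted mix of text, entity and temporal closeness, in [0, 1]."""
    text_sim = jaccard(p.text.lower().split(), q.text.lower().split())
    entity_sim = jaccard(p.entities, q.entities)
    gap_s = abs((p.posted_at - q.posted_at).total_seconds())
    time_sim = 0.5 ** (gap_s / half_life_s)  # halves every half_life_s seconds
    w_text, w_entity, w_time = weights
    return w_text * text_sim + w_entity * entity_sim + w_time * time_sim
```

Two posts with the same words, the same entities and the same timestamp score 1; the score falls as the wording, the entities or the posting times diverge.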
19. Centroids for topic generation
➢ Each cloud represents a topic
➢ A topic is depicted by an entity
➢ Leaves are media items, which visually represent the microposts
➢ Each leaf can belong to many topics
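The structure described above, with an entity as the centroid of each topic and media items as leaves that may appear under several topics, can be sketched in Python as a simple mapping. The input layout (dicts with `entities` and `media` keys) and the function name are hypothetical.

```python
from collections import defaultdict


def group_media_by_entity(microposts):
    """Map each entity (topic centroid) to the media items (leaves) of its posts.

    A media item is listed under every entity mentioned by its micropost,
    so one leaf can belong to many topics.
    """
    topics = defaultdict(list)
    for post in microposts:
        # dict.fromkeys drops repeated entities while keeping their order.
        for entity in dict.fromkeys(post["entities"]):
            for item in post["media"]:
                if item not in topics[entity]:
                    topics[entity].append(item)
    return dict(topics)


# Hypothetical microposts with attached media items.
posts = [
    {"entities": ["Amsterdam", "VU University"], "media": ["pic1.jpg"]},
    {"entities": ["Amsterdam"], "media": ["pic2.jpg", "clip1.mp4"]},
]
topics = group_media_by_entity(posts)
```

In this example `pic1.jpg` is a leaf of both the "Amsterdam" and the "VU University" topics.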
20. Topic storyboard
➢ Visual summary of the topic
➢ Topic is labelled with an entity
➢ A poster picture is displayed according to the relevance of the micropost in the generated topic
➢ If the entity points to a LOD resource, a textual description is displayed
21. Outlook
➢ Modelling heterogeneous data with entities
➢ Linking entities according to the topics extracted from the text
➢ Enhancing topic modelling with the GGG
➢ Providing visual storyboards tailored to the extracted topics
22. Thanks for your time and attention
Agenda:
– Web of Linked Entities (sl. 3)
– Aligning annotations (sl. 6)
– Combining performances of 3rd-party entity extractors (sl. 9)
– Spotting events (sl. 15)
– Annotating MFs and microposts for topic generation (sl. 16)
– Topic storyboard generation (sl. 19)
http://www.slideshare.net/giusepperizzo