Annotating streams of heterogeneous data for topic generation

Giuseppe Rizzo
giuseppe.rizzo@eurecom.fr
@giusepperizzo

Spotting entities while reading a document

➢ Names of People, Locations, Organizations, etc.
➢ Named entities are fundamental keys for topic understanding
➢ But the same location name can refer to different places

source: http://goo.gl/kVzlK

February 6, 2013, VU University Amsterdam, NL

A Web of Linked Entities

➢ GGG (Giant Global Graph): http://goo.gl/fH3h
➢ Nodes are Web entities
➢ Entities provide disambiguation pointers
➢ Entities can be referred to unambiguously (disambiguated)
➢ Entities as centroids for topic generation and understanding

sources: http://wole2013.eurecom.fr, http://wole2012.eurecom.fr
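
A minimal sketch of what a disambiguation pointer buys: two mentions with the same label become distinguishable once each carries a URI. The `Entity` class and the `same_referent` helper are hypothetical names, and the two DBpedia URIs are chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str  # surface form spotted in the text
    type: str   # coarse class, e.g. "Location"
    uri: str    # disambiguation pointer (a node of the Web of entities)


def same_referent(a: Entity, b: Entity) -> bool:
    """Two mentions denote the same thing only if they share a URI."""
    return a.uri == b.uri


# Same label, same class, different places (illustrative URIs).
paris_fr = Entity("Paris", "Location", "http://dbpedia.org/resource/Paris")
paris_tx = Entity("Paris", "Location", "http://dbpedia.org/resource/Paris,_Texas")

print(paris_fr.label == paris_tx.label)   # the labels alone are ambiguous
print(same_referent(paris_fr, paris_tx))  # the URIs tell them apart
```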


Entity extractors

[Figure labels: Web API, URI disambiguation]
Diversity
Extractor          Languages                       Granularity  Entity position  Classification schema          Classes  Response formats             Quota (calls/day)
AlchemyAPI         EN, FR, DE, IT, PT, RU, SP, SW  OEN          N/A              Alchemy                        324      JSON, Microformat, XML, RDF  30000
DBpedia Spotlight  EN                              OEN          char offset      DBpedia, FreeBase, Schema.org  320      HTML, JSON, RDF, XML         unl
Extractiv          EN                              OEN          word offset      Extractiv                      34       HTML, JSON, RDF, XML         3000
Lupedia            EN, FR, IT                      OEN          range of chars   DBpedia, LinkedMDB             319      HTML, JSON, RDFa, XML        unl
OpenCalais         EN, FR, SP                      OEN          char offset      OpenCalais                     95       JSON, Microformat            50000
Saplo              EN, SW                          OED          N/A              Saplo                          5        JSON                         1333
SemiTags           DE, NL                          OED          char offset      CoNLL-3                        4        XML                          unl
Wikimeta           EN, FR, SP                      OEN          POS offset       ESTER                          7        JSON, XML                    unl
Yahoo!             EN                              OEN          range of chars   Yahoo                          13       JSON, XML                    5000
Zemanta            EN                              OED          N/A              FreeBase                       81       XML, JSON, RDF               10000

(unl = unlimited)


Harmonizing annotations

http://nerd.eurecom.fr

➢ ontology: http://nerd.eurecom.fr/ontology
➢ REST API: http://nerd.eurecom.fr/api/application.wadl
➢ UI: http://nerd.eurecom.fr


NERD Ontology

NERD type       Occurrence
Person          10
Organization    10
Country         6
Company         6
Location        6
Continent       5
City            5
RadioStation    5
Album           5
Product         5
...             ...

The NERD ontology has been integrated into the NIF project, an EU FP7 effort in
the context of LOD2: Creating Knowledge out of Interlinked Data
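
Harmonizing annotations amounts to mapping each extractor's own type labels onto one shared vocabulary. The sketch below illustrates that idea with a lookup table. The `nerd:` prefix is shorthand for http://nerd.eurecom.fr/ontology, while the extractor-side type names, the table entries and the `nerd:Thing` fallback are assumptions made for illustration, not the actual alignment.

```python
# Hypothetical alignment table: (extractor, native type) -> shared NERD type.
# The entries are illustrative assumptions, not the published mapping.
NERD_ALIGNMENT = {
    ("alchemyapi", "City"): "nerd:City",
    ("opencalais", "Company"): "nerd:Company",
    ("wikimeta", "PERS"): "nerd:Person",
    ("lupedia", "Organisation"): "nerd:Organization",
}


def to_nerd_type(extractor: str, native_type: str, default: str = "nerd:Thing") -> str:
    """Return the shared type for an extractor-specific type.

    Unknown pairs fall back to `default` (an assumed catch-all class).
    """
    return NERD_ALIGNMENT.get((extractor.lower(), native_type), default)


print(to_nerd_type("Wikimeta", "PERS"))
print(to_nerd_type("saplo", "SomethingElse"))
```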


ETAPE2012

➢ DGA (French radio transcripts)
  – Train: 7h 50m
  – Dev: 3h
  – Eval: 3h
➢ ELDA (French TV transcripts)
  – Train: 18h 10m
  – Dev: 7h 55m
  – Eval: 7h 55m
➢ Annotation schema Quaero: 32 classes


We can do better: combined (ETAPE 2012)

➢ extraction: (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), (eA3,tA3,URIA3,siA3,eiA3), ...
➢ cleaning
➢ fusion: (eN1,tN1,URIN1,siN1,eiN1), (eN2,tN2,URIN2,siN2,eiN2)

When at least 2 extractors classify the same entity with a different type,
we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI,
OpenCalais, Lupedia
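
The fusion rule can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple fields are read as (entity, type, URI, start index, end index), two annotations are taken to describe "the same entity" when they cover the same character span, and `Annotation` and `fuse` are hypothetical names. Only the priority order comes from the slide.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional

# Preferred selection order, highest priority first (from the slide).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


class Annotation(NamedTuple):
    extractor: str
    entity: str
    type: str
    uri: Optional[str]
    start: int
    end: int


def fuse(annotations: List[Annotation]) -> List[Annotation]:
    """Keep one annotation per span, chosen by extractor priority."""
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann.start, ann.end)].append(ann)  # assumption: same span = same entity
    fused = []
    for span in sorted(by_span):
        # The extractor that comes first in PRIORITY wins the conflict.
        fused.append(min(by_span[span], key=lambda a: PRIORITY.index(a.extractor)))
    return fused


conflicting = [
    Annotation("lupedia", "Renault", "Person", None, 10, 17),
    Annotation("wikimeta", "Renault", "Organization", None, 10, 17),
    Annotation("opencalais", "Paris", "City", None, 30, 35),
]
for ann in fuse(conflicting):
    print(ann.extractor, ann.entity, ann.type)
```

An extractor name missing from `PRIORITY` raises `ValueError` in this sketch; a fuller version would need a policy for unknown sources.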

… but it introduced systematic errors (ETAPE 2012)

SLR = Slot Error Rate

                  SLR      prec     recall   F1       %correct
alchemyapi        37.71%   47.95%   5.45%    9.68%    5.45%
lupedia           39.49%   22.87%   1.56%    2.91%    1.56%
opencalais        37.47%   41.69%   3.53%    6.49%    3.53%
wikimeta          36.67%   19.40%   4.25%    6.95%    4.25%
combined (nerd)   86.85%   35.31%   17.69%   23.44%   17.69%


Gazetteers: combined+ (ETAPE 2012)

➢ Extractor outputs: (eA1,tA1,URIA1,siA1,eiA1), (eA2,tA2,URIA2,siA2,eiA2), ...
➢ Learned model, POS tagger
➢ Created static rules, apply rules: (e1,t1,URI1,si1,ei1)
➢ fusion: (eN1,tN1,URIN1,siN1,eiN1)
➢ Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI,
  OpenCalais, Lupedia


Over-estimated training model (ETAPE 2012)

SLR = Slot Error Rate

                  SLR       prec     recall   F1       %correct
combined          86.85%    35.31%   17.69%   23.44%   17.69%
combined+         188.81%   15.13%   28.40%   19.45%   28.40%
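
A slot error rate above 100% is possible because spurious slots (insertions) count as errors while the denominator is only the number of reference slots. The sketch below uses the common unweighted definition, (substitutions + deletions + insertions) / reference slots; the exact weighting applied in the ETAPE evaluation is not given here, so treat this as an assumption. All counts are made-up numbers for illustration.

```python
def slot_error_rate(substitutions: int, deletions: int, insertions: int,
                    reference_slots: int) -> float:
    """Unweighted slot error rate: all error kinds over reference slots."""
    if reference_slots <= 0:
        raise ValueError("reference_slots must be positive")
    return (substitutions + deletions + insertions) / reference_slots


def precision_recall_f1(correct: int, hypothesized: int, reference: int):
    """Standard precision, recall and F1 from slot counts."""
    precision = correct / hypothesized if hypothesized else 0.0
    recall = correct / reference if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up counts: 100 reference slots, 200 hypothesized slots.
correct, substitutions, deletions, insertions = 30, 10, 60, 160
reference = correct + substitutions + deletions      # 100
hypothesized = correct + substitutions + insertions  # 200

ser = slot_error_rate(substitutions, deletions, insertions, reference)
p, r, f1 = precision_recall_f1(correct, hypothesized, reference)
print(f"SER={ser:.2%} prec={p:.2%} recall={r:.2%} F1={f1:.2%}")
```

With these counts recall rises while the many insertions push the error rate past 100%, the same pattern as an over-generating model.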


General NER limitations

➢ Performance drops
  – with common settings using off-the-shelf models, when annotating corpora
    that differ from the data the models were trained on (empirically, recall
    drops by ~20%)
  – with noisy texts such as transcripts and microposts
➢ Lack of knowledge for particular categories, in particular Event


Participation at the #MSM2013 challenge (ongoing)

➢ English Twitter posts
  – Train: 2815 posts
  – Eval: 1526 posts
➢ Annotation schema: 4 classes
➢ Objective: perform better than the Stanford CRF, properly trained with the
  challenge settings

        prec     recall   F1
LOC     80.12%   57.76%   67.63%
MISC    68.18%   31.51%   43.10%
ORG     83.28%   50.71%   63.04%
PER     79.93%   70.72%   75.04%

4-fold cross validation over training: provisional results of the Stanford CRF
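
For readers unfamiliar with the protocol, a 4-fold cross validation over the 2815 training posts can be sketched as below. The helper name `k_fold_indices` is hypothetical, and the contiguous, unshuffled folds are an assumption; the slides do not say how the folds were drawn.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, held_out_indices) for each of k contiguous folds."""
    if not 1 < k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        held_out = range(start, start + size)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, list(held_out)
        start += size


folds = list(k_fold_indices(2815, 4))
for train, held_out in folds:
    print(len(train), len(held_out))
```

Each post is held out exactly once, so the per-class scores are averaged over four models that never saw their own test fold.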


Poor performance in spotting events

➢ Exploit the large domain knowledge represented by the Eventmedia dataset
  (http://eventmedia.eurecom.fr/sparql)
➢ EventSpotter
  – Entities classified according to the LODE ontology
  – Spotting according to the event name, agents, temporal and geospatial
    information
  – Confidence computed according to the similarity between the text
    surrounding the spotted entity and the event description
  – Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
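
The confidence step compares the text around a spotted entity with the event description. The slides do not name the similarity measure, so the sketch below swaps in a plain bag-of-words cosine similarity as a stand-in. The function names and the sample strings are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a deliberately naive tokenizer."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between two word-count vectors, in [0, 1]."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


# Hypothetical inputs: text around the spotted mention, and an event description.
context = "tickets for the jazz festival in Montreux this summer"
description = "annual jazz festival held in Montreux every summer"
print(round(cosine_similarity(context, description), 3))
```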


Entities for concept mining

➢ Used to annotate textual data
  – news articles, and ...
➢ Video transcripts:
  – video segmentation (MediaFragment)
  – MediaFragment annotation
  – indexing
  – topic model generation
➢ Microposts:
  – text understanding
  – topic model generation


Media Fragment Enricher

joint work between University of Southampton and EURECOM
source: http://goo.gl/BMZK3

Annotating social streams

➢ Live and fresh breaking news: microposts
➢ Media items, such as pictures and videos, are usually attached to microposts
➢ Grouping microposts:
  – Entity labels
  – Entity classes
  – Latent Dirichlet allocation (LDA)
  – Density-based micropost proximity (text similarity, entity similarity,
    temporal distance)
➢ Create textual storyboards from vox populi
➢ Describe visually the created storyboards
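
The proximity-based grouping can be sketched by combining the three signals into one distance and linking posts that fall within a threshold. This is a simplified stand-in (threshold-linked groups, not a full density-based algorithm such as DBSCAN). The equal weights, the one-hour time scale, the `eps` value and all names are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass
class Micropost:
    text: str
    entities: FrozenSet[str]
    timestamp: float  # seconds


def jaccard(a, b) -> float:
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def distance(p: Micropost, q: Micropost, time_scale: float = 3600.0) -> float:
    """Equal-weight mix of text, entity and temporal distance, in [0, 1]."""
    text_d = 1 - jaccard(p.text.lower().split(), q.text.lower().split())
    entity_d = 1 - jaccard(p.entities, q.entities)
    time_d = min(abs(p.timestamp - q.timestamp) / time_scale, 1.0)
    return (text_d + entity_d + time_d) / 3


def group(posts: List[Micropost], eps: float = 0.5) -> List[List[int]]:
    """Link posts closer than eps and return the resulting index groups."""
    unvisited = set(range(len(posts)))
    groups = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if distance(posts[i], posts[j]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        groups.append(sorted(component))
    return sorted(groups)


posts = [
    Micropost("obama speaks in berlin", frozenset({"Obama", "Berlin"}), 0),
    Micropost("obama in berlin today", frozenset({"Obama", "Berlin"}), 600),
    Micropost("new iphone released", frozenset({"iPhone"}), 100000),
]
print(group(posts))
```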

Centroids for topic generation

➢ Each cloud represents a topic
➢ A topic is depicted by an entity
➢ Leaves are media items, which visually represent the microposts
➢ Each leaf can belong to many topics


Topic storyboard

➢ Visual summary of the topic
➢ Topic is labelled with an entity
➢ A poster picture is displayed according to the relevance of the micropost
  in the generated topic
➢ If the entity points to a LOD resource, a textual description is displayed
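
The storyboard rules can be read as a small selection procedure, sketched below under assumptions: each micropost is reduced to a (relevance, media URL) pair, the poster is the media item of the most relevant post that has one, and `describe` is a caller-supplied function that fetches a description for a URI. All names and sample values are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def build_storyboard(entity_label: str,
                     entity_uri: Optional[str],
                     posts: List[Tuple[float, Optional[str]]],
                     describe: Callable[[str], str]) -> dict:
    """Assemble label, poster picture and optional description for a topic."""
    with_media = [post for post in posts if post[1] is not None]
    # Poster: media item attached to the most relevant micropost.
    poster = max(with_media, key=lambda post: post[0])[1] if with_media else None
    # Description only when the entity points to a resource.
    description = describe(entity_uri) if entity_uri else None
    return {"label": entity_label, "poster": poster, "description": description}


def fake_describe(uri: str) -> str:
    """Stand-in for a lookup against a Linked Data source."""
    return f"description of {uri}"


board = build_storyboard(
    "Berlin",
    "http://dbpedia.org/resource/Berlin",
    [(0.2, "a.jpg"), (0.9, "b.jpg"), (0.95, None)],
    fake_describe,
)
print(board)
```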

Outlook

➢ Modelling heterogeneous data with entities
➢ Linking entities according to the topics extracted from the text
➢ Enhancing topic modelling with the GGG
➢ Providing visual storyboards tailored with the extracted topics


Thanks for your time and attention

Agenda:
– Web of Linked Entities
– Aligning annotations
– Combining performances of 3rd-party entity extractors
– Spotting events
– Annotating MFs and microposts for topic generation
– Topic storyboard generation

http://www.slideshare.net/giusepperizzo

