Annotating streams of
heterogeneous data for topic
        generation
          Giuseppe Rizzo
        giuseppe.rizzo@eurecom.fr
             @giusepperizzo
Spotting entities while reading a
  document

 ➢
     Names of people,
      locations,
      organizations,
      etc.
 ➢
     Named entities are
      fundamental keys
      for topic
      understanding
 ➢
     But the same
      location can refer
      to different places

  source: http://goo.gl/kVzlK


February 6, 2013        VU University Amsterdam, NL   2/22
A Web of Linked Entities

 ➢
     GGG (global giant graph)
      http://goo.gl/fH3h
 ➢
     Nodes are Web entities
 ➢
     Entities provide
      disambiguation pointers
 ➢
     Entities can be referred to
      unambiguously (disambiguated)
 ➢
     Entities as centroids for topic
      generation and understanding

  source: http://wole2013.eurecom.fr
  source: http://wole2012.eurecom.fr




Entity extractors

  [Figure: entity extractors, with labels "Web API" and "URI disambiguation"]
Diversity

Extractor          Language                 Granularity  Entity position  Classification schema          Classes  Response format              Quota (calls/day)
AlchemyAPI         EN,FR,DE,IT,PT,RU,SP,SW  OEN          N/A              Alchemy                        324      JSON, MicroFormat, XML, RDF  30000
DBpedia Spotlight  EN                       OEN          char offset      DBpedia, FreeBase, Schema.org  320      HTML, JSON, RDF, XML         unlimited
Extractiv          EN                       OEN          word offset      Extractiv                      34       HTML, JSON, RDF, XML         3000
Lupedia            EN,FR,IT                 OEN          range of chars   DBpedia, LinkedMDB             319      HTML, JSON, RDFa, XML        unlimited
OpenCalais         EN,FR,SP                 OEN          char offset      OpenCalais                     95       JSON, MicroFormat            50000
Saplo              EN,SW                    OED          N/A              Saplo                          5        JSON                         1333
SemiTags           DE,NL                    OED          char offset      CoNLL-3                        4        XML                          unlimited
Wikimeta           EN,FR,SP                 OEN          POS offset       ESTER                          7        JSON, XML                    unlimited
Yahoo!             EN                       OEN          range of chars   Yahoo                          13       JSON, XML                    5000
Zemanta            EN                       OED          N/A              FreeBase                       81       XML, JSON, RDF               10000

Harmonizing annotations

  http://nerd.eurecom.fr

                    – ontology [1]
                    – REST API [2]
                    – UI [3]

  [1] http://nerd.eurecom.fr/ontology
  [2] http://nerd.eurecom.fr/api/application.wadl
  [3] http://nerd.eurecom.fr


NERD Ontology

                                                                  NERD type      Occurrence
                                                                  Person                 10
                                                                  Organization           10
                                                                  Country                  6
                                                                  Company                  6
                                                                  Location                 6
                                                                  Continent                5
                                                                  City                     5
                                                                  RadioStation             5
                                                                  Album                    5
                                                                  Product                  5
                                                                  ...                     ...




                    The NERD ontology has been integrated into the NIF project, developed in
                    the context of the EU FP7 project LOD2: Creating Knowledge out of Interlinked Data
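
A minimal sketch of what aligning extractor-specific types to a shared ontology can look like. The NERD type names (Person, Company, Location, Organization) come from the table above; the extractor names, their labels, the fallback type "Thing" and the function name are hypothetical and only serve the illustration. The actual alignments are defined by the NERD ontology itself.

```python
# Hypothetical alignment table: (extractor, extractor-specific label) -> NERD type.
# The keys are invented for illustration; only the NERD type names on the
# right-hand side appear in the slide.
TYPE_ALIGNMENT = {
    ("extractor_a", "PERSON"): "Person",
    ("extractor_a", "COMPANY"): "Company",
    ("extractor_b", "Place"): "Location",
    ("extractor_b", "Org"): "Organization",
}


def to_nerd_type(extractor, label, default="Thing"):
    """Return the shared type for an extractor-specific label.

    Unknown labels fall back to a generic type (an assumption of this sketch).
    """
    return TYPE_ALIGNMENT.get((extractor, label), default)


aligned = to_nerd_type("extractor_b", "Place")
unknown = to_nerd_type("extractor_a", "Spaceship")
```

Once every extractor's output is expressed in the same type vocabulary, annotations from different extractors can be compared and merged directly.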

ETAPE2012
 ➢
     DGA (French radio transcripts)
                    – Train: 7h 50m
                    – Dev: 3h
                    – Eval: 3h
 ➢
     ELDA (French TV transcripts)
                    – Train: 18h 10m
                    – Dev: 7h 55m
                    – Eval: 7h 55m
 ➢
     Annotation schema Quaero: 32 classes


We can do better: combined (ETAPE 2012)

 Pipeline: extraction, cleaning, fusion

 Output of extractor A:
 (eA1,tA1,URIA1,siA1,eiA1)
 (eA2,tA2,URIA2,siA2,eiA2)
 (eA3,tA3,URIA3,siA3,eiA3)
 ...
 Output of extractor N:
 (eN1,tN1,URIN1,siN1,eiN1)
 (eN2,tN2,URIN2,siN2,eiN2)

 Fusion: when at least 2 extractors classify the same entity with a
 different type, we apply a preferred selection order (learning
 rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
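
The fusion rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the tuple fields are read as (entity, type, URI, start index, end index), two annotations are treated as "the same entity" when they cover the same character span, and the class and function names are hypothetical. Only the priority order is taken from the slide.

```python
from typing import NamedTuple, Optional

# Preferred selection order stated on the slide (highest priority first).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


class Annotation(NamedTuple):
    extractor: str       # which extractor produced the annotation
    entity: str          # e: surface form (assumed reading of the tuple)
    type: str            # t: assigned type
    uri: Optional[str]   # URI: disambiguation URI, if any
    start: int           # si: start index in the text (assumed)
    end: int             # ei: end index in the text (assumed)


def fuse(annotations):
    """Keep one annotation per (start, end) span, chosen by extractor priority."""
    rank = {name: i for i, name in enumerate(PRIORITY)}
    best = {}
    for ann in annotations:
        key = (ann.start, ann.end)
        current = best.get(key)
        # Extractors missing from PRIORITY rank last.
        ann_rank = rank.get(ann.extractor, len(PRIORITY))
        if current is None or ann_rank < rank.get(current.extractor, len(PRIORITY)):
            best[key] = ann
    return sorted(best.values(), key=lambda a: a.start)


annotations = [
    Annotation("lupedia", "Paris", "City", None, 0, 5),
    Annotation("wikimeta", "Paris", "Location", None, 0, 5),
    Annotation("opencalais", "Eurecom", "Organization", None, 10, 17),
]
fused = fuse(annotations)
```

In this example the two conflicting annotations for the span (0, 5) are resolved in favour of Wikimeta, while the OpenCalais annotation is kept because no other extractor covers its span.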
… but it introduced systematic errors (ETAPE 2012)

                  SER (Slot Error Rate)   prec      recall    F1        %correct
alchemyapi        37.71%                  47.95%    5.45%     9.68%     5.45%
lupedia           39.49%                  22.87%    1.56%     2.91%     1.56%
opencalais        37.47%                  41.69%    3.53%     6.49%     3.53%
wikimeta          36.67%                  19.40%    4.25%     6.95%     4.25%
combined (nerd)   86.85%                  35.31%    17.69%    23.44%    17.69%



Gazetteers: combined+ (ETAPE 2012)

 Own extractor: learned model, POS tagger, created static rules, apply rules
 (e1,t1,URI1,si1,ei1)

 Third-party extractors:
 (eA1,tA1,URIA1,siA1,eiA1)
 (eA2,tA2,URIA2,siA2,eiA2)
 ...
 (eN1,tN1,URIN1,siN1,eiN1)

 Fusion: conflicts handled by priority selection: own,
 Wikimeta, AlchemyAPI, OpenCalais, Lupedia

Over-estimated training model (ETAPE 2012)

              SER (Slot Error Rate)   prec      recall    F1        %correct
 combined     86.85%                  35.31%    17.69%    23.44%    17.69%
 combined+    188.81%                 15.13%    28.40%    19.45%    28.40%
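
A slot error rate above 100% is possible because, in its common definition, insertions count as errors while the denominator is the number of reference slots only. The sketch below uses that basic, unweighted definition; the weighting actually applied in the ETAPE evaluation may differ, and all counts are hypothetical.

```python
def slot_error_rate(substitutions, deletions, insertions, reference_slots):
    """Basic slot error rate: all errors divided by the number of reference slots."""
    return (substitutions + deletions + insertions) / reference_slots


# Hypothetical counts for a reference with 100 entity slots.
# A cautious system misses most entities but rarely invents any.
cautious = slot_error_rate(substitutions=10, deletions=70, insertions=5,
                           reference_slots=100)

# A system that over-generates finds more entities (fewer deletions)
# but adds many spurious ones, pushing the rate above 1.0 (100%).
over_generating = slot_error_rate(substitutions=20, deletions=50, insertions=120,
                                  reference_slots=100)
```

Under this reading, higher recall together with a much higher error rate, as in the combined+ row, is consistent with a system that produces many spurious annotations.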




General NER limitations

    ➢
         Performance drops
                    – with common settings using off-the-shelf
                       models, when annotating corpora that
                       differ from the training data (empirically,
                       recall drops by ~20%)
                    – with noisy texts such as transcripts and microposts
    ➢
         Lack of knowledge for particular
         categories, in particular Event




Participation at the #MSM2013
  challenge (ongoing)
 ➢
     English Twitter posts
              – Train: 2815 posts
              – Eval: 1526 posts
 ➢
     Annotation schema: 4 classes
 ➢
     Objective: perform better than the Stanford CRF,
     properly trained with the challenge settings

                    Provisional results of the Stanford CRF
                    (4-fold cross validation over the training set)

                                       prec      recall    F1
                         LOC         80.12%    57.76%    67.63%
                         MISC        68.18%    31.51%    43.10%
                         ORG         83.28%    50.71%    63.04%
                         PER         79.93%    70.72%    75.04%
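
For readers unfamiliar with the protocol, 4-fold cross validation splits the training posts into four parts and, in turn, holds out each part for testing while training on the other three. The sketch below only shows the index bookkeeping; the function name is hypothetical, the folds are contiguous, and any shuffling or stratification used in the actual experiment is not known from the slide.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds of near-equal size."""
    # The first (n_items % k) folds receive one extra item.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# 2815 training posts, as reported on the slide, split into 4 folds.
folds = list(k_fold_indices(2815, 4))
test_sizes = [len(test) for _, test in folds]
```

Each post appears in exactly one test fold, so the per-class scores can be averaged over the four runs.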

Poor performance at spotting
  events
 ➢
       Exploit large domain knowledge
       represented by the Eventmedia dataset [1]
 ➢
       EventSpotter
                    – Entities classified according to the LODE ontology
                    – Spotting according to the event name, agents,
                        temporal and geospatial information
                    – Confidence computed according to the similarity
                        between the text surrounding the spotted entity
                        and the event description
                    – Disambiguation provided by the event URIs (nodes
                        of the Eventmedia graph)

   [1] http://eventmedia.eurecom.fr/sparql
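
One simple way to turn "similarity between the surrounding text and the event description" into a confidence score is the cosine similarity of bag-of-words vectors. The slide does not say which similarity measure EventSpotter uses, so the measure, the function names and the example texts below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case word counts of a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts' word-count vectors, in [0, 1]."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


# Hypothetical text around a spotted entity and a hypothetical event description.
context = "Seeing Radiohead live at the Paradiso in Amsterdam tonight"
event_description = "Radiohead concert at Paradiso, Amsterdam"
confidence = cosine_similarity(context, event_description)
```

A score close to 1 means the context shares most of its vocabulary with the event description; a score of 0 means they share no words.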

Entities for concept mining
 ➢
     Used to annotate textual data
                    – news articles, and ...
 ➢
     Video transcripts:
                    –   video segmentation (MediaFragment)
                    –   MediaFragment annotation
                    –   indexing
                    –   topic model generation
 ➢
     Microposts:
                    – text understanding
                    – topic model generation

Media Fragment Enricher




  joint work between University of Southampton and EURECOM
  source: http://goo.gl/BMZK3
Annotating social streams
 ➢
     Live and fresh breaking news: microposts
 ➢
     Media items, such as pictures and videos,
        are usually attached to microposts
 ➢
     Grouping microposts:
                –   Entity labels
                –   Entity classes
                –   Latent Dirichlet allocation (LDA)
                –   Density based micropost proximity (text similarity,
                      entity similarity, temporal distance)
 ➢
     Create textual storyboards from vox populi
 ➢
     Describe visually the created storyboards
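
The micropost proximity listed above combines three signals: text similarity, entity similarity and temporal distance. The slide does not say how they are combined, so the sketch below assumes Jaccard overlap for the two similarities, an exponential decay for time and an equally weighted sum. All names, weights and the one-hour half-life are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Micropost:
    tokens: frozenset     # words of the post
    entities: frozenset   # entity labels spotted in the post
    timestamp: float      # publication time in seconds


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def proximity(p, q, half_life=3600.0, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of text, entity and temporal closeness, in [0, 1]."""
    text_sim = jaccard(p.tokens, q.tokens)
    entity_sim = jaccard(p.entities, q.entities)
    # Temporal closeness halves every `half_life` seconds of distance.
    temporal_sim = 0.5 ** (abs(p.timestamp - q.timestamp) / half_life)
    w_text, w_entity, w_time = weights
    return w_text * text_sim + w_entity * entity_sim + w_time * temporal_sim


p = Micropost(frozenset({"flood", "in", "venice"}), frozenset({"Venice"}), 0.0)
q = Micropost(frozenset({"flood", "in", "venice"}), frozenset({"Venice"}), 3600.0)
r = Micropost(frozenset({"football", "final"}), frozenset({"Wembley"}), 7200.0)
```

A density-based clustering step could then use `1 - proximity(p, q)` as the distance between posts, so that posts like `p` and `q` end up in the same group while `r` does not.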
Centroids for topic
generation
 ➢
     Each cloud represents
     a topic
 ➢
     A topic is depicted by
     an entity
 ➢
     Leaves are media items,
     which visually
     represent the
     microposts
 ➢
     Each leaf can belong
     to many topics
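
The structure described above (a topic labelled by an entity, with media items as leaves that may hang under several topics) can be sketched as a mapping from entity labels to media items. The input format, file names and function name are hypothetical.

```python
from collections import defaultdict


def build_topics(microposts):
    """Map each entity label to the media items of the microposts mentioning it."""
    topics = defaultdict(list)
    for post in microposts:
        for entity in post["entities"]:
            # The same media item is attached to every topic its post mentions.
            topics[entity].extend(post["media"])
    return dict(topics)


# Hypothetical microposts, already annotated with entity labels.
microposts = [
    {"entities": ["Amsterdam", "VU University"], "media": ["img1.jpg"]},
    {"entities": ["Amsterdam"], "media": ["img2.jpg", "clip.mp4"]},
]
topics = build_topics(microposts)
```

Here `img1.jpg` is a leaf of both the "Amsterdam" and the "VU University" topics, which matches the point that a leaf can belong to many topics.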


Topic storyboard
 ➢
     Visual summary of the
     topic
 ➢
     Topic is labelled with an
      entity
 ➢
     A poster picture is
     displayed according to
     the relevance of the
     micropost in the
     generated topic
 ➢
     If the entity points to a
      LOD resource, a
      textual description is
      displayed
Outlook
 ➢
     Modelling heterogeneous data with
     entities
 ➢
     Linking entities according to the topics
     extracted from the text
 ➢
     Enhancing topic modelling with the GGG
 ➢
     Providing visual storyboards tailored
     with the extracted topics




Thanks for your time and attention

                    Agenda:
                        – Web of Linked Entities (sl. 3)
                        – Aligning annotations (sl. 6)
                        – Combining performances of 3rd-
                            party entity extractors (sl. 9)
                        – Spotting events (sl. 15)
                        – Annotating MFs and microposts for
                            topic generation (sl. 16)
                        – Topic storyboard generation (sl. 19)


                             http://www.slideshare.net/giusepperizzo



More Related Content

Viewers also liked

Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Raphael Troncy
 
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...LinkedTV
 
Video Hyperlinking Tutorial (Part A)
Video Hyperlinking Tutorial (Part A)Video Hyperlinking Tutorial (Part A)
Video Hyperlinking Tutorial (Part A)LinkedTV
 
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...Survey of Semantic Media Annotation Tools - towards New Media Applications wi...
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...LinkedTV
 
LinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providersLinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providersLinkedTV
 
LinkedTV - Crossmedia beim rbb
LinkedTV - Crossmedia beim rbbLinkedTV - Crossmedia beim rbb
LinkedTV - Crossmedia beim rbbNico_deAbreu
 
NoTube project results. Bringing TV and Web together.
NoTube project results. Bringing TV and Web together. NoTube project results. Bringing TV and Web together.
NoTube project results. Bringing TV and Web together. MODUL Technology GmbH
 
MVNO Consulting Services
MVNO Consulting ServicesMVNO Consulting Services
MVNO Consulting ServicesYOZZO
 
Thailand's Telecom Market end of 2015 ★
Thailand's Telecom Market end of 2015 ★Thailand's Telecom Market end of 2015 ★
Thailand's Telecom Market end of 2015 ★YOZZO
 
Binde27nasioanal
Binde27nasioanalBinde27nasioanal
Binde27nasioanalepaper
 
The entertainers ch 3 and ch 7
The entertainers   ch 3 and ch 7The entertainers   ch 3 and ch 7
The entertainers ch 3 and ch 7Jill Falk
 
1009nas
1009nas1009nas
1009nasepaper
 
The SentiME System at the SSA Challenge Task 1
The SentiME System at the SSA Challenge Task 1The SentiME System at the SSA Challenge Task 1
The SentiME System at the SSA Challenge Task 1Giuseppe Rizzo
 
Edisi Medan04
Edisi Medan04Edisi Medan04
Edisi Medan04epaper
 
Waspada Aceh 21 Agustus 2009
Waspada Aceh 21 Agustus 2009 Waspada Aceh 21 Agustus 2009
Waspada Aceh 21 Agustus 2009 epaper
 
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAK
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAKPROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAK
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAKGung Thata
 

Viewers also liked (19)

Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...Semantics at the multimedia fragment level or how enabling the remixing of on...
Semantics at the multimedia fragment level or how enabling the remixing of on...
 
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...
Remixing Media on the Semantic Web (ISWC 2014 Tutorial) Pt 1 Media Fragment S...
 
Video Hyperlinking Tutorial (Part A)
Video Hyperlinking Tutorial (Part A)Video Hyperlinking Tutorial (Part A)
Video Hyperlinking Tutorial (Part A)
 
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...Survey of Semantic Media Annotation Tools - towards New Media Applications wi...
Survey of Semantic Media Annotation Tools - towards New Media Applications wi...
 
LinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providersLinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providers
 
LinkedTV - Crossmedia beim rbb
LinkedTV - Crossmedia beim rbbLinkedTV - Crossmedia beim rbb
LinkedTV - Crossmedia beim rbb
 
NoTube project results. Bringing TV and Web together.
NoTube project results. Bringing TV and Web together. NoTube project results. Bringing TV and Web together.
NoTube project results. Bringing TV and Web together.
 
News Semantic Snapshot
News Semantic SnapshotNews Semantic Snapshot
News Semantic Snapshot
 
HbbTV Introduction
HbbTV IntroductionHbbTV Introduction
HbbTV Introduction
 
MVNO Consulting Services
MVNO Consulting ServicesMVNO Consulting Services
MVNO Consulting Services
 
Thailand's Telecom Market end of 2015 ★
Thailand's Telecom Market end of 2015 ★Thailand's Telecom Market end of 2015 ★
Thailand's Telecom Market end of 2015 ★
 
Binde27nasioanal
Binde27nasioanalBinde27nasioanal
Binde27nasioanal
 
The entertainers ch 3 and ch 7
The entertainers   ch 3 and ch 7The entertainers   ch 3 and ch 7
The entertainers ch 3 and ch 7
 
Rm3-A Device
Rm3-A DeviceRm3-A Device
Rm3-A Device
 
1009nas
1009nas1009nas
1009nas
 
The SentiME System at the SSA Challenge Task 1
The SentiME System at the SSA Challenge Task 1The SentiME System at the SSA Challenge Task 1
The SentiME System at the SSA Challenge Task 1
 
Edisi Medan04
Edisi Medan04Edisi Medan04
Edisi Medan04
 
Waspada Aceh 21 Agustus 2009
Waspada Aceh 21 Agustus 2009 Waspada Aceh 21 Agustus 2009
Waspada Aceh 21 Agustus 2009
 
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAK
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAKPROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAK
PROSES DAN TEKNIK PENDEKATAN DALAM PERANCANGAN TATA LETAK
 

More from Giuseppe Rizzo

Artificial intelligence for social good
Artificial intelligence for social goodArtificial intelligence for social good
Artificial intelligence for social goodGiuseppe Rizzo
 
COMPRENDE, PERSONALIZZA, INTERAGISCE E IMPARA: L’AI COGNITIVA PER L’HR
COMPRENDE, PERSONALIZZA, INTERAGISCE E  IMPARA: L’AI COGNITIVA PER L’HRCOMPRENDE, PERSONALIZZA, INTERAGISCE E  IMPARA: L’AI COGNITIVA PER L’HR
COMPRENDE, PERSONALIZZA, INTERAGISCE E IMPARA: L’AI COGNITIVA PER L’HRGiuseppe Rizzo
 
Understand, Answer and Argument: Conversational Agents
Understand, Answer and Argument: Conversational AgentsUnderstand, Answer and Argument: Conversational Agents
Understand, Answer and Argument: Conversational AgentsGiuseppe Rizzo
 
AI For Profiling Your Customers
AI For Profiling Your CustomersAI For Profiling Your Customers
AI For Profiling Your CustomersGiuseppe Rizzo
 
AI for Personalized Chatbot
AI for Personalized ChatbotAI for Personalized Chatbot
AI for Personalized ChatbotGiuseppe Rizzo
 
Tourist Knowledge Graph Creation to Automating Travel Bookings
Tourist Knowledge Graph Creation to Automating Travel BookingsTourist Knowledge Graph Creation to Automating Travel Bookings
Tourist Knowledge Graph Creation to Automating Travel BookingsGiuseppe Rizzo
 
Context-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity LinkingContext-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity LinkingGiuseppe Rizzo
 
From Data to Knowledge for Tourists
From Data to Knowledge for TouristsFrom Data to Knowledge for Tourists
From Data to Knowledge for TouristsGiuseppe Rizzo
 
Enabling Visitors to Explore a Smart City
Enabling Visitors to Explore a Smart CityEnabling Visitors to Explore a Smart City
Enabling Visitors to Explore a Smart CityGiuseppe Rizzo
 
NEEL2015 challenge summary
NEEL2015 challenge summaryNEEL2015 challenge summary
NEEL2015 challenge summaryGiuseppe Rizzo
 
Inductive Entity Typing Alignment
Inductive Entity Typing AlignmentInductive Entity Typing Alignment
Inductive Entity Typing AlignmentGiuseppe Rizzo
 
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...Giuseppe Rizzo
 
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot Frameworks
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot FrameworksCrossLanguageSpotter: A Library for Detecting Relations in Polyglot Frameworks
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot FrameworksGiuseppe Rizzo
 
Learning with the Web. Structuring data to ease machine understanding
Learning with the Web. Structuring data to ease  machine understandingLearning with the Web. Structuring data to ease  machine understanding
Learning with the Web. Structuring data to ease machine understandingGiuseppe Rizzo
 
Learning with the Web: Spotting Named Entities on the intersection of NERD an...
Learning with the Web: Spotting Named Entities on the intersection of NERD an...Learning with the Web: Spotting Named Entities on the intersection of NERD an...
Learning with the Web: Spotting Named Entities on the intersection of NERD an...Giuseppe Rizzo
 
NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud
NERD meets NIF:  Lifting NLP Extraction Results to the Linked Data CloudNERD meets NIF:  Lifting NLP Extraction Results to the Linked Data Cloud
NERD meets NIF: Lifting NLP Extraction Results to the Linked Data CloudGiuseppe Rizzo
 
L'enorme archivio di dati: il Web
L'enorme archivio di dati: il WebL'enorme archivio di dati: il Web
L'enorme archivio di dati: il WebGiuseppe Rizzo
 
NERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: Evaluating Named Entity Recognition Tools in the Web of DataNERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: Evaluating Named Entity Recognition Tools in the Web of DataGiuseppe Rizzo
 

More from Giuseppe Rizzo (20)

Artificial intelligence for social good
Artificial intelligence for social goodArtificial intelligence for social good
Artificial intelligence for social good
 
AI in 60 minutes
AI in 60 minutesAI in 60 minutes
AI in 60 minutes
 
COMPRENDE, PERSONALIZZA, INTERAGISCE E IMPARA: L’AI COGNITIVA PER L’HR
COMPRENDE, PERSONALIZZA, INTERAGISCE E  IMPARA: L’AI COGNITIVA PER L’HRCOMPRENDE, PERSONALIZZA, INTERAGISCE E  IMPARA: L’AI COGNITIVA PER L’HR
COMPRENDE, PERSONALIZZA, INTERAGISCE E IMPARA: L’AI COGNITIVA PER L’HR
 
Understand, Answer and Argument: Conversational Agents
Understand, Answer and Argument: Conversational AgentsUnderstand, Answer and Argument: Conversational Agents
Understand, Answer and Argument: Conversational Agents
 
AI For Profiling Your Customers
AI For Profiling Your CustomersAI For Profiling Your Customers
AI For Profiling Your Customers
 
AI for Personalized Chatbot
AI for Personalized ChatbotAI for Personalized Chatbot
AI for Personalized Chatbot
 
Tourist Knowledge Graph Creation to Automating Travel Bookings
Tourist Knowledge Graph Creation to Automating Travel BookingsTourist Knowledge Graph Creation to Automating Travel Bookings
Tourist Knowledge Graph Creation to Automating Travel Bookings
 
Context-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity LinkingContext-Enhanced Adaptive Entity Linking
Context-Enhanced Adaptive Entity Linking
 
From Data to Knowledge for Tourists
From Data to Knowledge for TouristsFrom Data to Knowledge for Tourists
From Data to Knowledge for Tourists
 
Enabling Visitors to Explore a Smart City
Enabling Visitors to Explore a Smart CityEnabling Visitors to Explore a Smart City
Enabling Visitors to Explore a Smart City
 
NEEL2015 challenge summary
NEEL2015 challenge summaryNEEL2015 challenge summary
NEEL2015 challenge summary
 
Inductive Entity Typing Alignment
Inductive Entity Typing AlignmentInductive Entity Typing Alignment
Inductive Entity Typing Alignment
 
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...
Benchmarking the Extraction and Disambiguation of Named Entities on the Seman...
 
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot Frameworks
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot FrameworksCrossLanguageSpotter: A Library for Detecting Relations in Polyglot Frameworks
CrossLanguageSpotter: A Library for Detecting Relations in Polyglot Frameworks
 
Learning with the Web. Structuring data to ease machine understanding
Learning with the Web. Structuring data to ease  machine understandingLearning with the Web. Structuring data to ease  machine understanding
Learning with the Web. Structuring data to ease machine understanding
 
Learning with the Web: Spotting Named Entities on the intersection of NERD an...
Learning with the Web: Spotting Named Entities on the intersection of NERD an...Learning with the Web: Spotting Named Entities on the intersection of NERD an...
Learning with the Web: Spotting Named Entities on the intersection of NERD an...
 
NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud
NERD meets NIF:  Lifting NLP Extraction Results to the Linked Data CloudNERD meets NIF:  Lifting NLP Extraction Results to the Linked Data Cloud
NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud
 
The NERD project
The NERD projectThe NERD project
The NERD project
 
L'enorme archivio di dati: il Web
L'enorme archivio di dati: il WebL'enorme archivio di dati: il Web
L'enorme archivio di dati: il Web
 
NERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: Evaluating Named Entity Recognition Tools in the Web of DataNERD: Evaluating Named Entity Recognition Tools in the Web of Data
NERD: Evaluating Named Entity Recognition Tools in the Web of Data
 

Annotating streams of heterogeneous data for topic generation

  • 1. Annotating streams of heterogeneous data for topic generation Giuseppe Rizzo giuseppe.rizzo@eurecom.fr @giusepperizzo
  • 2. Spotting entities while reading a document
      ➢ Names of People, Locations, Organizations, etc.
      ➢ Named entities are fundamental keys for topic understanding
      ➢ But the same location name can refer to different places
      source: http://goo.gl/kVzlK
      February 6, 2013, VU University Amsterdam, NL
  • 3. A Web of Linked Entities
      ➢ GGG (Giant Global Graph): http://goo.gl/fH3h
      ➢ Nodes are Web entities
      ➢ Entities provide disambiguation pointers
      ➢ Entities can be referred to unambiguously (disambiguated)
      ➢ Entities act as centroids for topic generation and understanding
      sources: http://wole2013.eurecom.fr, http://wole2012.eurecom.fr
  • 4. Entity extractors
      [Figure; rotated labels: Web API, URI disambiguation]
  • 5. Diversity
      Extractor          Language                 Granularity  Entity position  Classification schema          Classes  Response format              Quota (calls/day)
      AlchemyAPI         EN,FR,DE,IT,PT,RU,SP,SW  OEN          N/A              Alchemy                        324      JSON, MicroFormat, XML, RDF  30000
      DBpedia Spotlight  EN                       OEN          char offset      DBpedia, Freebase, Schema.org  320      HTML, JSON, RDF, XML         unlimited
      Extractiv          EN                       OEN          word offset      Extractiv                      34       HTML, JSON, RDF, XML         3000
      Lupedia            EN,FR,IT                 OEN          range of chars   DBpedia, LinkedMDB             319      HTML, JSON, RDFa, XML        unlimited
      OpenCalais         EN,FR,SP                 OEN          char offset      OpenCalais                     95       JSON, MicroFormat            50000
      Saplo              EN,SW                    OED          N/A              Saplo                          5        JSON                         1333
      SemiTags           DE,NL                    OED          char offset      CoNLL-3                        4        XML                          unlimited
      Wikimeta           EN,FR,SP                 OEN          POS offset       ESTER                          7        JSON, XML                    unlimited
      Yahoo!             EN                       OEN          range of chars   Yahoo                          13       JSON, XML                    5000
      Zemanta            EN                       OED          N/A              Freebase                       81       XML, JSON, RDF               10000
      (OEN: one entity per name; OED: one entity per document)
  • 6. Harmonizing annotations: http://nerd.eurecom.fr
      ➢ ontology: http://nerd.eurecom.fr/ontology
      ➢ REST API: http://nerd.eurecom.fr/api/application.wadl
      ➢ UI: http://nerd.eurecom.fr
  • 7. NERD Ontology
      NERD type     Occurrence
      Person        10
      Organization  10
      Country       6
      Company       6
      Location      6
      Continent     5
      City          5
      RadioStation  5
      Album         5
      Product       5
      ...           ...
      The NERD ontology has been integrated into the NIF project, an EU FP7 effort in the context of LOD2: Creating Knowledge out of Interlinked Data
  • 8. ETAPE2012
      ➢ DGA (French radio transcripts)
          – Train: 7h 50m
          – Dev: 3h
          – Eval: 3h
      ➢ ELDA (French TV transcripts)
          – Train: 18h 10m
          – Dev: 7h 55m
          – Eval: 7h 55m
      ➢ Annotation schema: Quaero, 32 classes
  • 9. We can do better: combined (ETAPE 2012)
      Pipeline stages: extraction, cleaning, fusion
      Each extractor A, ..., N returns tuples (e, t, URI, si, ei): (eA1, tA1, URIA1, siA1, eiA1), (eA2, tA2, URIA2, siA2, eiA2), (eA3, tA3, URIA3, siA3, eiA3), ..., (eN1, tN1, URIN1, siN1, eiN1), (eN2, tN2, URIN2, siN2, eiN2)
      When at least 2 extractors classify the same entity with a different type, we apply a preferred selection order (learning rules): Wikimeta, AlchemyAPI, OpenCalais, Lupedia
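The fusion rule on slide 9 can be sketched as follows. This is a minimal illustration, not the NERD implementation: it assumes that `si` and `ei` are start and end offsets, that "the same entity" means the same text span, and that every annotation comes from one of the four listed extractors. The `Annotation` type, the `fuse` function, and the sample types and offsets are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

# Preferred selection order from slide 9 (highest priority first).
PRIORITY = ["wikimeta", "alchemyapi", "opencalais", "lupedia"]


class Annotation(NamedTuple):
    extractor: str  # extractor that produced the tuple
    entity: str     # e: surface form
    type: str       # t: assigned class
    uri: str        # URI: disambiguation link (may be empty)
    start: int      # si: assumed to be the start offset
    end: int        # ei: assumed to be the end offset


def fuse(annotations):
    """Keep one annotation per text span, resolving conflicts by PRIORITY."""
    by_span = defaultdict(list)
    for ann in annotations:
        by_span[(ann.start, ann.end)].append(ann)

    fused = []
    for span in sorted(by_span):
        # The lowest index in PRIORITY is the most preferred extractor.
        best = min(by_span[span], key=lambda a: PRIORITY.index(a.extractor))
        fused.append(best)
    return fused


# Hypothetical sample data: two extractors disagree on the type of "Paris".
annotations = [
    Annotation("lupedia", "Paris", "City", "", 0, 5),
    Annotation("wikimeta", "Paris", "Location", "", 0, 5),
    Annotation("opencalais", "EURECOM", "Organization", "", 10, 17),
]
result = fuse(annotations)
```

Here the conflict on the span (0, 5) is resolved in favour of Wikimeta because it comes first in the preferred order; spans reported by a single extractor pass through unchanged.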
  • 10. … but it introduced systematic errors (ETAPE 2012)
                       SLR (Slot Error Rate)  prec    recall  F1      %correct
      alchemyapi       37.71%                 47.95%  5.45%   9.68%   5.45%
      lupedia          39.49%                 22.87%  1.56%   2.91%   1.56%
      opencalais       37.47%                 41.69%  3.53%   6.49%   3.53%
      wikimeta         36.67%                 19.40%  4.25%   6.95%   4.25%
      combined (nerd)  86.85%                 35.31%  17.69%  23.44%  17.69%
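As a reading aid for the prec, recall and F1 columns of slide 10, the sketch below recomputes F1 as the harmonic mean of precision and recall. It assumes the standard definition of F1; the slides do not say how the scores were averaged, and the recomputed values come out close to, but not identical with, the reported ones, so treat this as an illustration rather than the evaluation script. The function name `f1` and the dictionary layout are choices made for this example.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in %) copied from slide 10.
rows = {
    "lupedia": (22.87, 1.56),
    "opencalais": (41.69, 3.53),
    "combined (nerd)": (35.31, 17.69),
}

# Recompute F1 for each row, rounded to two decimals like the slide.
recomputed = {name: round(f1(p, r), 2) for name, (p, r) in rows.items()}
```

The large gap between precision and recall explains why every F1 value stays low: the harmonic mean is dominated by the smaller of the two numbers.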
  • 11. Gazetteers: combined+ (ETAPE 2012)
      Diagram labels: POS tagger, learned model, created static rules, apply rules, fusion
      Extractor tuples (eA1, tA1, URIA1, siA1, eiA1), (eA2, tA2, URIA2, siA2, eiA2), ..., (eN1, tN1, URIN1, siN1, eiN1) are fused into (e1, t1, URI1, si1, ei1)
      Conflicts handled by priority selection: own, Wikimeta, AlchemyAPI, OpenCalais, Lupedia
  • 12. Over-estimated training model (ETAPE 2012)
                 SLR (Slot Error Rate)  prec    recall  F1      %correct
      combined   86.85%                 35.31%  17.69%  23.44%  17.69%
      combined+  188.81%                15.13%  28.40%  19.45%  28.40%
  • 13. General NER limitations
      ➢ Performance drops
          – with common settings using off-the-shelf models, when annotating corpora that differ from the data the model was trained on (empirically, recall drops by ~20%)
          – with noisy texts such as transcripts and microposts
      ➢ Lack of knowledge for particular categories, in particular Event
  • 14. Participation at the #MSM2013 challenge (ongoing)
      ➢ English Twitter posts
          – Train: 2815 posts
          – Eval: 1526 posts
      ➢ Annotation schema: 4 classes
      ➢ Objective: perform better than the Stanford CRF, properly trained with the challenge settings
            prec    recall  F1
      LOC   80.12%  57.76%  67.63%
      MISC  68.18%  31.51%  43.10%
      ORG   83.28%  50.71%  63.04%
      PER   79.93%  70.72%  75.04%
      4-fold cross validation over the training set: provisional results of the Stanford CRF
  • 15. Poor performance in spotting events
      ➢ Exploit the large domain knowledge represented by the Eventmedia dataset (http://eventmedia.eurecom.fr/sparql)
      ➢ EventSpotter
          – Entities classified according to the LODE ontology
          – Spotting according to the event name, agents, temporal and geospatial information
          – Confidence computed according to the similarity between the text surrounding the spotted entity and the event description
          – Disambiguation provided by the event URIs (nodes of the Eventmedia graph)
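The confidence step of slide 15 compares the text around a spotted entity with the description of a candidate event. The slides do not name the similarity measure, so the sketch below assumes cosine similarity over simple word counts; the functions `tokens`, `cosine` and `confidence` and the sample sentences are hypothetical and only illustrate the idea.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> Counter:
    """Lowercased word counts: a deliberately simple bag-of-words model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors, in [0, 1]."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def confidence(surrounding_text: str, event_description: str) -> float:
    """Score how well an event description matches the entity's context."""
    return cosine(tokens(surrounding_text), tokens(event_description))


# Hypothetical context and two candidate event descriptions.
context = "Radiohead will play live in Amsterdam next week"
scores = {
    "event_a": confidence(context, "Radiohead live concert in Amsterdam"),
    "event_b": confidence(context, "Football match in Turin"),
}
```

With these inputs the first candidate shares several words with the context and scores higher than the second, which shares only a function word. A real system would likely remove stop words and weight terms, which this sketch leaves out.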
  • 16. Entities for concept mining
      ➢ Used to annotate textual data
          – news articles, and ...
      ➢ Video transcripts:
          – video segmentation (MediaFragment)
          – MediaFragment annotation
          – indexing
          – topic model generation
      ➢ Microposts:
          – text understanding
          – topic model generation
  • 17. Media Fragment Enricher
      Joint work between the University of Southampton and EURECOM
      source: http://goo.gl/BMZK3
  • 18. Annotating social streams
      ➢ Live and fresh breaking news: microposts
      ➢ Media items, such as pictures and videos, are usually attached to microposts
      ➢ Grouping microposts:
          – Entity labels
          – Entity classes
          – Latent Dirichlet allocation (LDA)
          – Density-based micropost proximity (text similarity, entity similarity, temporal distance)
      ➢ Create textual storyboards from vox populi
      ➢ Visually describe the created storyboards
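The proximity-based grouping on slide 18 combines text similarity, entity similarity and temporal distance. The slides give neither the weights nor the clustering algorithm, so the sketch below makes several assumptions: Jaccard distance for text and entities, a one-hour window for time, equal weights, and a plain threshold-linking step used as a stand-in for a real density-based algorithm such as DBSCAN. All names and sample posts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Micropost:
    text: str
    entities: frozenset  # entity labels spotted in the post
    timestamp: float     # seconds since an arbitrary origin


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance(p: Micropost, q: Micropost, window: float = 3600.0) -> float:
    """Equal-weight mix of text, entity and temporal distance, each in [0, 1]."""
    text_d = jaccard_distance(set(p.text.lower().split()), set(q.text.lower().split()))
    entity_d = jaccard_distance(set(p.entities), set(q.entities))
    time_d = min(abs(p.timestamp - q.timestamp) / window, 1.0)
    return (text_d + entity_d + time_d) / 3.0


def group(posts, eps: float = 0.5):
    """Link posts closer than eps and return the connected groups of indices."""
    groups, seen = [], set()
    for i in range(len(posts)):
        if i in seen:
            continue
        seen.add(i)
        stack, members = [i], []
        while stack:
            j = stack.pop()
            members.append(j)
            for k in range(len(posts)):
                if k not in seen and distance(posts[j], posts[k]) < eps:
                    seen.add(k)
                    stack.append(k)
        groups.append(sorted(members))
    return groups


# Hypothetical microposts: two about the same event, one unrelated.
posts = [
    Micropost("obama speaks in berlin", frozenset({"Obama", "Berlin"}), 0.0),
    Micropost("obama in berlin now", frozenset({"Obama", "Berlin"}), 600.0),
    Micropost("juventus wins in turin", frozenset({"Juventus", "Turin"}), 900.0),
]
groups = group(posts)
```

The first two posts share entities, most of their words and a short time gap, so they end up in one group, while the third stays on its own. Swapping the linking step for DBSCAN would add a minimum-neighbour requirement and a notion of noise points, which this simplified version does not model.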
  • 19. Centroids for topic generation
      ➢ Each cloud represents a topic
      ➢ A topic is depicted by an entity
      ➢ Leaves are media items, which visually represent the microposts
      ➢ Each leaf can belong to many topics
  • 20. Topic storyboard
      ➢ Visual summary of the topic
      ➢ Topic is labelled with an entity
      ➢ A poster picture is displayed according to the relevance of the micropost in the generated topic
      ➢ If the entity points to a LOD resource, a textual description is displayed
  • 21. Outlook
      ➢ Modelling heterogeneous data with entities
      ➢ Linking entities according to the topics extracted from the text
      ➢ Enhancing topic modelling with the GGG
      ➢ Providing visual storyboards tailored to the extracted topics
  • 22. Thanks for your time and attention
      Agenda:
          – Web of Linked Entities (sl. 3)
          – Aligning annotations (sl. 6)
          – Combining performances of 3rd-party entity extractors (sl. 9)
          – Spotting events (sl. 15)
          – Annotating MFs and microposts for topic generation (sl. 16)
          – Topic storyboard generation (sl. 19)
      http://www.slideshare.net/giusepperizzo