Topic/S – A Topic and Trend Recognition Approach in News-Media, I-Semantics13

Information extraction, modelling, and storage of semantic data to recognize trending topics for journalism and newspaper offices.


  1. Michael Aleythe, Martin Voigt, Peter Wehner. Sächsische AufbauBank, Research and Development - Project Funding, project number 99457/2677
  2. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Conclusion. Friday, 06.09.2013 Topic/S Slide 1
  3. Motivation: Newsroom (Image source: ringier.com)
  4. Problem. In-house production: Archive, Online. News agencies: DPA, Reuters, KNA, … Web, social media: Twitter, Facebook, Blogs, … Overwhelming amount of data, e.g., WAZ: 5000 articles/day from agencies and in-house production
  5. Vision: Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem). Investigation of trending topics. Push them to the editor. Diagram labels: Media Assets (MA1, MA2), Named Entities (E1 to E7), Topics (T1 to T3); stages: Pre-Processing, Post-Processing
  6. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview, Information Extraction, Storage, Topic Detection); Demo; Conclusion
  7. Workflow
  8. Workflow: Preprocessor. Language Recognition (German/English): rule based. Named Entity Extraction: word list + statistics. Keyword Extraction: lemmatization, word list. Categorisation: source based. (Image source: onelanguageoneposter.com)
  9. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview, Information Extraction, Storage, Topic Detection); Demo; Conclusion
  10. Semantic Model
  11. Semantic Facts: Named Entities required but no lists available
      SemItem        Number (with alt. names)
      Person         1,504,341 (2,499,962)
      Organization   63,332 (98,127)
      Place          89,702 (95,178)
      Keyword        1,351
      Stored preferred and alternative names
      ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
      Names: Rene Muller, Rene Müller, René Muller, René Müller
      Triples without SemItems: 27.6 million
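Storing preferred and alternative names suggests a lookup from any surface form to one SemItem ID. The following is a minimal sketch under stated assumptions, not the project's implementation (which stores these facts as triples): the `SemItem` class and `build_name_index` helper are hypothetical, and treating the first listed name as the preferred one is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemItem:
    """Hypothetical container for one semantic item and its names."""
    uri: str
    kind: str
    preferred_name: str
    alt_names: List[str] = field(default_factory=list)


def build_name_index(items):
    """Map every lower-cased name (preferred or alternative) to a set of URIs."""
    index = {}
    for item in items:
        for name in [item.preferred_name, *item.alt_names]:
            index.setdefault(name.lower(), set()).add(item.uri)
    return index


# Example taken from the slide; which name is "preferred" is assumed.
rene = SemItem(
    uri="http://www.topic-s.de/topics-facts/id/person/Rene_Muller",
    kind="person",
    preferred_name="Rene Muller",
    alt_names=["Rene Müller", "René Muller", "René Müller"],
)
index = build_name_index([rene])
```

Mapping to a set of URIs rather than a single URI keeps ambiguous names (several SemItems sharing one spelling) visible for a later disambiguation step.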
  12. Storage of Semantic Data: Using Oracle 11gR2. Pros: already available, existing knowledge; integrated querying of relational and semantic data. Cons: inference; incomplete SPARQL 1.1 support; limited custom rule support. Benchmark of triple stores [Voigt2012]
  13. Structure: Motivation, Problems, and Goals; Topic/S Workflow (Overview, Information Extraction, Storage, Topic Detection); Demo; Conclusion
  14. Workflow: Topic Detection. Clustering
  15. Workflow: Topic Detection. Clustering
  16. Workflow: Topic Detection. Clustering. Example clusters: Obama, Merkel (Politics); Audi, Highway (Traffic)
  17. Workflow: Topic Detection. Clustering (Top Cluster 25.08.2013)
      Articles | Name | HotTopic
      43 | Bundesliga, Fußball, Spieltag, 1. FC Union Berlin, SC Paderborn 07 eV, FC Augsburg, FSV Frankfurt | Yes
      25 | Euro, SPD, Berlin, Griechenland, FDP, CDU, Deutschland | Yes
      19 | Bericht, Diplomat, Google Inc, Anbieter, Berlin, Deutschland, Auto | Yes
      18 | Veranstaltung, Bernd Lucke, Angreifer, Berlin, Polizei, Angriff, Deutschland | Yes
      15 | Gericht, Prozess, Bo Xilai, Christian Wulff, Anklage, Verfahren, Mord | Yes
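The slides do not name the clustering algorithm, so the following is only an illustrative sketch of how articles could be grouped by shared SemItems and ranked by article count. It assumes a greedy single-pass scheme with Jaccard similarity; the function names, the threshold of 0.3, and the toy articles are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles, threshold=0.3):
    """Greedy single-pass clustering: put each article into the first cluster
    whose accumulated SemItem set is similar enough, else open a new one."""
    clusters = []  # each cluster: {"items": set of SemItems, "articles": [ids]}
    for article_id, sem_items in articles.items():
        for cluster in clusters:
            if jaccard(cluster["items"], sem_items) >= threshold:
                cluster["items"] |= sem_items
                cluster["articles"].append(article_id)
                break
        else:
            clusters.append({"items": set(sem_items), "articles": [article_id]})
    # Largest clusters first, similar to a "top cluster" ranking by article count.
    return sorted(clusters, key=lambda c: len(c["articles"]), reverse=True)


# Toy input: article ID -> SemItems extracted by the preprocessor (assumed data).
articles = {
    "a1": {"Obama", "Merkel", "Politics"},
    "a2": {"Merkel", "Politics", "Berlin"},
    "a3": {"Audi", "Highway", "Traffic"},
}
top = cluster_articles(articles)
```

Here `a1` and `a2` share two of four distinct SemItems and end up in one cluster, while `a3` shares none and forms its own.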
  18. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Conclusion
  19. Live Demo
  20. Structure: Motivation, Problems, and Goals; Topic/S Workflow; Demo; Conclusion
  21. Sum it up! Result: identifying topics and pushing them to the editor. Lessons learned: NER is bad for non-English text, combination required; the model needs to be optimized for queries; a dedicated user interface is required. Outlook: prediction of topics with causal/temporal relations. (Image sources: ooltapulta.com, business-strategy-innovation.com)
  22. Thanks! Questions?
  23. Workflow: Preprocessor. Named Entity Recognition. Word list: tool LingPipe + extension; sources: LOD (DBpedia, Geonames, YAGO2, GND); advantages: controlled vocabulary, guaranteed recognition of entities. Statistics: tool Stanford NLP; source: pre-trained model; advantage: recognition of unknown entities. (Image source: churchthought.com)
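The combination of a word list with a statistical recognizer can be sketched as a union of both result sets in which word-list hits take precedence. This sketch does not call LingPipe or Stanford NLP: `GAZETTEER` is a toy word list, and `statistical_ner_stub` is a deliberately crude stand-in (capitalized word sequences) for a trained model. All names here are hypothetical.

```python
import re

# Toy word list; the project builds its lists from LOD sources.
GAZETTEER = {
    "angela merkel": "Person",
    "audi": "Organization",
    "berlin": "Place",
}


def gazetteer_ner(text):
    """Return {surface form: type} for every word-list entry found in the text."""
    found = {}
    for name, kind in GAZETTEER.items():
        match = re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
        if match:
            found[match.group(0)] = kind
    return found


def statistical_ner_stub(text):
    """Stand-in for a trained model: runs of two or more capitalized words
    are treated as entity candidates of unknown type."""
    pattern = r"\b[A-ZÄÖÜ]\w+(?: [A-ZÄÖÜ]\w+)+\b"
    return {m: "Unknown" for m in re.findall(pattern, text)}


def combined_ner(text):
    """Union of both recognizers; word-list hits win on conflicts."""
    entities = statistical_ner_stub(text)
    entities.update(gazetteer_ner(text))
    return entities


result = combined_ner("Angela Merkel met Bernd Lucke in Berlin.")
```

In this toy run, the word list types "Angela Merkel" and "Berlin", while "Bernd Lucke" is only found by the stand-in recognizer and stays untyped, which mirrors the two advantages listed on the slide.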
  24. Workflow: Preprocessor. Categorization. Diagram labels: Article DPA; Categoriser Reuters, Categoriser DPA, Categoriser OTS; IPTC Media Topic: Politics
  25. Workflow: Preprocessor. Categorization - Quality
      News agency   Accuracy
      KNA           80.3 %
      DPA           94.4 %
      EPD           80.3 %
      Reuters       90.8 %
      OTS           93.5 %
      AFP           86 %
      Method                      Accuracy
      One cat. for all agencies   85 %
      One cat. per agency         87.5 %
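As a quick plausibility check, the unweighted mean of the six per-agency accuracies can be computed directly. It comes out at about 87.55 %, which is close to the 87.5 % reported for one categoriser per agency. Whether the slide's figure was actually derived as an unweighted mean is an assumption; the slides do not say.

```python
# Per-agency accuracies in percent, as listed on the slide.
per_agency = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Unweighted mean over agencies (assumption: every agency counts equally).
mean_accuracy = sum(per_agency.values()) / len(per_agency)
print(f"unweighted mean: {mean_accuracy:.2f} %")
```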
  26. Workflow: Preprocessor. Keywords: lemmatization; developing a word list; extraction using the word list. Bonus: frequent terms of an article. (Image source: hugdaily.org)
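The keyword steps (lemmatize, match against a word list, collect frequent terms) can be sketched as follows. This is a toy illustration, not the project's pipeline: `LEMMAS` and `KEYWORD_LIST` are hypothetical stand-ins for a real lemmatizer and the developed word list, and the tokenizer is a plain whitespace split.

```python
from collections import Counter

# Toy lemma dictionary and keyword list; both are hypothetical stand-ins.
LEMMAS = {"trials": "trial", "courts": "court", "judges": "judge"}
KEYWORD_LIST = {"trial", "court", "verdict"}


def lemmatize(token):
    """Dictionary lookup; unknown tokens are returned unchanged."""
    return LEMMAS.get(token, token)


def extract_keywords(text, top_n=2):
    """Return (word-list keywords found, most frequent lemmas of the text)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    lemmas = [lemmatize(t) for t in tokens if t]
    keywords = sorted({lemma for lemma in lemmas if lemma in KEYWORD_LIST})
    frequent = [word for word, _ in Counter(lemmas).most_common(top_n)]
    return keywords, frequent


keywords, frequent = extract_keywords(
    "Courts postponed two trials. The court named new judges."
)
```

Lemmatizing before the lookup lets "Courts" and "court" count as the same term. A real system would likely also filter stop words before counting frequent terms.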
  27. Disambiguation (Image sources: de.wikipedia.org, fansshare.com, lounge.espdisk.com)
  28. Disambiguation. Identification of entity cluster: Michael Jackson, Beer. Internal facts: Michael Jackson: Beer, Whiskey. External facts (DBpedia, etc.): Michael Jackson: Music, King of Pop. Problem: not all SemItems are available in the LOD
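One way to read the Michael Jackson example is as an overlap test: the SemItems found alongside the ambiguous name are compared with the facts known for each candidate. The sketch below assumes that reading. The candidate IDs and the `disambiguate` function are hypothetical; only the fact sets come from the slide.

```python
# Hypothetical candidate IDs; the fact sets follow the slide's example.
CANDIDATES = {
    "Michael_Jackson_(writer)": {"Beer", "Whiskey"},
    "Michael_Jackson_(singer)": {"Music", "King of Pop"},
}


def disambiguate(context_items, candidates=CANDIDATES):
    """Pick the candidate whose known facts overlap most with the SemItems
    found around the ambiguous name; return None when nothing overlaps."""
    scores = {cid: len(facts & context_items) for cid, facts in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


choice = disambiguate({"Beer", "Brewery", "Whiskey"})
unknown = disambiguate({"Football"})
```

Returning `None` when no fact overlaps reflects the problem noted on the slide: if a SemItem has no facts available, the comparison cannot decide.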
