
Towards Topics-based, Semantics-assisted News Search | WIMS13

Information extraction, modelling, and storage of semantic data to recognize trending topics for journalism and newspaper offices.

Published in: Technology, Education

  1. Sächsische AufbauBank, Research and Development Project Funding, project number 99457/2677. Martin Voigt, Michael Aleythe, Peter Wehner
  2. Structure (Topic/S, Friday, 14.06.2013)
      – Motivation, Problems, and Goals
      – Topic/S Workflow
      – Demo
      – Current and Upcoming Task
      – Conclusion
  4. Motivation
      – fink & PARTNER Media Services GmbH: media management for publishing houses; some customers
      – Chair of Multimedia Technology, TU Dresden: research fields are adaptive, composite Rich Internet Applications and semantic document life cycle management
  5. Motivation: Newsroom (Source: ringier.com)
  6. Problem: Overwhelming amount of data
      – e.g., Mainpost: 2000 articles/day from agencies and in-house production
      – Diagram labels: DPA, Reuters, KNA, Twitter, Facebook, Blogs, …; News agencies; Web, social media, …; In-house production; Archive; Online
  7. Problem
  8. Problem: Hard to identify topics
      – Browsing
      – Keyword identification
      – And their relations, media, and trend
      – Source: Zeit.de
  9. Vision
      – Automatic topic discovery using Named Entities and other keywords (Semantic Items, SemItem)
      – Investigation of trending topics
      – Push them to the editor
      – Diagram: Pre-Processing links Media Assets (MA1, MA2) to Named Entities (E1 to E7); Post-Processing adds Topics (T1 to T3)
  10. Requirements
      – Extraction and disambiguation of (German) SemItems
      – Model and storage of semantic information
      – Topic and trend discovery
      – Scalable architecture for business use case
  11. Structure
      – Motivation, Problems, and Goals
      – Topic/S Workflow
          • Overview
          • Pre-Processing
          • Semantic Model, Facts, and Storage
          • Post-Processing
          • Search and User Interface
      – Demo
      – Current and Upcoming Task
      – Conclusion
  12. Workflow
  14. Workflow: Preprocessor, Language Recognition
      – Based on article content
      – Supports German/English
      – Rule-based solution:
          • Words with capital letter (en 18% vs. de 43%)
          • Occurrence of umlauts (ä, ö, ü)
          • Existence of language-specific words (en: of, to, and, a, for, the, that; de: der, das, und, sich, auf)
      – Precision: 99%
      – Source: onelanguageoneposter.com
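The three rules on this slide can be sketched as a simple voting scheme. This is a minimal illustration, not the Topic/S implementation: the marker words come from the slide, while the function name `detect_language`, the 0.30 capitalization threshold (an assumed midpoint between the 18% and 43% quoted above), and the equal weighting of the votes are all assumptions.

```python
import re

# Marker words are taken from the slide; everything else is illustrative.
EN_WORDS = {"of", "to", "and", "a", "for", "the", "that"}
DE_WORDS = {"der", "das", "und", "sich", "auf"}
UMLAUTS = set("äöüÄÖÜ")


def detect_language(text: str) -> str:
    """Return 'de' or 'en' based on three rule-based votes."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not words:
        return "en"  # arbitrary default for empty input (assumption)

    score = 0  # positive leans German, negative leans English

    # Rule 1: share of capitalized words (slide: en 18% vs. de 43%).
    cap_ratio = sum(w[0].isupper() for w in words) / len(words)
    score += 1 if cap_ratio > 0.30 else -1  # 0.30 is an assumed threshold

    # Rule 2: any umlaut counts as one vote for German.
    if any(ch in UMLAUTS for ch in text):
        score += 1

    # Rule 3: compare hits against the language-specific word lists.
    lowered = [w.lower() for w in words]
    de_hits = sum(w in DE_WORDS for w in lowered)
    en_hits = sum(w in EN_WORDS for w in lowered)
    if de_hits > en_hits:
        score += 1
    elif en_hits > de_hits:
        score -= 1

    return "de" if score > 0 else "en"
```

A call such as `detect_language(article_text)` would then decide which language-specific pipeline handles the article; real thresholds would have to be tuned on agency data.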
  15. Workflow: Preprocessor, Keywords
      – Lemmatization
      – Developing a word list
      – Extraction using the word list
      – Bonus: frequent terms of an article
      – Source: hugdaily.org
  16. Workflow: Preprocessor, Categorisation
      – Classification of text
      – One categoriser per news agency
      – IPTC categories
      – Categories useful for identifying topics
  17. Workflow: Preprocessor, Categorisation - Training
      – Diagram labels: Politics; Article; IPTC Media Topic; Categoriser
  18. Workflow: Preprocessor, Categorisation - Training
      – Diagram labels: Politics; Article; IPTC Media Topic; one Categoriser each for OTS, Reuters, and DPA
  19. Workflow: Preprocessor, Categorisation
      – Diagram labels: Politics; Article (DPA); IPTC Media Topic; Categoriser OTS; Categoriser DPA; Categoriser Reuters
  20. Workflow: Preprocessor, Categorisation - Quality
        News agency                  Accuracy
        KNA                          80.3%
        DPA                          94.4%
        EPD                          80.3%
        Reuters                      90.8%
        OTS                          93.5%
        AFP                          86%

        Method                       Accuracy
        One cat. for all agencies    85%
        One cat. per agency          87.5%
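As a quick arithmetic check on the figures above, the unweighted mean of the six per-agency accuracies comes out at about 87.55, which is close to the 87.5% reported for one categoriser per agency. Whether the slide's method figure was actually computed as an unweighted mean is an assumption; the variable names are illustrative.

```python
# Per-agency accuracies in percent, copied from the slide.
agency_accuracy = {
    "KNA": 80.3,
    "DPA": 94.4,
    "EPD": 80.3,
    "Reuters": 90.8,
    "OTS": 93.5,
    "AFP": 86.0,
}

# Unweighted mean over agencies (assumption: every agency counts equally,
# regardless of how many articles it delivers).
mean_accuracy = sum(agency_accuracy.values()) / len(agency_accuracy)

# Gain over the single shared categoriser (85% on the slide).
gain_over_shared = mean_accuracy - 85.0

print(f"mean per-agency accuracy: {mean_accuracy:.2f}%")
print(f"gain over one categoriser for all agencies: {gain_over_shared:.2f} points")
```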
  21. Workflow: Preprocessor, Named Entity Recognition
      – Recognition of persons, organizations, places
      – Two methods: word list, statistics
      – Additional information:
          • occurrence count
          • text part the NE appeared in
      – Source: churchthought.com
  22. Workflow: Preprocessor, Named Entity Recognition – Approaches
      – Word list
          • Tool: LingPipe + Extension
          • Sources: LOD (DBpedia, Geonames, YAGO2)
          • Advantages: controlled vocabulary, guaranteed recognition of entities
      – Statistics
          • Tool: Stanford NLP
          • Source: pre-trained model
          • Advantage: recognition of unknown entities
  24. Semantic Model
      – Requirements: information life cycle | simple | fast querying | schema reuse | inference | ...
      – Foundations:
          • SNaP Ontologies, IPTC NewsCodes, W3C Ontology for Media Resources, schema.org
          • RDFS, less OWL
          • Conventions, versioning, and documentation
  25. Semantic Model
  26. Storage of Semantic Data
      – Benchmark of triple stores [Voigt2012]
      – No benchmark found with real-world data, inference, SPARQL 1.1, and multi-client
      – What have we done?
          • 4 datasets, 5 stores, 15 queries per dataset
          • Loading time, memory requirement, per-query-type and multi-client performance
      – Result: no clear recommendation, strongly depends on project requirements
  27. Storage of Semantic Data: Using Oracle 11gR2
      – Pros
          • Already available, existing knowledge
          • Nearly as fast as Virtuoso etc.
          • Integrated querying of relational and semantic data
          • Spatial data mining features
      – Cons
          • Inference
          • Incomplete SPARQL 1.1 support
          • Limited custom rule support
  28. Semantic Facts
      – Named Entities required but no lists available
      – Manual search, extraction, and cleaning for named entities from YAGO2, Freebase, JRC_Names, Tagesspiegel, DBpedia
      – Stored preferred and alternative names
          • ID: http://www.topic-s.de/topics-facts/id/person/Rene_Muller
          • Names: Rene Muller, Rene Müller, René Muller, René Müller
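The preferred/alternative-name storage described on this slide can be pictured as a lookup from any surface form to one entity ID. The sketch below is an illustration only: the ID and the four names are from the slide, while the diacritic folding in `fold`, the helper names `build_index` and `lookup`, and the in-memory dictionary are assumptions (the slide states that the variants are stored explicitly, not that they are derived by folding).

```python
import unicodedata


def fold(name: str) -> str:
    """Strip diacritics and case so 'René Müller' and 'Rene Muller' collide."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold()


# Entity ID and names as shown on the slide.
FACTS = {
    "http://www.topic-s.de/topics-facts/id/person/Rene_Muller": [
        "Rene Muller",
        "Rene Müller",
        "René Muller",
        "René Müller",
    ],
}


def build_index(facts: dict) -> dict:
    """Map every folded name variant to its entity ID."""
    index = {}
    for entity_id, names in facts.items():
        for name in names:
            index[fold(name)] = entity_id
    return index


def lookup(index: dict, surface_form: str):
    """Return the entity ID for a surface form, or None if unknown."""
    return index.get(fold(surface_form))


INDEX = build_index(FACTS)
```

Folding trades precision for recall: it also merges names that differ only by accents but belong to different people, which is one reason a later disambiguation step is needed.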
  29. Semantic Facts
      – BUT named entities alone cause bad topics; keywords are required, e.g., Waffenstillstand (cease-fire), Meister (champion), Klimaschutz (climate protection), …
      – Some numbers
          • Triples without SemItems: 10.3 million

        SemItem         Number     (with alt. names)
        Person          590,828    (860,594)
        Organization     63,262    (98,052)
        Place            89,672    (95,146)
        Keyword           1,329
  31. Workflow: Postprocessor, Clustering
  32. Workflow: Postprocessor, Clustering
  33. Workflow: Postprocessor, Clustering
      – Diagram labels: Merkel, Politics, Highway, Traffic, Audi, Obama
  34. Workflow: Postprocessor, Clustering (Top Cluster 06.06.2013)
      – 7 articles, first date 6.6., hot topic: No
          • Name: "Bürgermeister", "Gemeinde", "Gemeinderat", "Kosten"
      – 4 articles, first date 6.6., hot topic: Yes
          • Name: "Abzug", "Bürgerkrieg", "Grenze", "Soldat", "Österreich", "Syrien", "Tel Aviv", "Vereinten Nationen"
      – 3 articles, first date 6.6., hot topic: Yes
          • Name: "Vertrag", "Vorstandschef", "München", "FC Bayern München", "FC Bayern München AG", "Olympique Marseille", "Daniel Van Buyten", "Franck Ribery", "Karl-Heinz Rummenigge"
      – 2 articles, first date 4.6., hot topic: Yes
          • Name: "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
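The slides do not name the clustering algorithm, so the following is only a sketch of one plausible approach: greedy single-pass clustering of articles by Jaccard overlap of their SemItem sets. The function names, the 0.3 threshold, and the example article IDs are hypothetical; the SemItems are borrowed from the slides.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles: dict, threshold: float = 0.3) -> list:
    """Assign each article to the most similar existing cluster, or open a new one.

    `articles` maps an article ID to its SemItems. Each cluster keeps the
    union of its members' SemItems, which also serves as the cluster name.
    """
    clusters = []
    for article_id, sem_items in articles.items():
        items = set(sem_items)
        best, best_score = None, threshold
        for cluster in clusters:
            score = jaccard(items, cluster["items"])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append({"items": set(items), "articles": [article_id]})
        else:
            best["articles"].append(article_id)
            best["items"] |= items
    return clusters


# Hypothetical articles using SemItems that appear on the slides.
EXAMPLE = {
    "a1": ["Protest", "Istanbul", "Recep Tayyip Erdogan"],
    "a2": ["Protest", "Istanbul", "Demonstrant"],
    "a3": ["Vertrag", "FC Bayern München", "Franck Ribery"],
}
CLUSTERS = cluster_articles(EXAMPLE)
```

With this input the two Istanbul articles share two of four distinct SemItems (similarity 0.5) and end up together, while the football article opens its own cluster. A single-pass scheme depends on input order; a production system would more likely use a standard clustering method with tuned parameters.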
  35. Workflow: Postprocessor, Topic trend
      – 4.6., 6 articles
          • SemItems: "Demonstrant", "Ministerpräsident", "Protest", "Regierung", "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"
      – 5.6., 14 articles
          • SemItems: "Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"
      – 6.6., 2 articles
          • SemItems: "Ministerpräsident", "Protest", "Istanbul", "Tunis", "Recep Tayyip Erdogan"
      – 7.6., 9 articles
          • SemItems: "Demonstrant", "Protest", "Recep Tayyip Erdogan"
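To make the trend table concrete, the sketch below derives day-over-day changes, the peak day, and the SemItems that persist across all four days. The dates, article counts, and SemItems are from the slide; how Topic/S actually scores a trend is not stated, so these three measures and all function names are assumptions.

```python
# (date, article count, SemItems) rows copied from the slide.
TREND = [
    ("4.6.", 6, {"Demonstrant", "Ministerpräsident", "Protest", "Regierung",
                 "Stadtteil", "Istanbul", "Recep Tayyip Erdogan"}),
    ("5.6.", 14, {"Demonstrant", "Protest", "Istanbul", "Recep Tayyip Erdogan"}),
    ("6.6.", 2, {"Ministerpräsident", "Protest", "Istanbul", "Tunis",
                 "Recep Tayyip Erdogan"}),
    ("7.6.", 9, {"Demonstrant", "Protest", "Recep Tayyip Erdogan"}),
]


def daily_changes(rows):
    """Difference in article count between each day and the previous one."""
    return [(day, count - prev_count)
            for (_, prev_count, _), (day, count, _) in zip(rows, rows[1:])]


def peak_day(rows):
    """Date with the highest article count."""
    return max(rows, key=lambda row: row[1])[0]


def persistent_items(rows):
    """SemItems present on every day of the series."""
    return set.intersection(*(items for _, _, items in rows))


CHANGES = daily_changes(TREND)
PEAK = peak_day(TREND)
CORE = persistent_items(TREND)
```

The persistent SemItems give a stable label for the topic even though the surrounding vocabulary shifts from day to day.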
  37. Workflow: Related Article
      • Person
      • Location
      • Organisation
      • Keywords
  38. Workflow: Related Article - relatedness
      • Computes topic-based difference between articles
      • Detecting main entities in articles
      • Navigation recommendation for user
  39. Workflow: Related Article - relatedness
      – Diagram labels, in transcript order: Bernd Lucke; Berlin; Occurrence: 1 + 4; Occurrence: 0 + 1; Bernd Lucke; Occurrence: 0 + 4; AfD; Occurrence: 0 + 5; Berlin; Klaus Wowereit; Occurrence: 1 + 3; Occurrence: 1 + 4
  41. Live Demo
  42. Structure
      – Motivation, Problems, and Goals
      – Topic/S Workflow
      – Demo
      – Current and Upcoming Task
          • User Interfaces
          • Disambiguation
      – Conclusion
  43. Static User Interface
  44. Dynamic User Interface
  46. Disambiguation (Sources: fansshare.com, lounge.espdisk.com, de.wikipedia.org)
  47. Disambiguation
      – Problem: not all SemItems available in the LOD
      – Diagram labels: Michael Jackson, Beer; Michael Jackson, Beer, Whiskey; Michael Jackson, Music, King of Pop; Internal Facts; External Facts (DBpedia, etc.); Identification of Entity Cluster
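One way to read the "Identification of Entity Cluster" idea is to pick, among same-named candidates, the one whose known context overlaps most with the SemItems found in the article. The sketch below illustrates that reading only; the candidate IDs, the context sets (built from the diagram labels), and the function `disambiguate` are hypothetical and not taken from the Topic/S implementation.

```python
# Hypothetical candidates for the surface form "Michael Jackson"; the context
# terms reuse the labels on the slide, the IDs are invented for illustration.
CANDIDATES = {
    "michael_jackson_beer": {"Beer", "Whiskey"},
    "michael_jackson_music": {"Music", "King of Pop"},
}


def disambiguate(article_items: set, candidates: dict):
    """Return the candidate ID with the largest context overlap, or None.

    Ties are resolved in favour of the candidate listed first (assumption).
    """
    scores = {cid: len(article_items & context)
              for cid, context in candidates.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, an article tagged with "Michael Jackson" and "Beer" resolves to the first candidate, while an article with no overlapping context stays unresolved rather than being guessed.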
  49. Sum it up!
      – Result: identifying topics and pushing them to the editor
      – Lessons learned
          • NER: bad for non-English, combination required
          • Model needs to be optimized for queries
          • Dedicated user interface required
      – Outlook: prediction of topics with causal/temporal relations
      – Sources: ooltapulta.com, business-strategy-innovation.com
  50. Sächsische AufbauBank, Research and Development Project Funding, project number 99457/2677. Thanks! Questions?
