Anzeige

Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

Professor for Data Science um University of Mannheim
10. Dec 2019
Anzeige

Más contenido relacionado

Presentaciones para ti(20)

Similar a Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block(20)

Anzeige
Anzeige

Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block

  1. 12/10/2019 Heiko Paulheim 1 Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim
  2. 12/10/2019 Heiko Paulheim 2 The New Kids on the (Knowledge Graph) Block Subjective age: Measured by the fraction of the audience that understands a reference to your young days’ pop culture...
  3. 12/10/2019 Heiko Paulheim 3 Knowledge Graphs: Out of the Dark Google’s Announcement DBpedia YAGO ResearchCyc WikidataFreebase NELL
  4. 12/10/2019 Heiko Paulheim 4 A Brief History of Knowledge Graphs • The 1980s: Cyc – Cyc: Encyclopedic collection of knowledge – Started by Douglas Lenat in 1984 – Estimation: 350 person years and 250,000 rules should do the job of collecting the essence of the world’s knowledge • The present (as of June 2017) – ~1,000 person years, $120M total development cost – 21M axioms and rules
  5. 12/10/2019 Heiko Paulheim 5 A Brief History of Knowledge Graphs • The 2010s – DBpedia: launched 2007 – YAGO: launched 2008 – Extraction from Wikipedia using mappings & heuristics • Present – Two of the most used knowledge graphs – ...with Wikidata catching up
  6. 12/10/2019 Heiko Paulheim 6 Getting the Most out of Wikipedia • Study for KG-based Recommender Systems* – DBpedia has a coverage of • 85% for movies • 63% for music artists • 31% for books *) Di Noia, et al.: SPRank: Semantic Path-based Ranking for Top-n Recommendations using Linked Open Data. In: ACM TIST, 2016 https://grouplens.org/datasets/
  7. 12/10/2019 Heiko Paulheim 7 Combining the Best of Three Worlds • DBpedia: detailed instance / relation extraction • YAGO: detailed classes • Cyc: rich axiomatization • Goal: get the best of all those worlds public private Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8:3 (2017), pp. 489-508
  8. 12/10/2019 Heiko Paulheim 8 Towards CaLiGraph • YAGO uses categories for types – e.g., Category:American Industrial Groups – but does not analyze them further • :NineInchNails a :AmericanIndustrialGroup – “Things, not Strings”? • :NineInchNails a :MusicalGroup ; hometown :United_States ; genre :Industrial .
  9. 12/10/2019 Heiko Paulheim 9 Cat2Ax: Axiomatizing Wikipedia Categories Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...  dbo:Album  dbo:artist.{dbr:Nine_Inch_Nails}  dbo:genre.{dbr:Rock_Music}
  10. 12/10/2019 Heiko Paulheim 10 Cat2Ax: Axiomatizing Wikipedia Categories Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...  dbo:genre.{dbr:Rock_Music} ?  dbo:artist.{dbr:Rock_(Rapper)} ?
  11. 12/10/2019 Heiko Paulheim 11 Cat2Ax: Axiomatizing Wikipedia Categories – Frequency: how often does the pattern occur in a category? • i.e.: share of instances that have dbo:genre.{dbr.Rock_Music}? – Lexical score: likelihood of term as a surface form of object • i.e.: how often is Rock used to refer to dbr:Rock_Music? – Sibling score: how likely are sibling categories sharing similar patterns? • i.e., are there sibling categories with a high score for dbo:genre? Albums Albums  by genre  Albums  by artist  Nine Inch Nails  albums  The Doors  albums  Rock  albums  Pop  albums  Reggae  albums  The Beatles albums  ... ... ...
  12. 12/10/2019 Heiko Paulheim 12 Cat2Ax: Axiomatizing Wikipedia Categories • Results
  13. 12/10/2019 Heiko Paulheim 13 Improving Instance Coverage: Lists in Wikipedia • Only existing pages have categories – Lists may also link to non-existing pages
  14. 12/10/2019 Heiko Paulheim 14 CaLiGraph Example Category: Musical Groups established in 1987 List of symphonic metal bands Category: Swedish death metal bands List of Swedes in Music
  15. 12/10/2019 Heiko Paulheim 15 CaLiGraph Statistics • Class hierarchy from categories and list pages – 750k classes • Axioms from categories and list pages – 200k definitions • Instances from list pages – 870k new instances
  16. 12/10/2019 Heiko Paulheim 16 CaLiGraph Glitches
  17. 12/10/2019 Heiko Paulheim 17 The Future of CaLiGraph • Going beyond red links • Going beyond explicit lists
  18. 12/10/2019 Heiko Paulheim 18 Knowledge Graph Creation Beyond Wikipedia
  19. 12/10/2019 Heiko Paulheim 19 A Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Wikipedia Dump (+ mappings) • Output: – DBpedia DBpedia Extraction Framework
  20. 12/10/2019 Heiko Paulheim 20 An Even Higher Bird’s Eye View on DBpedia EF • DBpedia Extraction Framework • Input: – A Media Wiki Dump (+ mappings) • Output: – A Knowledge Graph DBpedia Extraction Framework
  21. 12/10/2019 Heiko Paulheim 21 What if…? • What if we went from Wikipedia every MediaWiki? • According to WikiApiary, there’s thousands...
  22. 12/10/2019 Heiko Paulheim 22 Why? • More is better (maybe)
  23. 12/10/2019 Heiko Paulheim 23 Why? • Overcoming Wikipedia’s coverage bias
  24. 12/10/2019 Heiko Paulheim 24 A Brief History of DBkWik • Started as a student project in 2017 • Task: run DBpedia EF on a large Wiki Farm – ...and see what happens
  25. 12/10/2019 Heiko Paulheim 25 DBkWik vs. DBpedia • Challenges – Getting dumps: only a fraction of Fandom Wikis has dumps – Downloadable from Fandom: 12,840 dumps – Tried: auto-requesting dumps
  26. 12/10/2019 Heiko Paulheim 26 Obtaining Dumps • We had to change our strategy: WikiTeam software – Produces dumps by crawling Wikis – Fandom has not blocked us so far :-) – Current collection: 307,466 Wikis → will go into DBkWik 1.2 release
  27. 12/10/2019 Heiko Paulheim 27 DBkWik vs. DBpedia • Mappings do not exist – no central ontology – i.e., only raw extraction possible • Duplicates exist – origin: pages about the same entity in different Wikis – unlike Wikipedia: often not explicitly linked • Different configurations of MediaWiki
  28. 12/10/2019 Heiko Paulheim 28 Absence of Mappings and Ontology • Every infobox becomes a class: {infobox actor → mywiki:actor a owl:Class • Every infobox key becomes a property |role = Harry’s mother → mywiki:role a rdf:Property • The resulting ontology is very shallow – No class hierarchy – No distinction of object and data properties – No domains and ranges
  29. 12/10/2019 Heiko Paulheim 29 Duplicates • Collecting Data from a Multitude of Wikis
  30. 12/10/2019 Heiko Paulheim 30 Representational Variety • No conventions across Wikis (besides using MediaWiki syntax) {{Person |name = Trent Reznor |image = TrentReznor.jpg |caption - Reznor at the [[83rd Academy Awards]] |nominations = 1 |wins = 1 |role = Composer |birthdate = May 17, 1965 |birthloc = Mercer, Pennsylvania, USA}} {{Infobox musician | Name = Trent Reznor | Birth_name = Michael Trent Reznor | Born = May 17, [[1965]] (age 53) | Origin = [[Mercer]], [[Pennsylvania]], [[United States]] ... }} {{Infobox cast |Name=Trent Reznor |Image= |ImageCaption= |character= |crew= |Born={{d|May|17|1965}}{{-}}New Castle, Pennsylvania, United States ... }
  31. 12/10/2019 Heiko Paulheim 31 Data Fusion
  32. 12/10/2019 Heiko Paulheim 32 Naive Data Fusion and Linking to DBpedia • String similarity for schema matching (classes/properties) • doc2vec similarity on original pages for instance matching • Results – Classes and properties work OK – Instances are trickier – Internal linking seems easier F1 score... Internal Linking Linking to DBpedia Classes .979 .898 Properties .836 .865 Instances .879 .657 maybe...
  33. 12/10/2019 Heiko Paulheim 33 Improving Linking and Fusion • Started a new track at OAEI in 2018 – annual benchmark for matching tools • In 2019, some tools beat the baseline – albeit by a small margin only
  34. 12/10/2019 Heiko Paulheim 34 Results Data Fusion • Uneven distribution – e.g., character appears 5k times • Currently: no multi-linguality – e.g., Main Page, Hauptseite • Probably overloaded fusion (false positives) – e.g., next, location
  35. 12/10/2019 Heiko Paulheim 35 Light-weight Schema Induction • Class hierarchy and domain/range induction – Using association rule mining • e.g., Artist(x) → Person(x) – 5k class subsumption axioms – 59k domain restrictions – 114k range restrictions • Instance typing – With a light-weight version of SDType – Using the learned ranges as approximations of actual distributions • Result: ~100k new instance types Person? Artist Person
  36. 12/10/2019 Heiko Paulheim 36 Big Picture Dump Downloader DBpedia Extraction Framework Interlinking Instance Matcher Schema Matcher MediaWiki Dumps Extracted RDF Internal Linking Instance Matcher Schema Matcher Consolidated Knowledge Graph DBkWik Linked Data Endpoint Ontology Knowledge Graph Fusion Instance Matcher Domain/ Range Type SDType Light SubclassMaterialization
  37. 12/10/2019 Heiko Paulheim 37 DBkWik 1.1 • Source: ~15k Wiki dumps from Fandom – 52.4GB of data (roughly the size of the English Wikipedia) Raw Final Instances 14,212,535 11,163,719 Typed instances 1,880,189 1,372,971 Triples 107,833,322 91,526,001 Avg. indegree 0.624 0.703 Avg. outdegree 7.506 8.169 Classes 71,580 12,029 Properties 506,487 128,566
  38. 12/10/2019 Heiko Paulheim 38 DBkWik 1.1 • Fused graphs from 15k Wikis http://dbkwik.webdatacommons.org/
  39. 12/10/2019 Heiko Paulheim 39 DBkWik 1.1 vs. other Knowledge Graphs • Caveat: – Minus non-recognized duplicates!
  40. 12/10/2019 Heiko Paulheim 40 DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and Dbpedia? • Challenge: – We only have an incomplete and partly correct mapping M – But: we know its precision P and recall R • Trick (see KI paper 2017): – O is the actual overlap (unknown), T  M is the true part of M (unknown) • By definition: – P = |T| / |M| → |T| = P * |M| – R = |T| / |O| → |T| = R * |O| → |O| = |M| * P / R DBkWik DBpedia
  41. 12/10/2019 Heiko Paulheim 41 DBkWik 1.1 vs. DBpedia • How complementary are DBkWik and Dbpedia? – |O| = |M| * P / R – Overlap: ~500k instances • In other words: – 95% of all entities in DBkWik are not in DBpedia – 90% of all entities in DBpedia are not in DBkWik DBkWik DBpedia
  42. 12/10/2019 Heiko Paulheim 42 NewNif Extractor Towards DBkWik 1.2 • Current crawl: 307,466 Wikis • Extraction: more robust for non-infobox templates – e.g., LyricWiki: 1.7M songs, 177k albums, 84k artists • Robust abstract extraction – using SWEBLE parser – no local MediaWiki instance • Better matching • New gold standard Source Simple WikiParser LinkExtractor Page NifExtractor AST Destination Graph HTML
  43. 12/10/2019 Heiko Paulheim 43 Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  44. 12/10/2019 Heiko Paulheim 44 Towards DBkWik 1.2 • What to expect? – data from 307,466 wikis – 38,985,266 articles
  45. 12/10/2019 Heiko Paulheim 45 Further Open Challenges • More detailed profiling of knowledge graphs – e.g., do we reduce or increase bias? • Task-based downstream evaluations – Does it improve, e.g., recommender systems? • Fusion policies – e.g., identify outdated information
  46. 12/10/2019 Heiko Paulheim 46 Contributors • Contributors (past&present) Sven Hertling Alexandra Hofmann Samresh Perchani Jan Portisch Nicolas Heist
  47. 12/10/2019 Heiko Paulheim 47 Beyond DBpedia and YAGO – The New Kids on the Knowledge Graph Block Heiko Paulheim
Anzeige