Gathering Alternative Surface Forms for DBpedia Entities

Wikipedia is often used as a source of surface forms, i.e. alternative reference strings for an entity, which are required for entity linking, disambiguation and coreference resolution tasks. Surface forms have been extracted in a number of works from Wikipedia labels, redirects, disambiguations and anchor texts of internal Wikipedia links; we complement these with anchor texts of external links to Wikipedia from the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for evaluating dataset quality, which reveals the surprisingly low precision of the Wikipedia-based surface forms. We propose filtering approaches that boost the precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for a subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.


  1. Gathering Alternative Surface Forms for DBpedia Entities
     Volha Bryl (University of Mannheim, Germany → Springer Nature)
     Christian Bizer, Heiko Paulheim (University of Mannheim, Germany)
     NLP & DBpedia @ ISWC 2015, Bethlehem, USA, October 11, 2015
  2. Why you need Surface Forms
     • The surface form (SF) of an entity is a collection of strings it can be referred to as: synonyms, alternative names, etc.
     • Used to support many NLP tasks: co-reference resolution, entity linking, disambiguation
     Surface Forms for DBpedia Entities, Bryl, Bizer, Paulheim
  3. Why you need Surface Forms
     • The surface form (SF) of an entity is a collection of strings it can be referred to as: synonyms, alternative names, etc.
     • Used to support many NLP tasks: co-reference resolution, entity linking, disambiguation
     "Billionaire Elon Musk has spelled out how he plans to create temporary suns over Mars in order to heat the Red Planet. Dismissing earlier comments that he intended to nuke the planet's surface, he says he wants to create aerial explosions to heat it up."
     To link the three entities, your machine should know that "red planet" is an alternative name for Mars, and that Mars can be referred to just by its "type": planet
  4. Surface Forms from Wiki(DB)pedia
     • Some of Wikipedia's (hence, DBpedia's) crowd-sourced content looks quite like surface forms
       • Page titles
       • Redirects
         • Account for alternative names, word forms (e.g. plurals), closely related words, abbreviations, alternative spellings, likely misspellings, subtopics
       • Disambiguation pages
         • There are 10+ Bethlehems in the US, according to https://en.wikipedia.org/wiki/Bethlehem_(disambiguation)
       • Anchor texts of links between wiki pages
         Named after the Roman god of war, it is often referred to as the "Red Planet"...
         Source: Named after the [[Mars (mythology)|Roman god of war]], it is often referred to as the "Red Planet"
     • …additionally, we use anchor texts of links from external pages to Wikipedia
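Harvesting surface forms from the anchor texts of internal wiki links, as in the Mars example above, can be sketched with a few lines of Python. This is an illustrative regex-based parser, not the actual DBpedia extractor; it ignores nested templates and malformed markup:

```python
import re

# Internal wiki links look like [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_anchor_surface_forms(wikitext):
    """Return (target page, surface form) pairs from raw wiki markup.

    When no anchor text is given ([[Mars]]), the page title itself
    serves as the surface form.
    """
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        anchor = (match.group(2) or target).strip()
        pairs.append((target, anchor))
    return pairs

sentence = ('Named after the [[Mars (mythology)|Roman god of war]], '
            'it is often referred to as the "Red Planet".')
print(extract_anchor_surface_forms(sentence))
# [('Mars (mythology)', 'Roman god of war')]
```

Aggregating such pairs over a full Wikipedia dump yields the anchor-text (WAT) portion of the dataset, with link counts kept as frequency information.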
  5. Surface Forms from Wiki(DB)pedia
     • Not a new idea
       • BabelNet, DBpedia Spotlight, … [see our paper for more links]
     (screenshot: Mars in BabelNet)
  6. Surface Forms from Wiki(DB)pedia
     • Not a new idea
       • BabelNet, DBpedia Spotlight, … [see our paper for more links]
     • Problem: Quality
       • …it is not only that quality is a problem, it is also that it has never been assessed or addressed
       • Reason 1: good quality of Wikipedia content is taken for granted
       • Reason 2: hopes are that NLP algorithms won't be influenced by noise
  7. Surface Forms from Wiki(DB)pedia
     • Not a new idea
       • BabelNet, DBpedia Spotlight, … [see our paper for more links]
     • Problem: Quality – Why?
       • By adding a redirect or an anchor text of an internal Wikipedia link, a Wikipedia editor might mean not only same as or also known as, but also related to, contains, etc.
       • Both variants serve the purpose of pointing to the correct wiki page
  8. Solution: Focus on Quality
     • Step 1: Extract
       • We extract SFs from Wikipedia labels, redirects, disambiguations, and anchor texts of internal wiki-links
     • Step 2: Evaluate
       • We create a gold standard to evaluate SF quality
     • Step 3: Filter
       • We implement three filters to improve SF quality
     • Bonus: More SFs
       • We extract SFs from anchor texts of Wiki links found in the Common Crawl 2014 corpus
     • All datasets are available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/
  9. SFs Dataset Statistics
     • LRD = Labels, Redirects, Disambiguations
       • Extracted from DBpedia dumps
     • WAT = Wikipedia Anchor Texts
       • Extracted by a new DBpedia extractor (based on PageLinksExtractor)
  10. Gold Standard
     • Manual annotation, 1 annotator, 2 subsets
     • Popular subset: manually selected 34 popular entities of different types
       • Denmark, Berlin, Apple Inc., Animal Farm, Michael Jackson, Star Wars, Diego Maradona, Mars, etc.
       • ~82 SFs per entity, linked from other Wiki pages 813,736 times
     • Random subset: randomly selected 81 entities, each having at least 5 SFs
       • Andy_Zaltzman, Bell AH-1 SuperCobra, Biarritz, Castellum, Firefox (film), Kipchak languages, ParisTech, Psychokinesis, etc.
       • ~13 SFs per entity, linked from other Wiki pages 14,760 times
     Available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/gold/
  11. Gold Standard
     • Types of annotations
       • correct ("the eternal city" for Rome)
       • contained ("Google Japan" for Google), contains ("Turkey" for Istanbul)
       • type of ("the city" for Rome)
       • partial ("Diego" for Diego Maradona)
       • related ("Google Blog" for Google)
       • wrong ("during World War I" for United States)
  12. Evaluation: How many correct SFs?
     • SFs extracted from labels, redirects, disambiguations
       • correct, popular subset: 66.8%
       • correct, random subset: 86.6%
     • SFs extracted from Wikipedia anchor texts
       • correct, popular subset: 38.5%
       • correct, random subset: 70.7%
     • Combined dataset
       • correct, popular subset: 45.7%
       • correct, random subset: 75%
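The per-subset percentages above are plain precision over the gold-standard annotations: the fraction of SFs labelled correct. A minimal sketch, using the label names from the gold standard but entries invented for illustration:

```python
def precision(annotations):
    """Fraction of surface forms annotated as 'correct'.

    annotations maps each surface form to one of the gold-standard
    labels: correct, contained, contains, type of, partial,
    related, wrong.
    """
    if not annotations:
        return 0.0
    correct = sum(1 for label in annotations.values() if label == "correct")
    return correct / len(annotations)

gold = {
    "the eternal city": "correct",    # for Rome
    "the city": "type of",            # for Rome
    "Google Japan": "contained",      # for Google
    "during World War I": "wrong",    # for United States
}
print(precision(gold))  # 0.25
```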
  13. (1) Filtering: String Patterns
     • Data analysis → there are patterns that wrong SFs follow
       • URLs: contain .com or .net ("Berlin-china.net" for Berlin)
       • of-phrases, with exceptions for city of, state of, and the like ("Issues of Toronto" for Toronto)
       • in-phrases ("Historical sites in Berlin" for Berlin)
       • and-phrases ("Tom Cruise and Katie Holmes" for Tom Cruise)
       • list-of ("List of Toronto MPs and MPPs" for Toronto)
     • Increase in precision
       • popular subset: 1.33%
       • popular subset, LRD only: 3.75%
       • random subset: less than 1%
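The pattern families above can be approximated as follows. This is a hedged re-implementation: the slide only names the families, so the exact regexes and the of-phrase exception list are assumptions:

```python
import re

# Prefixes that make an of-phrase acceptable; the authors' exact
# exception list is not given, so this one is illustrative.
OF_EXCEPTIONS = ("city of", "state of", "republic of", "isle of")

NOISE_PATTERNS = [
    re.compile(r"\.(com|net)\b"),   # URLs: "Berlin-china.net"
    re.compile(r"\b(in|and)\b"),    # in-phrases, and-phrases
    re.compile(r"^list of\b"),      # "List of Toronto MPs and MPPs"
]

def is_noisy(surface_form):
    """Heuristically flag a surface form matching a noise pattern."""
    sf = surface_form.lower()
    if " of " in sf and not sf.startswith(OF_EXCEPTIONS):
        return True  # of-phrase without a whitelisted prefix
    return any(p.search(sf) for p in NOISE_PATTERNS)

print(is_noisy("Issues of Toronto"), is_noisy("City of London"))
# True False
```

Note the trade-off: a blunt `\b(in|and)\b` filter also discards legitimate SFs containing those words, which is consistent with the small precision gains reported.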
  14. (2) Filtering: Wikidata
     • Observation: some SFs are entities of their own in other languages
       • E.g. "Neckarau", a city area of Mannheim, redirects to Mannheim in the English Wikipedia but has its own page in the German Wikipedia
     • Implementation: use the DBpedia-Wikidata dumps, released in May 2015
       • Check whether an SF exactly matches or is close (Levenshtein distance) to any of the labels of Wikidata entities that do not have an English but have other Wikipedia pages
     • Increase in precision
       • 0.5% compared to pattern-based filtering
       • 1.5% for SFs extracted only from LRD
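The Wikidata check can be sketched as below. `foreign_labels` is a hypothetical pre-built set of labels of Wikidata entities that have non-English but no English Wikipedia pages, and the edit-distance threshold of 1 is an assumption; the slide only says Levenshtein distance is used:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def is_own_entity(surface_form, foreign_labels, max_dist=1):
    """True if the SF (nearly) matches the label of a Wikidata entity
    that lacks an English Wikipedia page but has pages in other
    languages; such SFs are filtered out."""
    sf = surface_form.lower()
    return any(levenshtein(sf, label.lower()) <= max_dist
               for label in foreign_labels)

print(is_own_entity("Neckarau", {"Neckarau", "Sandhofen"}))  # True
```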
  15. (3) Filtering: Frequency Scores
     • For SFs extracted from anchor texts, frequencies are available → TF-IDF scores
     • Determining the threshold: values 1.0 .. 8.0 with a step of 0.2 evaluated
       • Two thresholds selected, with the highest values of F1: 1.8 and 2.6
       • Threshold 0 (no filtering) used as a baseline
     • Increase in precision*
       • 20% for the popular subset, 10% for the random subset
     * Filtering done on the dataset to which the pattern- and Wikidata-based filters are already applied
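A sketch of the frequency-based filter. The slide does not give the exact TF-IDF weighting, so the formula here is one common variant, stated as an assumption; only the thresholds (1.8 and 2.6, with 0 as the unfiltered baseline) come from the paper:

```python
import math

def tf_idf(sf_link_count, entity_link_total, sf_entity_count, n_entities):
    """Score a (surface form, entity) pair from anchor-text counts.

    tf: how often this SF links to this entity, relative to all links
    pointing at the entity; idf: penalizes SFs that link to many
    distinct entities. The weighting is an illustrative assumption.
    """
    tf = sf_link_count / entity_link_total
    idf = math.log(n_entities / (1 + sf_entity_count))
    return tf * idf

def filter_by_score(scored_sfs, threshold=1.8):
    """Keep surface forms whose score meets the threshold;
    threshold 0 reproduces the unfiltered baseline."""
    return {sf for sf, score in scored_sfs.items() if score >= threshold}

scores = {"red planet": 3.1, "during world war i": 0.4}
print(filter_by_score(scores))  # {'red planet'}
```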
  16. SFs from Common Crawl
     • Common Crawl (CC) is the largest publicly available web corpus
     • Extraction done on the Winter 2014 CC corpus, in the context of the Web Data Commons project
       • http://webdatacommons.org/ extracts and provides for public download various types of structured data from CC
     • Data required a lot of cleaning
     • 3M SFs added to our LRD&WAT corpus
     • No annotated gold standard: left for future work
     • Available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/lrd-cc/
  17. Conclusion and Future Work
     • Main message
       • The quality of Wikipedia-based surface forms is often overlooked!
     • Contributions
       • Gold standard SFs, made available
       • 3 filtering strategies: precision improved by > 20% for popular Wikipedia entities and by > 10% for random entities
       • Extracted SFs from Common Crawl corpus
       • All data publicly available
     • Future work directions
       • Task-based evaluation of the resource, further work on the gold standard
