Wikipedia is often used as a source of surface forms, i.e. alternative reference strings for an entity, which are required for entity linking, disambiguation and coreference resolution tasks. Surface forms have been extracted in a number of works from Wikipedia labels, redirects, disambiguations and anchor texts of internal Wikipedia links, which we complement with anchor texts of links to Wikipedia found in the Common Crawl web corpus. We tackle the problem of the quality of Wikipedia-based surface forms, which has not been raised before. We create a gold standard for evaluating the dataset quality, which reveals the surprisingly low precision of Wikipedia-based surface forms. We propose filtering approaches that boost the precision from 75% to 85% for a random entity subset, and from 45% to more than 65% for a subset of popular entities. The filtered surface form dataset as well as the gold standard are made publicly available.
Gathering Alternative Surface Forms for DBpedia Entities
1. Gathering Alternative Surface Forms for DBpedia Entities
Volha Bryl
University of Mannheim, Germany / Springer Nature
Christian Bizer, Heiko Paulheim
University of Mannheim, Germany
NLP & DBpedia @ ISWC 2015, Bethlehem, USA, October 11, 2015
2. Why you need Surface Forms
• Surface form (SF) of an entity is a collection of strings it can be referred to as: synonyms, alternative names, etc.
• Used to support many NLP tasks: co-reference resolution, entity linking, disambiguation
3. Why you need Surface Forms
• Surface form (SF) of an entity is a collection of strings it can be referred to as: synonyms, alternative names, etc.
• Used to support many NLP tasks: co-reference resolution, entity linking, disambiguation
“Billionaire Elon Musk has spelled out how he plans to
create temporary suns over Mars in order to heat the
Red Planet. Dismissing earlier comments that he
intended to nuke the planet’s surface, he says he wants
to create aerial explosions to heat it up. ”
--- to link the three entities, your machine should know that “Red Planet” is an alternative name for Mars, and that Mars can be referred to just by its “type”: planet
4. Surface Forms from Wiki(DB)pedia
• Some of Wikipedia’s (hence, DBpedia’s) crowd-sourced content looks quite like surface forms
• Page titles
• Redirects
• Account for alternative names, word forms (e.g. plurals), closely related words,
abbreviations, alternative spellings, likely misspellings, subtopics
• Disambiguation pages
• There are 10+ Bethlehems in the US, according to
https://en.wikipedia.org/wiki/Bethlehem_(disambiguation)
• Anchor texts of links between wiki pages
Named after the Roman god of war, it is often referred to as the “Red
Planet”...
Source: Named after the [[Mars (mythology)|Roman god of war]], it is often referred to as the "Red Planet" (a minimal extraction sketch is given below)
• …additionally, we use anchor texts of links from external pages to Wikipedia
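For illustration, a minimal sketch (Python, not the actual DBpedia extraction framework code) of pulling (target page, anchor text) pairs out of wiki markup like the source line above; real page-link extraction handles many more cases (templates, nested links, section links):

```python
import re

# Internal wiki-links: [[Target]] or [[Target|anchor text]]
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def anchor_surface_forms(wiki_markup):
    """Yield (target page, surface form) pairs from raw wiki markup."""
    for m in WIKILINK.finditer(wiki_markup):
        target = m.group(1).strip()
        # Without an explicit anchor text, the page title itself is displayed
        surface = (m.group(2) or target).strip()
        yield target, surface

source = ('Named after the [[Mars (mythology)|Roman god of war]], '
          'it is often referred to as the "Red Planet"')
print(list(anchor_surface_forms(source)))
# [('Mars (mythology)', 'Roman god of war')]
```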
5. Surface Forms from Wiki(DB)pedia
• Not a new idea
• BabelNet, DBpedia Spotlight, … [see our paper for more links]
[Image: Mars in BabelNet]
6. Surface Forms from Wiki(DB)pedia
• Not a new idea
• BabelNet, DBpedia Spotlight, … [see our paper for more links]
• Problem: Quality
• …it is not only that quality is a problem, it is also that it has never been assessed or addressed
• Reason 1: good quality of Wikipedia content is taken for granted
• Reason 2: hopes are that NLP algorithms won’t be influenced by noise
[Image: Mars in BabelNet]
7. Surface Forms from Wiki(DB)pedia
• Not a new idea
• BabelNet, DBpedia Spotlight, … [see our paper for more links]
• Problem: Quality – Why?
• By adding a redirect or an anchor text of an internal Wikipedia link, a Wikipedia editor might mean not only same as or also known as, but also related to, contains, etc.
• Both variants serve the purpose of pointing to the correct wiki page
[Image: Mars in BabelNet]
8. Solution: Focus on Quality
• Step 1: Extract
• We extract SFs from Wikipedia labels, redirects, disambiguations, and anchor
texts of internal wiki-links
• Step 2: Evaluate
• We create a gold standard to evaluate the SFs quality
• Step 3: Filter
• We implement three filters to improve SFs quality
• Bonus: More SFs
• We extract SFs from anchor texts of Wiki links found in the Common Crawl
2014 corpus
• All datasets are available at
http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/
9. SFs Dataset Statistics
• LRD = Labels, Redirects, Disambiguations
• Extracted from DBpedia dumps
• WAT = Wikipedia Anchor Texts
• Extracted by a new DBpedia extractor (based on PageLinksExtractor)
10. Gold Standard
• Manual annotation, 1 annotator, 2 subsets
• Popular subset: manually selected 34 popular entities of different types
• Denmark, Berlin, Apple Inc., Animal Farm, Michael Jackson, Star Wars, Diego
Maradona, Mars, etc.
• ~82 SFs per entity, linked from other Wiki pages 813,736 times
• Random subset: randomly selected 81 entities each having at least 5 SFs
• Andy_Zaltzman, Bell AH-1 SuperCobra, Biarritz, Castellum, Firefox (film), Kipchak
languages, ParisTech, Psychokinesis, etc.
• ~13 SFs per entity, linked from other Wiki pages 14,760 times
Available at http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/gold/
11. Gold Standard
• Type of annotations
• correct (“the eternal city” for Rome)
• contained (“Google Japan” for Google), contains (“Turkey” for Istanbul)
• type of (“the city” for Rome)
• partial (“Diego” for Diego Maradona)
• related (“Google Blog” for Google)
• wrong (“during World War I” for United States)
12. Evaluation: How many correct SFs?
• SFs extracted from labels, redirects, disambiguations
• correct, popular subset: 66.8%
• correct, random subset: 86.6%
• SFs extracted from Wikipedia anchor texts
• correct, popular subset: 38.5%
• correct, random subset: 70.7%
• Combined dataset
• correct, popular subset: 45.7%
• correct, random subset: 75%
13. (1) Filtering: String Patterns
• Data analysis shows there are patterns that wrong SFs follow (a rough filter sketch is given below)
• URLs: contain .com or .net (“Berlin-china.net” for Berlin)
• of-phrases, with the exceptions for city of, state of, and the like (“Issues of
Toronto” for Toronto)
• in-phrases (“Historical sites in Berlin” for Berlin)
• and-phrases (“Tom Cruise and Katie Holmes” for Tom Cruise)
• list-of (“List of Toronto MPs and MPPs” for Toronto)
• Increase in precision
• popular subset: 1.33%
• popular subset, LRD only: 3.75%
• random subset: less than 1%
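A rough re-implementation of the string-pattern filter described on this slide (Python); the exact pattern list, and the exceptions for of-phrases beyond “city of” / “state of”, are assumptions and may differ from the filter actually used:

```python
import re

URL_PATTERN = re.compile(r"\.(com|net)\b", re.IGNORECASE)   # URLs, e.g. "Berlin-china.net"
LIST_OF     = re.compile(r"^list of\b", re.IGNORECASE)      # "List of Toronto MPs and MPPs"
OF_EXCEPTIONS = ("city of", "state of")   # "and the like"; the real exception list is longer

def keep_surface_form(sf):
    """Return False for surface forms matching a likely-wrong pattern."""
    s = sf.lower()
    if URL_PATTERN.search(s) or LIST_OF.match(s):
        return False
    if " of " in s and not s.startswith(OF_EXCEPTIONS):
        return False   # of-phrases, e.g. "Issues of Toronto"
    if " in " in s:
        return False   # in-phrases, e.g. "Historical sites in Berlin"
    if " and " in s:
        return False   # and-phrases, e.g. "Tom Cruise and Katie Holmes"
    return True

print(keep_surface_form("Berlin-china.net"))             # False
print(keep_surface_form("City of Toronto"))              # True (exception applies)
print(keep_surface_form("Tom Cruise and Katie Holmes"))  # False
```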
14. (2) Filtering: Wikidata
• Observation: some SFs are entities on their own in other languages
• E.g. “Neckarau”, a city district of Mannheim, redirects to Mannheim in the English Wikipedia, but has its own page in the German Wikipedia
• Implementation: use the DBpedia-Wikidata dumps, released in May 2015
• Check whether a SF exactly matches or is close (Levenshtein distance) to any of the labels of Wikidata entities that have no English Wikipedia page but do have pages in other language editions (a rough sketch of this check is given below)
• Increase in precision
• 0.5% compared to pattern-based filtering
• 1.5% for SF extracted only from LRD
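A hedged sketch of the Wikidata-based check (Python). Here `non_english_labels` stands in for a hypothetical pre-built set of labels of Wikidata entities that have non-English Wikipedia pages only, and the edit-distance threshold of 1 is an assumption, not a value reported on the slide:

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def is_foreign_entity_label(sf, non_english_labels, max_dist=1):
    """True if the SF (almost) matches a label of an entity without an English page."""
    sf_l = sf.lower()
    return any(levenshtein(sf_l, label.lower()) <= max_dist
               for label in non_english_labels)

# SFs for which this returns True are filtered out
print(is_foreign_entity_label("Neckarau", {"Neckarau"}))   # True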
15. (3) Filtering: Frequency Scores
• For SFs extracted from anchor texts, frequencies are available, from which TF-IDF scores are computed (a hedged sketch is given below)
• Determining the threshold: values from 1.0 to 8.0 with a step of 0.2 evaluated
• Two thresholds selected (highest values of F1): 1.8 and 2.6
• Threshold 0 (no filtering) used as baseline
• Increase in precision*
• 20% for popular subset, 10% for random subset
* Filtering done on the dataset to which pattern- and Wikidata-based filters are already applied
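A hedged sketch of the frequency-based filter (Python). The slide does not spell out the exact TF-IDF definition; the sketch assumes tf(sf, e) = how often sf is used to link to entity e and idf(sf) = log(N / number of entities sf links to), and reuses the 1.8 threshold reported above:

```python
import math
from collections import defaultdict

def tfidf_scores(link_counts):
    """link_counts: dict entity -> {surface form: number of links using it}."""
    n_entities = len(link_counts)
    df = defaultdict(int)                    # in how many entities each SF occurs
    for sfs in link_counts.values():
        for sf in sfs:
            df[sf] += 1
    return {(entity, sf): tf * math.log(n_entities / df[sf])
            for entity, sfs in link_counts.items()
            for sf, tf in sfs.items()}

def filter_by_threshold(scores, threshold=1.8):
    """Keep only (entity, surface form) pairs whose score reaches the threshold."""
    return {pair for pair, score in scores.items() if score >= threshold}
```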
16. SFs from Common Crawl
• Common Crawl (CC) is the largest publicly available web corpus
• Extraction done on the Winter 2014 CC corpus, in the context of the Web Data Commons project (a rough extraction sketch is given below)
• http://webdatacommons.org/ extracts various types of structured data from CC and provides them for public download
• Data required a lot of cleaning
• 3M SFs added to our LRD&WAT corpus
• No annotated gold standard: left for future work
• Available at
http://data.dws.informatik.uni-mannheim.de/dbpedia/nlp2014/lrd-cc/
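For illustration, a rough sketch (Python standard library) of harvesting (Wikipedia page, anchor text) pairs from external HTML pages; the actual pipeline runs over the Winter 2014 Common Crawl inside the Web Data Commons framework and involves far more cleaning than shown here:

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse

class WikiAnchorParser(HTMLParser):
    """Collect anchor texts of links pointing into the English Wikipedia."""
    def __init__(self):
        super().__init__()
        self.pairs, self._target, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            parsed = urlparse(href)
            if parsed.netloc.endswith("en.wikipedia.org") and parsed.path.startswith("/wiki/"):
                self._target = unquote(parsed.path[len("/wiki/"):])
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            self.pairs.append((self._target, "".join(self._text).strip()))
            self._target = None

parser = WikiAnchorParser()
parser.feed('<p>See <a href="https://en.wikipedia.org/wiki/Mars">the Red Planet</a>.</p>')
print(parser.pairs)   # [('Mars', 'the Red Planet')]
```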
17. Conclusion and Future Work
• Main message
• quality of Wikipedia-based surface forms is often overlooked!
• Contributions
• Gold standard SFs, made available
• 3 filtering strategies: precision improved by > 20% for popular Wikipedia entities and by > 10% for random entities
• Extracted SFs from Common Crawl corpus
• All data publicly available
• Future work directions
• Task-based evaluation of the resource, further work on the gold standard