Thanks to its wide coverage and general-purpose ontology, DBpedia is a prominent dataset in the Linked Open Data cloud. DBpedia's content is harvested from Wikipedia's infoboxes, based on manually created mappings. In this paper, we explore a promising source of knowledge for extending DBpedia: Wikipedia's list pages. We discuss how a combination of frequent pattern mining and natural language processing (NLP) methods can be leveraged to extend both the DBpedia ontology and the instance information in DBpedia. We provide an illustrative example to show the potential impact of our approach and discuss its main challenges.
Extending DBpedia with Wikipedia List Pages
1. Extending DBpedia with Wikipedia List Pages
10/22/13
Heiko Paulheim, Simone Paolo Ponzetto
2. Disclaimer
• This presentation shows an idea
– after all, it says “position paper”
– We don't know if it works!
– (but we are quite confident)
3. Lists in Wikipedia
• Wikipedia loves lists
• As of June 2013, there are almost 600,000 list pages
• Lists organize Wikipedia pages
– that correspond to DBpedia instances
• Example:
– List of African-American writers
5. Lists in Wikipedia
• Different types of lists
– simple bullet point lists
– broken bullet point lists (i.e., different sections)
  • sometimes, the sections are semantically meaningful
– tables
– ...
[Chart: distribution of list page types (Simple Bullet List, Broken Bullet List, Table, Other)]
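For the simplest case, a bullet point list, instance extraction can be sketched as follows. This is a minimal illustration, not the authors' extractor: it assumes the list page is available as raw wikitext, that every bullet line starts with `*`, and that the first wiki link on a bullet line is the listed entity. The function name `first_links` and the sample snippet are made up for illustration.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the link target without section anchor or label.
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def first_links(wikitext):
    """Return the first link target of every bullet line in the wikitext."""
    targets = []
    for line in wikitext.splitlines():
        # Only bullet lines are considered; headings and prose are skipped.
        if not line.lstrip().startswith("*"):
            continue
        match = LINK.search(line)
        if match:
            # DBpedia resource names use underscores instead of spaces.
            targets.append(match.group(1).strip().replace(" ", "_"))
    return targets


# Made-up wikitext snippet in the style of a list page.
sample = """== A ==
* [[Maya Angelou]], poet
* [[James Baldwin|Baldwin, James]]
Some prose with [[New York City]].
* no link here
"""
instances = first_links(sample)
```

Broken bullet lists and tables need more work (for tables, deciding which column holds the entity), which this sketch does not attempt.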
6. Lists in Wikipedia
• What information is in a list?
– the linked things have the same “type”
• The type can be a complex construct
– e.g., Writer ∩ ∀nationality.{United States} ∩ ∀ethnicity.{African American}
• Sometimes, lists contain additional information
– e.g., birth dates for persons
7. Extracting Information from Lists
• Goal:
– find the common characteristics of all things in the list
• Example: African-American writers
– all instances are writers (25% in DBpedia)
– all instances have nationality=United_States (12% in DBpedia)
– all instances have ethnicity=African_American (3% in DBpedia)
• Information in DBpedia is far from complete
– makes extraction difficult
– but: big potential to add information to DBpedia
8. Extracting Information from Lists
• Possible approach: finding characteristics with high TF-IDF
– TF: percentage of instances in the list that carry the characteristic
– IDF: 1 / (percentage of all DBpedia instances that carry the characteristic)
• Rationale: only going by frequency would rate owl:Thing the highest
• Example: African-American writers
– type=Writer: 0.608 (maximal across all possible classes)
– nationality=United_States: 0.277
– ethnicity=African_American: 0.127
• But:
– deathPlace=New_York_City: 0.157 :-(
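The TF and IDF definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say how the two factors are combined or normalized, so the plain product used here is an assumption and will not reproduce the scores quoted above. The function name `tf_idf_scores` and the toy data are hypothetical.

```python
from collections import Counter


def tf_idf_scores(list_members, all_instances):
    """Score each (property, value) characteristic found on a list's members.

    all_instances maps an instance name to its set of characteristics;
    list_members names the instances that appear on the list page.
    """
    n_list = len(list_members)
    n_all = len(all_instances)
    list_counts = Counter(
        c for name in list_members for c in all_instances[name]
    )
    all_counts = Counter(c for chars in all_instances.values() for c in chars)
    scores = {}
    for characteristic, count in list_counts.items():
        tf = count / n_list  # share of list members carrying it
        idf = 1.0 / (all_counts[characteristic] / n_all)  # 1 / global share
        scores[characteristic] = tf * idf  # assumption: plain product
    return scores


# Toy knowledge base: eight instances, three of them on the list.
thing = ("type", "owl:Thing")
writer = ("type", "Writer")
us = ("nationality", "United_States")
all_instances = {
    "A": {thing, writer, us},
    "B": {thing, writer},
    "C": {thing},
    "D": {thing, us},
    "E": {thing},
    "F": {thing},
    "G": {thing},
    "H": {thing},
}
scores = tf_idf_scores(["A", "B", "C"], all_instances)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In the toy data every list member is an owl:Thing, yet Writer ranks first because owl:Thing is just as frequent outside the list, which is the rationale given above for not going by frequency alone.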
9. Extracting Information from Lists
• Example: African-American writers
– ethnicity=African_American: 0.127
– deathPlace=New_York_City: 0.157
• Exploit further information from list page
– e.g., wiki:African_American is linked from the page, New_York_City is not
– e.g., analyze the list page title, for instance with DBpedia Spotlight
  • African_American is recognized as an entity
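One way to combine the statistical score with this textual evidence is to boost characteristics whose value is linked from the list page or recognized in its title. The sketch below is only an illustration: the multiplicative `boost` factor and the function name `rerank` are assumptions, and the set of title entities is passed in by hand rather than obtained from DBpedia Spotlight.

```python
def rerank(scores, page_links, title_entities, boost=2.0):
    """Boost characteristics whose value is supported by the list page.

    scores maps (property, value) to a statistical score; page_links and
    title_entities are sets of entity names found on the page and in its
    title. The boost factor is an arbitrary illustrative choice.
    """
    reranked = {}
    for (prop, value), score in scores.items():
        supported = value in page_links or value in title_entities
        reranked[(prop, value)] = score * boost if supported else score
    return reranked


# Scores taken from the slide; evidence sets follow its example.
scores = {
    ("ethnicity", "African_American"): 0.127,
    ("deathPlace", "New_York_City"): 0.157,
}
reranked = rerank(
    scores,
    page_links={"African_American"},
    title_entities={"African_American"},
)
```

With this hypothetical boost, ethnicity=African_American moves ahead of deathPlace=New_York_City; a real scoring function would need to be tuned and evaluated.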
10. Lists of Lists in Wikipedia
• Wikipedia also has ~600 lists of lists
– they organize lists
– they form a hierarchy
• E.g.:
– Lists of writers
– Lists of American writers
– List of African-American writers
11. From Lists of Lists to an Extended Ontology
• Idea:
– find corresponding “lists of ...” pages for DBpedia classes
– extend hierarchy
[Figure: the DBpedia ontology (owl:Thing, ..., Agent, ..., Person, ..., Artist, ..., Writer) is extended below Writer with American Writer and African-American Writer. Corresponding Wikipedia pages: Lists of Writers (Writer), Lists of American Writers (American Writer), List of African-American Writers (African-American Writer).]
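The step from a hierarchy of list pages to subclass axioms can be sketched as below. It assumes the hard part is already done, namely that each list page has been matched to a class name; the function name `derive_subclass_axioms`, the class names, and the dictionary layout are hypothetical.

```python
def derive_subclass_axioms(list_hierarchy, page_to_class):
    """Turn parent/child links between list pages into subclass triples.

    list_hierarchy maps a "lists of" page to the list pages it links to;
    page_to_class maps each page title to a (possibly new) class name.
    """
    axioms = []
    for parent_page, child_pages in list_hierarchy.items():
        for child_page in child_pages:
            axioms.append(
                (
                    page_to_class[child_page],
                    "rdfs:subClassOf",
                    page_to_class[parent_page],
                )
            )
    return axioms


# The example hierarchy from the slides; class names are illustrative.
list_hierarchy = {
    "Lists of writers": ["Lists of American writers"],
    "Lists of American writers": ["List of African-American writers"],
}
page_to_class = {
    "Lists of writers": "Writer",
    "Lists of American writers": "AmericanWriter",
    "List of African-American writers": "AfricanAmericanWriter",
}
axioms = derive_subclass_axioms(list_hierarchy, page_to_class)
```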
12. Potential of the Idea
• Assuming we extract everything correctly from List of African-American writers, we get
– 814 new type statements (only DBpedia ontology)
– 1409 new property assertions
– two entirely new instances
• ...and there are ~600,000 list pages
– extrapolation: we can roughly double the information in DBpedia
• Many list pages contain extra information
– e.g., birth places and birth dates of persons
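How new statements arise from a list can be illustrated with a small sketch: once a set of characteristics has been accepted for a list, every member that lacks one of them yields a new statement. The function name `missing_statements` and the toy instances are hypothetical, and the counts it produces have nothing to do with the figures quoted above.

```python
def missing_statements(list_instances, accepted):
    """Return the statements implied by the list but absent per instance.

    list_instances maps each list member to the set of (property, value)
    pairs already known for it; accepted holds the characteristics that
    were judged to hold for the whole list.
    """
    statements = []
    for name, known in list_instances.items():
        for prop, value in accepted:
            if (prop, value) not in known:
                statements.append((name, prop, value))
    return statements


# Toy list with two members and two accepted characteristics.
list_instances = {
    "Writer_A": {("type", "Writer")},
    "Writer_B": set(),
}
accepted = [("type", "Writer"), ("ethnicity", "African_American")]
new_statements = missing_statements(list_instances, accepted)
```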
13. Challenges
• Robust extraction of instances
– from different kinds of list pages
– e.g., picking the right column in a table
– tables and bullet point lists already account for 75%
• Picking good scoring functions
– TF-IDF looks reasonable at first glance
• Combining statistical and textual evidence
• Scalable implementation
– Advantage: perfectly parallelizable
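Because each list page can be processed independently of the others, the work maps directly onto a worker pool. The sketch below uses Python's standard `concurrent.futures` module with threads for simplicity; for CPU-bound extraction a process pool or a cluster framework would be the more likely choice. `process_list_page` is a hypothetical placeholder for the per-page extraction and scoring.

```python
from concurrent.futures import ThreadPoolExecutor


def process_list_page(page):
    """Placeholder for the per-page work; here it only counts members."""
    title, members = page
    return title, len(members)


def process_all(pages, max_workers=4):
    """Process list pages independently and collect results by title."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; no page depends on another page.
        return dict(pool.map(process_list_page, pages))


# Made-up input: (title, members) pairs.
pages = [("List A", ["x", "y"]), ("List B", ["z"])]
results = process_all(pages)
```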