Introduction to DBpedia, the most popular and interconnected source of Linked Open Data. Part of EXPLORING WIKIDATA AND THE SEMANTIC WEB FOR LIBRARIES at METRO http://metro.org/events/598/
3. WHAT IT IS
DBpedia is a crowd-sourced community effort
to extract structured information from Wikipedia
and make this information available on the Web in
the form of Linked Open Data.
7. Connected with other Linked Datasets by 50 million RDF links
Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
CENTRAL INTERLINKING HUB OF THE WEB OF DATA
8. Web of Data Browsing and Crawling
Web Data Integration and Mashups
9. “Which albums did Miles Davis record with female
instrumentalists?”
“Which populated places in Australia are below
sea level?”
“What did Andy Warhol and Thelonious Monk have
in common?”
12. “THINGS”
Each thing in the DBpedia dataset is identified by
a URI of the form
http://dbpedia.org/resource/Name
Name is derived from the URL of the source
Wikipedia article, which has the form
http://en.wikipedia.org/wiki/Name.
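The naming rule above can be sketched in a few lines of Python. This is only an illustration of the URL-to-URI pattern described on the slide: the function name is made up for this example, and it assumes the article name can be copied over unchanged (any percent-encoding is left untouched).

```python
from urllib.parse import urlparse

WIKI_PATH_PREFIX = "/wiki/"
DBPEDIA_RESOURCE_BASE = "http://dbpedia.org/resource/"


def wikipedia_url_to_dbpedia_uri(url: str) -> str:
    """Derive a DBpedia resource URI from a Wikipedia article URL.

    Hypothetical helper: it only copies the article name from the
    /wiki/Name path onto the DBpedia resource base.
    """
    path = urlparse(url).path
    if not path.startswith(WIKI_PATH_PREFIX):
        raise ValueError("not a Wikipedia article URL: " + url)
    name = path[len(WIKI_PATH_PREFIX):]
    if not name:
        raise ValueError("URL has no article name: " + url)
    return DBPEDIA_RESOURCE_BASE + name


uri = wikipedia_url_to_dbpedia_uri("http://en.wikipedia.org/wiki/Billie_Holiday")
print(uri)
```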
17. “Facts” as RDF Triples
Subject (Thing) → Predicate (has name) → Object (Billie Holiday)
18. GENERATING FACTS FOR THE ENTITY BILLIE HOLIDAY
Subject (S): <http://dbpedia.org/resource/Billie_Holiday>
Predicate (P, "has name"): <http://xmlns.com/foaf/0.1/name>
Object (O): "Billie Holiday"
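As a rough illustration, the same fact can be written out as a single line in the N-Triples syntax. The helper below is a hypothetical sketch, not part of the DBpedia tooling, and it only handles plain string literals (it escapes backslashes and double quotes, nothing else).

```python
def to_ntriple(subject: str, predicate: str, literal: str) -> str:
    """Serialize one fact with a plain literal object as an N-Triples line."""
    # Escape backslashes first, then double quotes, so the literal stays valid.
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


line = to_ntriple(
    "http://dbpedia.org/resource/Billie_Holiday",
    "http://xmlns.com/foaf/0.1/name",
    "Billie Holiday",
)
print(line)
```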
22. HARVESTING FACTS
Wikipedia articles consist mostly of free text, but also
contain different types of structured information, such
as infobox templates, categorization information,
images, geo-coordinates, and links to external
Web pages.
26. The core of DBpedia consists
of an infobox extraction process.
Infoboxes are templates
contained in many Wikipedia
articles. They are usually
displayed in the top right corner
of articles and contain factual
information.
31. INFOBOX EXTRACTION
Raw Infobox Extraction – create triples directly
from the infobox data.
Mapping-based Infobox Extraction – create triples by
mapping infobox attributes to the DBpedia Ontology.
35. RAW INFOBOX EXTRACTION
Pros:
Complete coverage of all the infobox attributes
(not all the infoboxes have been mapped yet)
Cons:
Lower data quality (synonyms are not resolved,
e.g., placeOfBirth/birthPlace; high error rate when
determining the datatype of an attribute value)
36. MAPPING-BASED INFOBOX EXTRACTION
Pros:
Data is cleaner (typing resources, merging name
variants, assigning specific datatypes to the
values).
Cons:
Not full coverage.
4.58 million things
4.22 million are classified in a consistent ontology.
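The trade-off between the two extraction approaches can be sketched with a toy example. Everything here is an assumption made for illustration: the mapping table, the function names, and the sample infobox values are invented, and the real DBpedia extraction framework works on Wikipedia template markup rather than Python dictionaries.

```python
# Hypothetical mapping table: attribute name variants -> one ontology property.
ATTRIBUTE_MAPPINGS = {
    "placeOfBirth": "birthPlace",
    "birthPlace": "birthPlace",
    "birth_place": "birthPlace",
}


def raw_extract(subject, infobox):
    """Raw extraction: one triple per attribute, names kept as found."""
    return [(subject, key, value) for key, value in infobox.items()]


def mapped_extract(subject, infobox):
    """Mapping-based extraction: keep only mapped attributes, with merged names."""
    triples = []
    for key, value in infobox.items():
        prop = ATTRIBUTE_MAPPINGS.get(key)
        if prop is not None:
            triples.append((subject, prop, value))
    return triples


# Illustrative infobox content, not taken from the actual article markup.
infobox = {"placeOfBirth": "Philadelphia", "nickname": "Lady Day"}

raw = raw_extract("Billie_Holiday", infobox)
mapped = mapped_extract("Billie_Holiday", infobox)
print(len(raw), len(mapped))
```

The raw result keeps both attributes, including the unmapped one; the mapped result is cleaner but drops anything without a mapping.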
39. THE DBPEDIA ONTOLOGY
Cross-domain ontology
Large thematic coverage
Currently covers 685 classes which form
a subsumption hierarchy and 2,795 different
properties describing the classes
(e.g., aircraftHelicopterAttack)
Shallow (≤ 5 levels)
40. THE DBPEDIA ONTOLOGY
Because the DBpedia Ontology is built upon
infobox templates, its semantic structure suffers
from a lack of logical consistency and presents
significant semantic gaps in the hierarchy.
44. WIKIPEDIA CATEGORY SYSTEM
Wikipedia uses categories to group articles that share
similar subjects.
Wikipedia categories are constantly evolving and
currently number more than 740,000.
80.9 million links to Wikipedia categories.
45. WIKIPEDIA CATEGORY SYSTEM
Most categories are assigned manually by
Wikipedia contributors and can be found listed as
links at the bottom of a Wikipedia article.
47. CATEGORIZING PEOPLE
At least four categories:
• the year the person was born
• the year they died
• their nationality
• their reason for being notable.
48. CATEGORIZATION OF PEOPLE
First sentence of an article:
Billie Holiday (born Eleanora Fagan; April 7, 1915 – July
17, 1959) was an American jazz singer and songwriter.
Year born: Category:1915 births
Year died: Category:1959 deaths
Nationality: Category:American people
Reason for notability / Occupation: Category:Musicians
50. WIKIPEDIA CATEGORY SYSTEM
Collaborative effort
Advantages → categories are continually
updated to correspond with article content.
Dis/advantages → lack of consistency in its
hierarchical structure and “rather loose
relatedness between articles” (Bizer et al.,
2009). “Messy hierarchy”
51. RE-CATEGORIZATION OF BILLIE HOLIDAY
(→External links: re-categorisation per
Wikipedia:Categories for discussion/Log/2014 December
26, replaced: Category:American women composers
→ Category:American female composers) (undo) --
(Robot - Moving category African-American female
musicians to Category:African-American musicians per
CFD at Wikipedia:Categories for discussion/Log/2013
January 10.)
52. WIKIPEDIA ONTOLOGY IN DBPEDIA
The hierarchical structure of the categories is
represented in DBpedia by way of two different
properties:
dcterms:subject (relate entity to category)
skos:broader (relate child to parent category)
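A small sketch of how the two properties combine: dcterms:subject gives an entity's direct categories, and following skos:broader repeatedly collects their ancestors. The triples below are toy data written for this example (only the 1915 births category comes from the slides; the other category names and the prefixed identifiers are illustrative assumptions), and the function name is hypothetical.

```python
DCTERMS_SUBJECT = "dcterms:subject"
SKOS_BROADER = "skos:broader"

# Toy triples for illustration only; not a dump of real DBpedia data.
TRIPLES = [
    ("dbpedia:Billie_Holiday", DCTERMS_SUBJECT, "category:1915_births"),
    ("dbpedia:Billie_Holiday", DCTERMS_SUBJECT, "category:American_jazz_singers"),
    ("category:American_jazz_singers", SKOS_BROADER, "category:Jazz_singers"),
    ("category:Jazz_singers", SKOS_BROADER, "category:Jazz_musicians"),
]


def objects(triples, subject, predicate):
    """Return every object of triples matching the subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def all_categories(triples, entity):
    """Collect an entity's direct categories and all of their ancestors."""
    seen = set()
    stack = objects(triples, entity, DCTERMS_SUBJECT)
    while stack:
        category = stack.pop()
        if category in seen:
            continue  # guard against cycles in a loosely maintained hierarchy
        seen.add(category)
        stack.extend(objects(triples, category, SKOS_BROADER))
    return seen


categories = all_categories(TRIPLES, "dbpedia:Billie_Holiday")
print(sorted(categories))
```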
55. YAGO ONTOLOGY
A robust classification scheme with a deep hierarchical
structure.
Originally derived from the Wikipedia category system
using the semantic lexicon WordNet.
Over 350,000 classes; 100 relationships
Provides DBpedia data with coherence and structural
consistency
A taxonomic backbone
56. QUERYING DBPEDIA FOR LINKED JAZZ
Jazz Name Vocabulary
Personal name vocabulary in the form of RDF
statements including the artist’s name paired
with a Uniform Resource Identifier (URI).
<http://dbpedia.org/resource/Billie_Holiday>
<http://xmlns.com/foaf/0.1/name>
“Billie Holiday”
57. QUERYING DBPEDIA FOR LINKED JAZZ
DBpedia was initially queried for literal triples with a
foaf:name predicate that satisfied the following criteria:
1. the entity must have an rdf:type of dbpedia-owl:MusicalArtist
2. the entity must have a dbpedia:genre property with the value dbpedia:Jazz.
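The two criteria can be expressed as a SPARQL query. The sketch below only builds the query text and is not the query the Linked Jazz project actually ran. It assumes the prefix bindings shown (dbpedia-owl for the DBpedia ontology namespace, dbpedia for resources) and assumes the genre property lives in the ontology namespace as dbpedia-owl:genre; the function name is hypothetical.

```python
# Assumed prefix bindings; adjust them to the endpoint being queried.
PREFIXES = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
"""


def build_name_query(genre: str = "dbpedia:Jazz") -> str:
    """Build a SPARQL query for names of musical artists with a given genre."""
    return PREFIXES + f"""SELECT DISTINCT ?artist ?name WHERE {{
  ?artist rdf:type dbpedia-owl:MusicalArtist ;
          dbpedia-owl:genre {genre} ;
          foaf:name ?name .
}}"""


query = build_name_query()
print(query)
```

The resulting text could be pasted into a SPARQL endpoint's query form; sending it programmatically is left out here on purpose.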
58. QUERYING DBPEDIA FOR LINKED JAZZ
In addition to the foaf:name criteria above:
rdfs:label → name of the resource
59. QUERYING DBPEDIA FOR LINKED JAZZ
Prominent musicians whom we expected to find by
querying for the genre dbpedia:Jazz were not returned.
Example: “Count Basie”
- fell under dbpedia:Swing_music,
dbpedia:Big_band_music and dbpedia:Piano_blues
- not under dbpedia:Jazz
This required us to revise our query method by
expanding it to include additional relevant music genres.
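The effect of widening the genre list can be illustrated with a small in-memory filter. The Count Basie genres are the ones named on the slide; the two "Example" artists, the set of accepted genres, and the function name are hypothetical and exist only to make the comparison visible.

```python
JAZZ_ONLY = {"dbpedia:Jazz"}

# Assumed expansion: the genres the slide reports for Count Basie.
EXPANDED_GENRES = JAZZ_ONLY | {
    "dbpedia:Swing_music",
    "dbpedia:Big_band_music",
    "dbpedia:Piano_blues",
}

ARTIST_GENRES = {
    "dbpedia:Count_Basie": {
        "dbpedia:Swing_music",
        "dbpedia:Big_band_music",
        "dbpedia:Piano_blues",
    },
    "dbpedia:Example_Jazz_Artist": {"dbpedia:Jazz"},        # hypothetical entry
    "dbpedia:Example_Rock_Artist": {"dbpedia:Rock_music"},  # hypothetical entry
}


def select_artists(artist_genres, accepted_genres):
    """Return artists with at least one genre in the accepted set."""
    return sorted(
        artist
        for artist, genres in artist_genres.items()
        if genres & accepted_genres
    )


narrow = select_artists(ARTIST_GENRES, JAZZ_ONLY)
wide = select_artists(ARTIST_GENRES, EXPANDED_GENRES)
print(narrow)
print(wide)
```

With the jazz-only set Count Basie is missed; with the expanded set he is returned.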
62. IN SUM
New type of knowledge representation
environment
- constant state of flux.
- decentralized interplay of different descriptive
and classification systems.
- it challenges our tolerance threshold for data
quality and our traditional notion of authority
control.