Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they encompass solely a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, an iterative bootstrapping strategy for extracting RDF from unstructured data. The idea behind BOA is to use the Data Web as background knowledge for the extraction of natural language patterns that represent predicates found on the Data Web. These patterns are used to extract instance knowledge from natural language text. This knowledge is finally fed back into the Data Web, thereby closing the loop. We evaluate our approach on two data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy. Moreover, we provide the first repository of natural language representations of predicates found on the Data Web.
2. Bootstrapping the Data Web
Motivation
๏ Most knowledge bases extracted from (semi-)structured data
๏ Only 15-20% of information is available as structured data
๏ Semantic Web ⬌ Document Web
๏ How can we extract data from the document-oriented web?
WeKEx@ISWC - 17.01.2012 - Page 2 http://boa.aksw.org
3. Bootstrapping the Data Web
Idea I
dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii
dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama
dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party
4. Bootstrapping the Data Web
Idea II
Barack Obama was born in Honolulu, Hawaii.
Barack Hussein Obama is a politician of the Democratic Party.
Obama married Michelle Robinson in 1992.
5. Bootstrapping the Data Web
Idea III
๏ “married”: Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.
๏ “is a politician of the”: Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.
๏ “was born in”: Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.
6. Bootstrapping the Data Web
Related Work
๏ ReadTheWeb Project: N(ever) E(nding) L(anguage) L(earner)
๏ PROSPERA: Scalable Knowledge Harvesting with High Precision and High Recall
7. Bootstrapping the Data Web
The BOA approach
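[Figure: BOA architecture with five numbered steps. Labelled components: Corpus Extraction (Crawler, Cleaner, Indexer) from the Web into Corpora; Knowledge Acquisition of Background Knowledge from the Data Web via SPARQL; Pattern Search; Filtering; Pattern Scoring; Patterns; RDF Generation. The generated RDF is fed back into the Data Web for use in the next iteration.]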
9. Bootstrapping the Data Web
Pattern Search
(1) Set of entities s and o connected through p
(2) Find all sentences which contain s and o
(3) Replace labels with variables (?D?, ?R?)
BOA pattern:
dbpedia-owl:spouse ↣ “?D? with his wife ?R?”
BOA pattern mapping:
dbpedia-owl:spouse ↣ “?D? with his wife ?R?”, “?D? and her husband ?R?”, “?D? and his wife ?R?”
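The three pattern search steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_patterns`, the toy `TRIPLES`, `LABELS` and `SENTENCES`, and the exact-substring label matching are all hypothetical and not taken from the BOA implementation. Following the slide examples, `?D?` is read as the subject (domain) and `?R?` as the object (range).

```python
from collections import defaultdict

# Toy background knowledge and corpus (hypothetical, based on the slide examples).
TRIPLES = [
    ("dbpedia:Barack_Obama", "dbpedia-owl:birthPlace", "dbpedia:Honolulu,_Hawaii"),
    ("dbpedia:Barack_Obama", "dbpedia-owl:spouse", "dbpedia:Michelle_Obama"),
]
LABELS = {
    "dbpedia:Barack_Obama": "Barack Obama",
    "dbpedia:Honolulu,_Hawaii": "Honolulu, Hawaii",
    "dbpedia:Michelle_Obama": "Michelle Obama",
}
SENTENCES = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Obama married Michelle Robinson in 1992.",
]


def find_patterns(triples, labels, sentences):
    """Map each predicate to the patterns found between subject and object labels."""
    mappings = defaultdict(set)
    for s, p, o in triples:                      # (1) entities s and o connected through p
        s_label, o_label = labels[s], labels[o]
        for sentence in sentences:               # (2) sentences containing both labels
            i, j = sentence.find(s_label), sentence.find(o_label)
            if i < 0 or j < 0:
                continue
            # (3) replace the labels with variables and keep the text in between
            if i + len(s_label) <= j:
                mappings[p].add("?D?" + sentence[i + len(s_label):j] + "?R?")
            elif j + len(o_label) <= i:
                mappings[p].add("?R?" + sentence[j + len(o_label):i] + "?D?")
    return dict(mappings)
```

On the toy data, only the birthPlace triple yields a pattern. The spouse sentence is missed because neither label occurs verbatim in it, which shows why exact label matching is a simplification.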
10. Bootstrapping the Data Web
Pattern Scoring - Support
Support
A pattern should be used across several triples in the background knowledge.
subsidiary ↣ “?R? was acquired by ?D?”
๏ [Google, DoubleClick] ↣ 2
๏ [General Motors, Opel] ↣ 1
๏ [Cablevision, Rainbow Media] ↣ 4
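As a rough sketch of how such counts might feed a support score, the snippet below derives two features from the example: the number of distinct entity pairs and the highest count for a single pair. Two assumptions are made here: each number on the slide is read as how often the pair occurs with the pattern, and the log-damped product in `support_score` is a hypothetical combination, not the formula used by BOA.

```python
import math

# Counts from the slide for "?R? was acquired by ?D?" under `subsidiary`.
# Assumption: each number is how often the pair occurs with the pattern.
PAIR_COUNTS = {
    ("Google", "DoubleClick"): 2,
    ("General Motors", "Opel"): 1,
    ("Cablevision", "Rainbow Media"): 4,
}


def support_features(pair_counts):
    """Return (number of distinct pairs, highest count for a single pair)."""
    counts = [c for c in pair_counts.values() if c > 0]
    return len(counts), max(counts, default=0)


def support_score(pair_counts):
    """Hypothetical combination: grows with both features, damped by logarithms."""
    distinct_pairs, max_count = support_features(pair_counts)
    return math.log(1 + distinct_pairs) * math.log(1 + max_count)
```

A pattern seen with only one pair, or only once per pair, scores low under this sketch, which matches the intuition stated on the slide.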
11. Bootstrapping the Data Web
Pattern Scoring - Specificity
Specificity
A pattern should not be used by many pattern mappings.
๏ subsidiary: “?D? agreed to buy ?R?”
๏ subsidiary: “?R? is a part of ?D?”
๏ foundationOrganisation: “?R? is a part of ?D?”
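One way to make this idea concrete is an inverse-document-frequency style measure over pattern mappings. The sketch below assumes that form; the function name `specificity` and the restriction to the two mappings shown on the slide are illustrative choices, not the BOA implementation.

```python
import math

# Pattern mappings from the slide (predicate -> set of patterns).
MAPPINGS = {
    "subsidiary": {"?D? agreed to buy ?R?", "?R? is a part of ?D?"},
    "foundationOrganisation": {"?R? is a part of ?D?"},
}


def specificity(pattern, mappings):
    """IDF-style score (assumed form): log(#mappings / #mappings containing the pattern)."""
    containing = sum(1 for patterns in mappings.values() if pattern in patterns)
    if containing == 0:
        raise ValueError("pattern does not occur in any mapping")
    return math.log(len(mappings) / containing)
```

With only these two mappings, “?R? is a part of ?D?” occurs in both and scores 0, while “?D? agreed to buy ?R?” occurs in one and scores log 2.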
12. Bootstrapping the Data Web
Pattern Scoring - Typicity
Typicity
A pattern should connect entities of the correct types.
๏ Hypercom was acquired by Verifone .
๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
๏ Maktoob was acquired by Yahoo!
๏ Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O
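The type check behind this example can be sketched as follows. The snippet reads the `token_TAG` strings from the slide, takes the first and last tagged entity of each sentence, and reports the fraction of sentences whose entity types match the expected ones. The helper names, the simple fraction, and the assumption that both arguments of `subsidiary` should be tagged `ORG` are illustrative and not taken from BOA.

```python
# NER-tagged sentences from the slide (token_TAG, with O meaning "no entity").
TAGGED = [
    "Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O",
    "Maktoob_PER was_O acquired_O by_O Yahoo_ORG ._O",
]


def entity_tags(tagged):
    """Return the tags of the first and last entity token, or None if there is none."""
    tags = [token.rsplit("_", 1)[1] for token in tagged.split()]
    entities = [tag for tag in tags if tag != "O"]
    if not entities:
        return None
    return entities[0], entities[-1]


def typicity(tagged_sentences, first_type, last_type):
    """Fraction of sentences whose first and last entity have the expected types."""
    found = [entity_tags(s) for s in tagged_sentences]
    found = [tags for tags in found if tags is not None]
    if not found:
        return 0.0
    hits = sum(1 for tags in found if tags == (first_type, last_type))
    return hits / len(found)
```

For the two slide sentences and expected types `ORG`/`ORG`, the second sentence fails because Maktoob is tagged `PER`, so half of the sentences match.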
13. Bootstrapping the Data Web
RDF Generation
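Pattern: ?D? with his wife ?R?
Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...
NER: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O
Resulting RDF (statements about the new resource boa:Leyla_Rodriguez_Stahl are marked NEW):
dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl (NEW)
dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person
dbpedia:Abel_Pacheco rdfs:label "Abel Pacheco"@en
boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person (NEW)
boa:Leyla_Rodriguez_Stahl rdfs:label "Leyla Rodriguez Stahl"@en (NEW)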
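The extraction step in this example can be sketched as follows. The code locates the pattern's words in the NER-tagged sentence, takes the nearest named entity on each side, and mints a `boa:` URI for labels it does not know. Several parts are assumptions made for illustration: the nearest-entity heuristic, the `KNOWN_URIS` lookup standing in for entity linking, and all function names.

```python
KNOWN_URIS = {"Pacheco": "dbpedia:Abel_Pacheco"}  # hypothetical label-to-URI lookup


def parse_tagged(tagged):
    """Split 'token_TAG' items into (token, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in tagged.split()]


def entity_spans(tokens):
    """Group contiguous tokens sharing a non-O tag into (start, end, text) spans."""
    spans, i = [], 0
    while i < len(tokens):
        tag = tokens[i][1]
        if tag == "O":
            i += 1
            continue
        j = i
        while j < len(tokens) and tokens[j][1] == tag:
            j += 1
        spans.append((i, j, " ".join(word for word, _ in tokens[i:j])))
        i = j
    return spans


def extract_pair(tagged, pattern):
    """Return (domain label, range label) around the pattern text, or None."""
    tokens = parse_tagged(tagged)
    words = [word for word, _ in tokens]
    parts = pattern.split()  # assumed shape: variable, middle words, variable
    first_var, middle, last_var = parts[0], parts[1:-1], parts[-1]
    spans = entity_spans(tokens)
    for k in range(len(words) - len(middle) + 1):
        if words[k:k + len(middle)] != middle:
            continue
        left = [s for s in spans if s[1] <= k]                 # entities before the pattern
        right = [s for s in spans if s[0] >= k + len(middle)]  # entities after the pattern
        if left and right:
            found = {first_var: left[-1][2], last_var: right[0][2]}
            return found["?D?"], found["?R?"]
    return None


def to_uri(label):
    """Reuse a known URI or mint a new one in the boa: namespace."""
    return KNOWN_URIS.get(label, "boa:" + label.replace(" ", "_"))


def generate_triple(tagged, pattern, predicate):
    pair = extract_pair(tagged, pattern)
    if pair is None:
        return None
    return (to_uri(pair[0]), predicate, to_uri(pair[1]))
```

Applied to the tagged sentence from the slide with the pattern “?D? with his wife ?R?” and the predicate dbpedia-owl:spouse, this sketch links dbpedia:Abel_Pacheco to the newly minted boa:Leyla_Rodriguez_Stahl.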
14. Bootstrapping the Data Web
Evaluation I
Corpora:
                 en-wiki              en-news
Language         english              english
Topic            general knowledge    news
# of lines       44.7M                256.1M
# of words       1,032.1M             5,068.7M

[Figure: # of triples per property (riverMouth, musicalArtist, musicalBand, award, writer, almaMater, occupation, formerTeam, deathPlace, birthPlace).]
[Figure: # of triples in which a Place, Person or Organisation is subject or is object. Values shown: 158697, 551693, 327430, 137990, 72820, 64239.]
15. Bootstrapping the Data Web
Evaluation II
                          en-wiki                        en-news
                          LOC       PER       ORG        LOC       PER       ORG
Triples extracted         1465      8817      2567       488       903       916
Triples in DBpedia        138       183       48         52        44        7
Evaluated Triples         100 (8)   100 (1)   100 (1)    100 (1)   100 (7)   100 (0)
Precision (%)             90.5      97        99         61.5      73.5      91
New true Statements*      1200      8375      2494       268       631       827
Found pattern mappings    62        72        59         49        70        55
Found patterns            123k      136k      38k        569k      465k      92k
Scored patterns           1045      612       241        3832      7294      1077
* Number of extracted statements not found in DBpedia multiplied with the precision of our approach
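The footnote's estimate can be reproduced with a short calculation. The sketch below applies it to the figures in the table; the name `new_true_statements` and the layout of `RESULTS` are chosen here for illustration. The reported values agree with the computed ones to within one statement, and the slide does not say how they were rounded.

```python
# (corpus, class) -> (triples extracted, triples in DBpedia, precision in %, reported estimate)
RESULTS = {
    ("en-wiki", "LOC"): (1465, 138, 90.5, 1200),
    ("en-wiki", "PER"): (8817, 183, 97.0, 8375),
    ("en-wiki", "ORG"): (2567, 48, 99.0, 2494),
    ("en-news", "LOC"): (488, 52, 61.5, 268),
    ("en-news", "PER"): (903, 44, 73.5, 631),
    ("en-news", "ORG"): (916, 7, 91.0, 827),
}


def new_true_statements(extracted, in_dbpedia, precision_percent):
    """Statements not found in DBpedia, multiplied by the measured precision."""
    return (extracted - in_dbpedia) * precision_percent / 100


for (corpus, cls), (extracted, known, precision, reported) in RESULTS.items():
    estimate = new_true_statements(extracted, known, precision)
    print(f"{corpus} {cls}: computed {estimate:.1f}, reported {reported}")
```

For example, en-wiki PER gives (8817 - 183) × 0.97 ≈ 8375.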
16. Bootstrapping the Data Web
Future Work
๏ Iteration 1+
๏ Human feedback
๏ Pattern generalization
๏ Datatype Properties
๏ Languages/Corpora
๏ Webservices
17. Bootstrapping the Data Web
Conclusion
๏ No manually created seed patterns needed
๏ 95.5% Precision on DBpedia/Wikipedia
๏ Output easily integrable in LOD Cloud
๏ Library of natural-language representations of formal relations, Demo
๏ Quasi language independent (German/Korean)