Background
Methodology
Evaluation
Conclusions
References
LODIE
Linked Open Data Information Extraction
Fabio Ciravegna Anna Lisa Gentile Ziqi Zhang
OAK Group, Department of Computer Science
The University of Sheffield, UK
9th October 2012
Semantic Web and Information Extraction
SWAIE @ EKAW 2012
Fabio Ciravegna, Anna Lisa Gentile, Ziqi Zhang 1 / 19
LODIE: Overview
Linked Open Data for Web-scale Information Extraction
Web-scale IE
number of documents, domains, facts
efficient and effective methods required
Linked Open Data to seed learning
“[. . . ] a recommended best practice for exposing, sharing, and
connecting data [. . . ] using URIs and RDF”(linkeddata.org).
a large KB of typed instances, relations, annotations (e.g., RDFa)
Adapting to specific user information needs
users define specific IE tasks by specifying the types of instances
and relations to be learnt
Challenges: user information needs
SoA defines a generic IE task - KnowItAll, StatSnowball,
PROSPERA, NELL, Extreme Extraction [3, 4, 5, 11] -
extracting “people, organisation, location”, etc. and their
generic relations
RQ how to let users define Web-IE tasks tailored to their
own needs, e.g., “drugs that treat hayfever”
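As a minimal sketch of such a user-defined task (the `IETask` structure and the property name `dbo:treats` are hypothetical illustrations, not part of LODIE), the running example “drugs that treat hayfever” can be captured as target classes plus a target relation:

```python
from dataclasses import dataclass, field

@dataclass
class IETask:
    """A user-defined Web-IE task: which instance types and relations to learn."""
    name: str
    # ontology classes whose instances should be extracted
    target_classes: list = field(default_factory=list)
    # (subject_class, property, object_class) relations to populate
    target_relations: list = field(default_factory=list)

# "drugs that treat hayfever", phrased against a DBpedia-style vocabulary
task = IETask(
    name="drugs-for-hayfever",
    target_classes=["dbo:Drug", "dbo:Disease"],
    target_relations=[("dbo:Drug", "dbo:treats", "dbo:Disease")],
)
```

The point of the structure is that the task is machine-understandable: every later stage (seed retrieval, learning, evaluation) can be driven from these class and relation URIs.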
Challenges: training data
SoA requires a certain amount of training/learning resources to
be manually specified
RQ how to automatically obtain these (and filter noise) from
the LOD
Challenges: learning strategies
SoA Typically semi-supervised, bootstrapping-based learning
from unstructured texts, prone to error propagation
RQ how to combine multi-strategy learning (e.g., from both
structured and unstructured contents) to avoid drifting
away from the learning task
Challenges: publication of triples
SoA No integration with existing KBs
RQ how to integrate extracted knowledge with the LOD
User needs formalisation
Goal: Support users in formalising their information needs in a
machine understandable format
Hypothesis
Users define information needs in terms of ontologies
Users use different vocabularies in ontology creation
Methods
Baseline: manually identify relevant ontologies on the LOD and
define a view on them using tools like neon-toolkit.org
Ontology Design Patterns: reuse existing Content ODP building
blocks and apply re-engineering patterns to bridge the
“vocabulary gap”
Learning Seed Identification and Filtering - I
Goal 1: Automatically identify training data in the form of triples and
annotations to seed learning
Hypothesis:
The LOD may already contain answers to user needs in the form of
triples and annotations
The Web contains additional linguistic realisations of triples
Method
From the LOD - SPARQL queries to fetch seed triples (and
annotations) matching the user needs
From the Web - Search for linguistic realisations of triples
(identified above):
co-occurrence of related instances in textual contexts, e.g., sentences
structural elements, e.g., tables
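A sketch of the LOD step: given one target relation of the task, a SPARQL query can be generated that fetches candidate seed instance pairs from an endpoint such as DBpedia's (the query template is illustrative, and `dbo:treats` is a hypothetical property name):

```python
def seed_triple_query(subj_class, prop, obj_class, limit=100):
    """Build a SPARQL query fetching instance pairs of a target relation,
    to be sent to a Linked Open Data endpoint."""
    return f"""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?s ?o WHERE {{
  ?s rdf:type {subj_class} .
  ?o rdf:type {obj_class} .
  ?s {prop} ?o .
}} LIMIT {limit}"""

# seed triples for the running "drugs that treat hayfever" example
query = seed_triple_query("dbo:Drug", "dbo:treats", "dbo:Disease")
```

The returned (?s, ?o) pairs are the seed triples; the Web step then searches for their linguistic realisations in sentences and tables.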
Learning Seed Identification and Filtering - II
Goal 2: Filter noisy training data and select the most informative
examples for learning
Hypothesis:
Identified learning seeds can contain noise (causing
“drifting-away”)
... and can be redundant (causing unnecessary overheads)
Good learning examples are consistent w.r.t. the learning task
and diverse
Method
Consistency measure - cluster seed instances of different classes
N times (varying parameters); does instance i always appear in the
cluster representing the same class?
Variability measure - cluster seed instances of the same class;
how many clusters are generated, and how dense are they?
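The consistency measure can be sketched as follows (a minimal illustration, assuming the N clusterings have already been computed; the data here are toy values, not LODIE's actual features):

```python
from collections import Counter

def consistency(instance_classes, clusterings):
    """For each seed instance, the fraction of clustering runs in which it
    lands in a cluster whose majority seed class equals its own class.

    instance_classes: list of class labels, one per seed instance
    clusterings: list of runs; each run is a list of cluster ids per instance
    """
    n = len(instance_classes)
    scores = [0.0] * n
    for run in clusterings:
        # majority seed class of each cluster in this run
        by_cluster = {}
        for cid, cls in zip(run, instance_classes):
            by_cluster.setdefault(cid, Counter())[cls] += 1
        majority = {cid: c.most_common(1)[0][0] for cid, c in by_cluster.items()}
        for i, cid in enumerate(run):
            if majority[cid] == instance_classes[i]:
                scores[i] += 1
    return [s / len(clusterings) for s in scores]

classes = ["Drug", "Drug", "Disease", "Disease"]
runs = [[0, 0, 1, 1],   # clean run: clusters align with the classes
        [0, 1, 1, 1]]   # instance 1 drifts into the Disease cluster
print(consistency(classes, runs))  # → [1.0, 0.5, 1.0, 1.0]
```

Low-consistency instances (here, instance 1) are candidates for filtering as noise before learning.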
Multi-Strategy Learning
Goal: Learning from different content types (e.g., structured, unstructured)
with different strategies to improve both recall and precision
Hypothesis:
The same piece of knowledge can be repeated in different forms,
e.g., in tables vs. sentences (reinforcing evidence, Precision)
Some knowledge may be found only in one form or another
(Recall)
Method: multi-strategy learning
Learning from structures such as tables and lists [10, 8]
Inducing wrappers for regular pages [7]
Lexical-syntactic pattern learning from free texts
Combine outputs from different processes
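The combination step can be sketched as evidence-weighted voting, so that a fact extracted by several strategies (say, from a table and from free text) accumulates support (the strategy names, weights, and threshold here are illustrative assumptions):

```python
from collections import defaultdict

def combine(extractions, weights, threshold=1.0):
    """Merge (triple, strategy) extractions using per-strategy weights;
    keep triples whose accumulated evidence reaches the threshold."""
    score = defaultdict(float)
    for triple, strategy in extractions:
        score[triple] += weights[strategy]
    return {t for t, s in score.items() if s >= threshold}

weights = {"table": 0.6, "wrapper": 0.6, "text": 0.5}
extractions = [
    (("Loratadine", "treats", "hayfever"), "table"),
    (("Loratadine", "treats", "hayfever"), "text"),   # reinforcing evidence
    (("Aspirin", "treats", "hayfever"), "text"),      # single, weak source
]
accepted = combine(extractions, weights)
```

Only the reinforced triple passes the threshold, matching the hypothesis above: agreement across strategies raises precision, while each strategy alone contributes recall.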
Integration with the LOD
Goal: Assign unique identifiers to the extracted knowledge
Hypothesis:
Knowledge that already exists in the LOD can be re-extracted and
must be integrated
Method:
simple, scalable disambiguation methods, e.g., feature
overlap [2] and string distance metrics [9]
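A sketch of such a disambiguation step, combining Jaccard overlap of context features with a string similarity between mention and label (the candidate tuples and URIs are hypothetical; the string metric here is Python's `difflib` ratio, standing in for the metrics of [9]):

```python
from difflib import SequenceMatcher

def disambiguate(mention, context_words, candidates):
    """Pick the LOD URI whose label and context features best match the
    mention: Jaccard feature overlap plus string similarity."""
    def score(cand):
        uri, label, features = cand
        union = context_words | features
        overlap = len(context_words & features) / len(union) if union else 0.0
        return overlap + SequenceMatcher(None, mention.lower(), label.lower()).ratio()
    return max(candidates, key=score)[0]

# hypothetical candidates for the mention "Paris"
candidates = [
    ("dbr:Paris", "Paris", {"france", "city", "capital"}),
    ("dbr:Paris_Hilton", "Paris Hilton", {"actress", "celebrity"}),
]
uri = disambiguate("Paris", {"city", "france"}, candidates)  # → "dbr:Paris"
```

Both signals are cheap to compute, which is the point: at Web scale, the linking step must stay simple and scalable.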
User Feedback
Goal: Integrate user feedback into learning
Hypothesis:
Automatic extraction is imperfect and user feedback can help
improve learning
Method:
Expose extracted knowledge via an interface and collect user
feedback - errors can be used as negative examples for re-training
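A minimal sketch of the feedback loop (the data structures are hypothetical): judged triples are split into positives and negatives for the next training round, with unjudged triples left pending.

```python
def apply_feedback(extracted, feedback):
    """Split extracted triples into positive and negative examples for
    re-training, based on user accept/reject feedback."""
    positives = [t for t in extracted if feedback.get(t) is True]
    negatives = [t for t in extracted if feedback.get(t) is False]
    pending = [t for t in extracted if t not in feedback]
    return positives, negatives, pending

extracted = [("Loratadine", "treats", "hayfever"),
             ("Paris", "treats", "hayfever")]
feedback = {("Paris", "treats", "hayfever"): False}  # user flags an error
pos, neg, pending = apply_feedback(extracted, feedback)
```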
Evaluation
suitability to formalise the user needs
suitability of the approach to IE
Evaluation: suitability to formalise the user needs
Task described in natural language → equivalent IE task
feasibility and efficiency (reasonable time, limited overhead)
effectiveness
is the result representative of the user needs?
users judge the description of the task
users judge the resulting instances
is the result suitable for seeding IE?
usefulness of triples for learning, measured with the proposed
quality measures
Evaluation: suitability of the approach to IE
definition of a new task: population of sections of the schema.org
ontology
precision
partial evaluation of recall
fraction of the available annotated instances
checking recall with respect to already available annotated
instances, not provided for training
comparative large scale IE evaluations
TAC Knowledge Base Population [6]
TREC Entity Extraction task [1]
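The precision/partial-recall scheme above can be sketched as follows (a minimal illustration with toy sets): precision over a judged sample of extractions, and recall measured only against gold annotations withheld from training.

```python
def precision(extracted, judged_correct):
    """Fraction of extracted items judged correct."""
    return len(extracted & judged_correct) / len(extracted) if extracted else 0.0

def partial_recall(extracted, held_out_gold):
    """Recall with respect to already available annotated instances
    that were not provided for training."""
    return len(extracted & held_out_gold) / len(held_out_gold) if held_out_gold else 0.0

extracted = {"a", "b", "c", "d"}       # system output
judged_correct = {"a", "b", "c"}       # human-judged sample
held_out_gold = {"a", "b", "e", "f"}   # annotations withheld from training

p = precision(extracted, judged_correct)        # 0.75
r = partial_recall(extracted, held_out_gold)    # 0.5
```

Recall is only partial because the Web corpus has no exhaustive gold standard; the held-out annotations give a lower-bound estimate.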
Impact
LODIE timeliness
LOD: the first very large-scale information resource available for IE,
covering a growing number of domains
LODIE output
Web-scale IE task corpora, linked resources, etc.
developed code available as open source under the MIT licence
all the data generated will be made available using a licence such
as Open Data Commons (opendatacommons.org)
[1] Krisztian Balog and Pavel Serdyukov.
Overview of the TREC 2010 Entity Track.
In Proceedings of the Nineteenth Text REtrieval Conference (TREC 2010). NIST, 2011.
[2] Satanjeev Banerjee and Ted Pedersen.
An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet.
In CICLing ’02: Proceedings of the Third International Conference on Computational
Linguistics and Intelligent Text Processing, pages 136–145, London, UK, 2002. Springer-Verlag.
[3] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell.
Toward an Architecture for Never-Ending Language Learning.
In AAAI 2010, pages 1306–1313, 2010.
[4] Oren Etzioni, Ana-Maria Popescu, Daniel S. Weld, Doug Downey, and Alexander Yates.
Web-Scale Information Extraction in KnowItAll (Preliminary Results).
In WWW 2004, pages 100–110, 2004.
[5] Marjorie Freedman, Lance Ramshaw, Elizabeth Boschee, Ryan Gabbard, Gary Kratkiewicz,
Nicolas Ward, and Ralph Weischedel.
Extreme Extraction – Machine Reading in a Week.
In EMNLP 2011, pages 1437–1446, 2011.
[6] Heng Ji and Ralph Grishman.
Knowledge Base Population: Successful Approaches and Challenges.
In ACL 2011, 2011.
[7] N. Kushmerick.
Wrapper Induction for Information Extraction.
In IJCAI 1997, 1997.
[8] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti.
Annotating and Searching Web Tables Using Entities, Types and Relationships.
In VLDB 2010, 2010.
[9] Vanessa Lopez, Miriam Fernández, Nico Stieler, and Enrico Motta.
PowerAqua: supporting users in querying and exploring the Semantic Web content.
[10] David Milne and Ian H. Witten.
Learning to Link with Wikipedia.
In CIKM 2008, pages 509–518, 2008.
[11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum.
Scalable knowledge harvesting with high precision and high recall.
In WSDM 2011, pages 227–236, 2011.