
Ontology-based information extraction in the DERI Reading Group


The DERI Reading Group (10.11.2010)

http://www.deri.ie/teaching/reading-groups/archive/

Published in: Technology, Education

  1. The DERI Reading Group. Ontology-based information extraction: An Overview & Survey (2010, Wimalasuriya and Dou). Tobias Wunner, UNLP Group; Paul Buitelaar. Copyright 2010 Digital Enterprise Research Institute. All rights reserved.
  2. Definition - Motivation
     a) Create content for the Semantic Web
        • convert existing websites into ontologies
     b) Improve quality of existing ontologies
        • Test criterion: OBIE task
        • OBIE good => ontology good
  3. Overview
     • Access to information…
  4. Overview
     • Access to information…
     • Ontology-based Information Extraction (OBIE): “A system that processes unstructured or semi-structured natural language text guided by an ontology and presents the output in an ontology.”
  5. Overview
     • ESWC dogfood: OBIE-related topics
     [Figure label: New!]
  6. Problem – two scenarios
     • 1. Text only: extract conceptualization and instances
     [Diagram labels: T (text); County; building with café and football table; Building; is-a; 1. conceptualization; 2. instances; Galway; DERI building]
  7. Problem – two scenarios
     • 1. Text only: extract conceptualization and instances
     • conceptualization can be too specific / generic
     • wrong conceptualization
     [Diagram labels: T (text); County; building with café and football table; Building; is-a; 1. conceptualization; 2. instances; Galway; DERI building]
  8. Problem – two scenarios
     • 2. Domain ontology & text: extract instances only
     [Diagram labels: T (text); City; Building; located in; conceptualization by domain ontology; 2. instances; Galway; DERI building]
  9. Problem – two scenarios
     • 2. Domain ontology & text: extract instances only
     • less generic, but semantically more stable
     [Diagram labels: T (text); City; Building; located in; conceptualization by domain ontology; 2. instances; Galway; DERI building]
  10. Definition – key characteristics
     a) Process unstructured / semi-structured text
     b) “guided” by an ontology
     c) Present output in an ontology
     [Diagram labels: Text Source; Information Extractor; Ontology; guided by]
  11. Definition – ontology learning or population?
     • Ontology population ⊂ OBIE
     • “OBIE is Open information extraction” (Etzioni)
     • alternative: semantics given by ontology!
     • extractors can be inside / outside ontology
     [Diagram labels: Text Source; Information Extractor; Ontology; guided by]
  12. Methods
     • Information extractors
       1. Linguistic rules
       2. Gazetteer lists
       3. Classification (classical / structure-aware)
       4. Partial parse trees
       5. Structured data analyzers
       6. Web querying
  13. Linguistic Rules - Methods
     • Regular expressions
       • <COMPANY> .* revenue <Number> <currency>
         “Tesco’s revenue in 2009 was 3.4 billion GBP.”
     • Extraction ontologies
       • combination of ontology and lexicon (Mädche, Embley, Buitelaar)
       • manual construction
     • High precision
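
The regular-expression rule on the slide above can be sketched roughly as follows. This is only an illustration, not the rule language of any system covered in the survey: the `COMPANIES` and `CURRENCIES` lists, the number format, and the `extract_revenue` helper are assumptions made up for this example.

```python
import re

# Hypothetical slot fillers standing in for <COMPANY> and <currency>.
COMPANIES = ["Tesco", "Siemens", "SAP", "Vestas"]
CURRENCIES = ["GBP", "EUR", "USD"]

# <COMPANY> .* revenue <Number> <currency>
# Assumption: a number is digits, optionally followed by "million"/"billion".
PATTERN = re.compile(
    r"(?P<company>" + "|".join(map(re.escape, COMPANIES)) + r")"
    r".*?revenue.*?"
    r"(?P<number>\d+(?:\.\d+)?(?:\s(?:million|billion))?)\s"
    r"(?P<currency>" + "|".join(CURRENCIES) + r")"
)


def extract_revenue(sentence):
    """Return the matched slots as a dict, or None if the rule does not fire."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "company": match.group("company"),
        "number": match.group("number"),
        "currency": match.group("currency"),
    }


result = extract_revenue("Tesco's revenue in 2009 was 3.4 billion GBP.")
print(result)
```

Because the rule only fires on this exact surface pattern, it tends to give high precision at the price of missing paraphrases, which matches the "High precision" remark on the slide.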
  14. Gazetteer Methods
     • 2. Gazetteer lists
       • Phrases / words instead of patterns
       • Named-Entity Recognition
     • Requirements:
       1) Specify what is being extracted
       2) Specify sources and avoid manual creation
     [Diagram labels: industry: Semantic Web, Software, Energy, Supermarket, …; example phrases: “The software giant SAP…”, “Tesco a UK supermarket …”, “Siemens energy revenue…”, “… wind energy company Vestas”]
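
A gazetteer lookup of the kind described on the slide above might look like the sketch below. The `GAZETTEER` mapping is hypothetical; the company-to-industry pairs are guessed from the slide's example phrases and are not taken from any real resource.

```python
import re

# Hypothetical gazetteer: surface form -> ontology class (industry).
GAZETTEER = {
    "SAP": "Software",
    "Tesco": "Supermarket",
    "Siemens": "Energy",
    "Vestas": "Energy",
}


def annotate(text):
    """Return (surface form, class, start offset) for every gazetteer hit."""
    hits = []
    for name, cls in GAZETTEER.items():
        # Whole-word match so that short entries do not fire inside longer words.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, cls, match.start()))
    # Report hits in reading order.
    return sorted(hits, key=lambda hit: hit[2])


for hit in annotate("The software giant SAP and the wind energy company Vestas"):
    print(hit)
```

In line with the second requirement on the slide, a real system would fill such a mapping from an external source rather than by hand.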
  15. Classification Methods
     • 3. Classification techniques
       • Break down the IE task into a set of binary classification tasks
     [Diagram labels: classifier features: pos, semTag; classes: c1, c2, .., cn]
  16. Classification Methods
     • Classical
       • misclassification does not consider structure! (equal cost 1/6)
     [Diagram labels: Galway; Germany; DERI; Siemens; GE; Ireland; Munich; CITEC; Tesco; Claddagh; Country; City; SW; Energy; Industry; Location]
  17. Classification Methods
     • Structure aware
       • Classifier should consider taxonomy structure!
       • W1,6 = 3
     [Diagram labels: Galway; Germany; Siemens; GE; Ireland; Munich; CITEC; Tesco; Claddagh; DERI]
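
One way to make a misclassification cost depend on the taxonomy, as the two slides above suggest, is to charge the path length between the true and the predicted class. The sketch below is an assumption-laden illustration: the `PARENT` taxonomy is invented from the class names on the slides, and path length is just one possible cost, not necessarily the weighting the slides denote by W.

```python
# Hypothetical taxonomy (child -> parent), loosely based on the slide labels.
PARENT = {
    "Location": "Thing",
    "Industry": "Thing",
    "Country": "Location",
    "City": "Location",
    "SW": "Industry",
    "Energy": "Industry",
}


def ancestors(node):
    """Return the path from node up to the root, node included."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def flat_cost(true_cls, predicted_cls):
    """Classical setting: every error costs the same."""
    return 0 if true_cls == predicted_cls else 1


def tree_distance(true_cls, predicted_cls):
    """Structure-aware setting: number of edges between the two classes."""
    path_true = ancestors(true_cls)
    path_pred = ancestors(predicted_cls)
    depth_in_pred = {node: i for i, node in enumerate(path_pred)}
    for i, node in enumerate(path_true):
        if node in depth_in_pred:
            # First shared ancestor: edges up from one side plus edges down the other.
            return i + depth_in_pred[node]
    raise ValueError("classes share no common ancestor")


print(flat_cost("City", "Country"), tree_distance("City", "Country"))
print(flat_cost("City", "Energy"), tree_distance("City", "Energy"))
```

With the flat cost both errors count the same, while the tree distance penalises confusing a city with an industry more than confusing it with a country.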
  18. Other methods
     • 4. Partial parse trees
       • TACITUS, SMES, LTAG
     • 5. Analyze structured data
       • Wikipedia Infoboxes
     • 6. Web querying
       • C-PANKOW
       • “Towards the self annotating web”
  19. Technologies used in implementation
     • Shallow NLP (GATE, SProUT, StanfordNLP)
       • POS, sentence splitting, regular expression
     • Semantic lexicons (WordNet, GermaNet)
       • synonym, meronym, hypernym
     • Semantic Annotation (OCAT, iDocument, PIMO)
     Missing
     • Terminological tools (UMLS, bio terminologies)
     • Thesauri, translation memory
  20. Data sets & evaluation
     Data sets (corpora)
       1) Message Understanding Conference (MUC-7)
       2) Automatic Content Extraction (ACE)
       • => more on classical IR, IE, NLP tracks
       • => no data set with given semantics (ontology)
     Evaluation
       • Precision & recall
       • Only used for population task
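
For the population task, precision and recall from the slide above can be computed over sets of extracted instance assertions. The sketch below assumes an assertion is an (instance, class) pair and uses invented example data; neither comes from MUC-7 or ACE.

```python
def precision_recall(extracted, gold):
    """Compare extracted (instance, class) pairs against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Convention chosen for this sketch: an empty set yields a score of 0.0.
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example data for illustration only.
extracted = [("Galway", "City"), ("DERI", "Building"), ("Tesco", "City")]
gold = [("Galway", "City"), ("DERI", "Building"),
        ("Siemens", "Company"), ("SAP", "Company")]

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```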
  21. Recent Open IE argument
     • Con: Weikum, From Information to Knowledge - Harvest Web Resources for IE
       • Disambiguation
       • NL relations are not well defined (well defined arguments)
     • Pro: Weld, Using Wikipedia to Bootstrap Open IE
       • Relation targeted: learn extractor per relation -> lower recall
       • Structural targeted: general extraction engine -> lower precision
  22. Conclusion and Outlook
     • No established / agreed methods yet
       • Is OBIE also ontology learning?
       • Data sets
       • Methods for best extractors
     • Semantic Web contribution?
       • e.g. Gazetteers from DBpedia
     • Cross-lingual OBIE -> CLOBIE
  23. References
     [1] Wimalasuriya, Dou, Ontology-based Information Extraction: An Introduction and Survey of Current Approaches, Journal of Information Science, June 2010
     [2] Buitelaar et al., Towards Linguistically Grounded Ontologies, ESWC, Springer, 200
     [3] Weikum et al., From Information to Knowledge: Harvesting Entities and Relationships from Web Sources, Principles of Database Systems (PODS), 2010
     [4] Weld et al., Using Wikipedia to Bootstrap Open Information Extraction, SIGMOD Record, 2008