Reading Course
                Omer Gunes, Somerville College
               Supervisor : Professor Georg Gottlob




Article 1: “Ontology Based Information Extraction”
Article 2: “YAGO : Yet Another Great Ontology”
Article 3: “SOFIE : A Self-Organizing Framework for Information
Extraction”
                     D.Phil. in Computer Science
             Computing Laboratory, University of Oxford
Definitions
Information Extraction
    Russell and Norvig;
●   Information Extraction
     ●   is automatically retrieving certain types of information from NL texts
     ●   aims
          –   processing natural language texts
          –   retrieving occurrences of
                ● a particular class of objects
                ● relationships between these objects


     ●   lies between
          –   Information retrieval systems
          –   Text understanding systems
    Riloff;
●   Information Extraction is a form of NLP in which certain types of information must be
     ●   recognized
     ●   extracted from text
Definitions
OBIE : Ontology Based Information Extraction
●   OBIE has recently emerged as a subfield of IE.
●   An OBIE system
    ●   Processes
         – Unstructured
         – Semi-structured

           NL text through a mechanism guided by ontologies
    ●   Extracts certain types of information
    ●   Presents the output using ontologies
●    An OBIE system is a set of information extractors each
    extracting
    ●   Individuals for a class
    ●   Property values for a property
OBIE
Key Characteristics of OBIE Systems
●   Process unstructured natural language text
    e.g. text files
●   Process semi-structured natural language text
    e.g. web pages using a particular template, such as Wikipedia
    pages
●   Present the output using ontologies
    ●   The use of a formal ontology as one of the system inputs
        and the target output
    ●   Constructing the ontology to be used through the IE process
●   Use an IE process guided by an ontology
    ●   The IE process guided by the ontology to extract things such
        as classes, properties and instances
OBIE
Potential of OBIE Systems
●   Automatically processing the information contained in natural
    language text
    ●   A large fraction of the information contained in the WWW
        takes the form of natural language text
    ●   Around 80% of the data of a typical corporation is in
        natural language
●   Creating semantic contents for the Semantic Web
    ●   The success of the semantic web relies heavily on the existence
        of semantic contents that can be processed by software
        agents
        The creation of such contents is quite slow
    ●   Automatic meta-data creation would be the snowball to
        unleash an avalanche of metadata through the web.
OBIE
Common Architecture of an OBIE System
         [Architecture diagram] A User poses queries to a Query Answering
         System backed by a Knowledge Base / Database. A Preprocessor feeds
         the Text Input to the Information Extraction Module, which is
         supported by a Semantic Lexicon and guided by the Ontology; the
         Ontology itself is maintained through an Ontology Editor and an
         Ontology Generator, and the Extracted Information is stored in the
         Knowledge Base / Database.
Classification of OBIE Systems
Information Extraction Methods
●   Linguistic Rules
    ●   They are represented by regular expressions.
        e.g. (watched|seen) <NP>
    ●   Regular expressions are combined with the elements of ontologies such
        as classes and properties.
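A linguistic rule of this kind can be sketched as a plain regular expression (a toy simplification: the noun phrase is approximated by a capitalized word sequence, and the mapping to a class such as Movie is hypothetical):

```python
import re

# Toy linguistic rule "(watched|seen) <NP>", with <NP> approximated
# by a run of capitalized words. Purely illustrative.
RULE = re.compile(r"\b(?:watched|seen)\s+((?:[A-Z]\w+\s?)+)")

def extract_watched(text):
    """Return candidate objects of 'watched'/'seen'; a real OBIE system
    would map each match to an instance of an ontology class."""
    return [m.group(1).strip() for m in RULE.finditer(text)]

print(extract_watched("Yesterday we watched Citizen Kane at home."))
```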
●   Gazetteer Lists
    ●   Relies on finite-state automata
    ●   Recognizes individual words or phrases instead of patterns
    ●   Consists of a list of words or phrases that identify individual entities of a
        particular category
        e.g. the states of US, the countries of the world
        2 conditions to construct a gazetteer list
    ●   Specify exactly what is being identified by the gazetteer
    ●   Specify where the information for the gazetteer list was obtained from
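A minimal gazetteer matcher can be sketched as below; the phrase list and its source are invented placeholders, and a real system would compile the list into a finite-state automaton for speed:

```python
# Toy gazetteer recording the two conditions above: what it identifies
# and where the list was obtained from (both hypothetical here).
US_STATES = {
    "phrases": {"texas", "ohio", "new york"},
    "identifies": "US state",
    "source": "hypothetical hand-compiled list",
}

def gazetteer_match(text, gazetteer, max_len=3):
    """Scan token windows of up to max_len words and report gazetteer hits."""
    tokens = text.lower().replace(",", " ").split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            phrase = " ".join(tokens[i:j])
            if phrase in gazetteer["phrases"]:
                hits.append((phrase, gazetteer["identifies"]))
    return hits

print(gazetteer_match("She moved from Ohio to New York last year", US_STATES))
```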
Classification of OBIE Systems
Information Extraction Methods
●   Supervised classification techniques for Information Extraction
    ●   Support Vector Machines
    ●   Maximum Entropy Models
    ●   Decision Trees
    ●   Hidden Markov Models
    ●   Conditional Random Fields
●   Classifiers are trained to identify different components of an ontology such
    as instances and property values
    e.g. Kylin, a tool developed at the University of Washington, uses
    ●   a CRF model to identify attributes within a sentence
    ●   an ME model to predict which attribute values are present in a sentence.
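The trained-classifier idea can be illustrated with a tiny Naive Bayes sentence classifier in pure Python (a stand-in sketch with invented training data; Kylin itself uses CRF and maximum-entropy models, not Naive Bayes):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (token_list, label). Returns a simple NB model."""
    priors, likelihood, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in samples:
        priors[label] += 1
        likelihood[label].update(tokens)
        vocab.update(tokens)
    return priors, likelihood, vocab

def classify(tokens, model):
    """Pick the label with the highest smoothed log-probability."""
    priors, likelihood, vocab = model
    total = sum(priors.values())
    def logp(label):
        lp = math.log(priors[label] / total)
        denom = sum(likelihood[label].values()) + len(vocab)  # Laplace smoothing
        return lp + sum(math.log((likelihood[label][t] + 1) / denom) for t in tokens)
    return max(priors, key=logp)

# Toy task: does a sentence carry a birth-date attribute or not?
model = train([
    ("he was born in 1879 in ulm".split(), "birth_date"),
    ("she was born on 12 march".split(), "birth_date"),
    ("he wrote a famous book".split(), "other"),
    ("the band released an album".split(), "other"),
])
print(classify("einstein was born in march".split(), model))
```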
Classification of OBIE Systems
Analyzing HTML/XML tags
●   Some OBIE systems use html or xml pages as input
    They can extract certain types of information from tables present in html
    pages.
    e.g. The SOBA system, developed at the Karlsruhe Institute of Technology,
       extracts information from HTML tables into a knowledge base that uses
       F-Logic.
Classification of OBIE Systems
Ontology Construction and Update
    2 approaches
●   Considering the ontology as an input to the system.
    Ontology can be constructed manually.
    An “off-the-shelf” ontology can also be used.
    Systems adopting this approach :
    ●   SOBA
    ●   KIM
    ●   PANKOW
●   The paradigm of open information extraction, in which an ontology is
    constructed as a part of the OBIE process.
    Systems adopting this approach :
    ●   Text-to-Onto
    ●   Kylin
Classification of OBIE Systems
Summary of the classification of OBIE Systems
Classification of OBIE Systems
Tools used
  Shallow NLP tools
   ●   GATE
   ●   SproUT
   ●   Stanford NLP Group
   ●   Center For Intelligent Information Retrieval (CIIR) at the Univ. of Massachusetts
   ●   Saarbrücken Message Extracting System
  Semantic lexicons
   ●   WordNet
   ●   GermaNet for German
   ●   Hindi WordNet for Hindi
  Ontology Editors
   ●   Protégé
   ●   OntoEdit
YAGO
Introduction
●   A lightweight & extensible ontology with high coverage & quality.
●   1 million entities
    5 million facts automatically extracted from Wikipedia unified with WordNet
●   A carefully designed combination of heuristic & rule-based methods
●   Contributions:
    A major step beyond WordNet both in quality and in quantity
    Empirical evaluation shows fact correctness of 95%.
●   It is based on a clean model which is decidable, extensible and compatible
    with RDFS.
YAGO
Background & Motivation
  The need for
  a huge ontology with knowledge from several sources
  an ontology of high quality with accuracy close to 100 percent
  an ontology that comprises not only concepts in the style of WordNet but also
  named entities like
   ●   People
   ●   Organizations
   ●   Books
   ●   Songs
   ●   Products
  An ontology has to be
   ●   Extensible
   ●   Easily re-usable
   ●   Application independent
YAGO
Yet Another Great Ontology
●   YAGO is based on Wikipedia's category pages.
    Drawback : Wikipedia offers hardly any hierarchy of concepts
    In contrast : WordNet provides a clean & carefully assembled hierarchy of
    thousands of concepts
    Problem : There is no mapping between Wikipedia and WordNet concepts
    Proposal : New techniques to link them with near-perfect accuracy
●   Contribution : Accomplishing the unification between WordNet and facts from
    Wikipedia with an accuracy of 97%.
    A data model which is based on entities & binary relations.
●   General properties of relations and relations between relations can also be
    expressed.
●   It is designed to be extendable by other resources
     ●   Other high quality resources
     ●   Domain specific extensions
     ●   Data gathered through Information Extraction
YAGO
Model
    The state-of-the art formalism in knowledge representation
●   OWL-Full : It can express properties of relations, but it is undecidable.
●   OWL-Lite & OWL-DL : They are decidable, but can not express relations between facts
●   RDFS : It can express relations between facts, but it provides only very primitive
    semantics (e.g. does not know transitivity)
●   YAGO Model : It is an extension of RDFS
    All objects are represented as entities.
    e.g. AlbertEinstein
    Two entities can stand in a relation.
    e.g. AlbertEinstein HASWONPRIZE NobelPrize
    Numbers, dates, strings and other literals are represented as entities as well.
    e.g. AlbertEinstein BORNINYEAR 1879
YAGO
Model
●   Words are entities as well.
    e.g. “Einstein” MEANS AlbertEinstein
        “Einstein” MEANS AlfredEinstein
●   Similar entities are grouped into classes.
    Each entity is an instance of at least one class. This is expressed by the TYPE
    relation.
    e.g. AlbertEinstein TYPE physicist
●   Classes are also entities
    Each class is an instance of a class called class
    Classes are arranged in a taxonomic hierarchy, expressed by the subClassOf
    relation.
    e.g. physicist subClassOf scientist
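Since subClassOf is transitive, membership in a superclass can be derived by walking the hierarchy upward; a minimal sketch over a toy taxonomy (the edges below are illustrative, not YAGO's actual data):

```python
# Toy subClassOf chain: physicist subClassOf scientist subClassOf person.
SUBCLASS_OF = {
    "physicist": "scientist",
    "scientist": "person",
}

def superclasses(cls):
    """Walk the subClassOf chain upward (transitive closure for a chain)."""
    out = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        out.append(cls)
    return out

def is_a(entity_type, cls):
    """TYPE combined with the transitive subClassOf relation."""
    return cls == entity_type or cls in superclasses(entity_type)

print(is_a("physicist", "person"))  # AlbertEinstein TYPE physicist
```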
YAGO
Model
●   Relations are entities as well.
    The properties of relations can be represented within the model.
    e.g. subClassOf TYPE transitiveRelation
●   A fact is a triple of
     ●   an entity
     ●   a relation
     ●   an entity
●   The two entities are called the arguments of the fact.
    Each fact is given a fact identifier, which is also an entity.
    e.g. if #1 is the identifier of the fact (AlbertEinstein, BORNINYEAR, 1879),
    then #1 FOUNDIN http://wikipedia.org/Einstein
●   Common entities are entities which are neither facts nor relations.
    Common entities that are not classes are called individuals.
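Fact identifiers (reification) can be sketched as a table mapping each identifier to its triple, so that facts about facts, such as FOUNDIN provenance, become ordinary triples:

```python
# Identifiers are entities, so a fact can be an argument of another fact.
facts = {
    "#1": ("AlbertEinstein", "BORNINYEAR", "1879"),
    "#2": ("#1", "FOUNDIN", "http://wikipedia.org/Einstein"),
}

def provenance(fact_id):
    """Return all sources recorded via FOUNDIN for a fact identifier."""
    return [obj for (subj, rel, obj) in facts.values()
            if subj == fact_id and rel == "FOUNDIN"]

print(provenance("#1"))
```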
YAGO
Ontology
●   C : a finite set of common entities
    R : a finite set of relation names
    I : a finite set of fact identifiers
    A YAGO ontology y is an injective & total function

         y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)

●   The set of relation names R for any YAGO ontology must contain at least :
     ●    Type
     ●    subClassOf
     ●    Domain
     ●    Range
     ●    subRelationOf
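The definition above makes a YAGO ontology a total, injective map from fact identifiers to triples; as a sketch, a Python dict over all of I models totality, and injectivity means no two identifiers map to the same triple (structure check only, not a full well-formedness check):

```python
def is_yago_function(y, I):
    """Check that y is a total, injective map from fact identifiers I
    to triples over (I ∪ C ∪ R) × R × (I ∪ C ∪ R)."""
    total = all(i in y for i in I)              # defined on every identifier
    injective = len(set(y.values())) == len(y)  # no triple mapped to twice
    return total and injective

y = {
    "#1": ("AlbertEinstein", "bornInYear", "1879"),
    "#2": ("physicist", "subClassOf", "scientist"),
}
print(is_yago_function(y, {"#1", "#2"}))
```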
YAGO
Ontology
 The set of class names C for any YAGO ontology must contain at least :
  ●   Entity
  ●   Class
  ●   Relation
  ●   acyclicTransitiveRelation
  ●   Classes for all literals
YAGO
Knowledge Extraction – The TYPE Relation
●   Each Wikipedia page title is a candidate to become an individual in YAGO
●   The Wikipedia Category System :
    Different types of categories
     ●   Conceptual categories
         e.g. Naturalized citizens of the United States
     ●   Administrative categories
         e.g. Articles with unsourced statements
     ●   Relational categories
         e.g. 1879 births
     ●   Thematic vicinity categories
         e.g. Physics
YAGO
Knowledge Extraction – The MEANS Relation
●   WordNet reveals the meaning of words by its synsets.
    “urban center” & “metropolis” belong to the “city” synset
    (“metropolis”, MEANS, city)
●   Wikipedia's redirect system gives alternative names for entities.
    (“Einstein, Albert”, MEANS, AlbertEinstein)
●   The relations GivenNameOf & FamilyNameOf are sub-relations of MEANS
    A name parser is used to identify and to decompose person names.
    (“Einstein”, FamilyNameOf, AlbertEinstein)
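The MEANS relation and its name sub-relations can be sketched with a trivial parser for "Given Family" names (a hypothetical heuristic; YAGO's actual name parser is considerably more careful):

```python
# MEANS maps surface strings to entities; a string can mean several entities.
MEANS = {
    "Einstein, Albert": {"AlbertEinstein"},  # from a Wikipedia redirect
    "metropolis": {"city"},                  # from a WordNet synset
}

def add_name_facts(full_name, entity, means):
    """Decompose 'Given Family' into GivenNameOf/FamilyNameOf facts;
    both are sub-relations of MEANS, so MEANS is updated too."""
    given, _, family = full_name.partition(" ")
    facts = [(given, "GivenNameOf", entity), (family, "FamilyNameOf", entity)]
    for name, _, ent in facts:
        means.setdefault(name, set()).add(ent)  # sub-relation implies MEANS
    return facts

add_name_facts("Albert Einstein", "AlbertEinstein", MEANS)
print(MEANS["Einstein"])
```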
YAGO
Evaluation – Accuracy of YAGO
YAGO
Evaluation – Size of YAGO (facts)
YAGO
Evaluation – Size of YAGO (entities) and Size of
Other Ontologies
YAGO
Evaluation – Sample facts on YAGO
YAGO
Evaluation – Sample queries on YAGO
SOFIE
Background & Motivation
●   3 systems; YAGO, Kylin/KOG, DBpedia;
    ●   used IE methods for constructing large ontologies.
    ●   leveraged high-quality hand crafted sources with latent knowledge for
        collecting individual entities & facts.
    ●   combined their results with a taxonomic hierarchy such as WordNet &
        SUMO.
●   Empirical assessment has shown that these approaches have an accuracy > 95%.
●   They are close to the best hand-crafted rules.
●   The resulting ontologies
    ●   contain
         –   millions of entities
         –   10s of millions of facts
    ●   organized in a consistent manner by a transitive & acyclic subclass
        relation.
SOFIE
Background & Motivation
●   Next Stage : Expanding & maintaining automatically compiled
    ontologies as knowledge keeps evolving.
●   Wikipedia's semi-structured knowledge is huge but limited.
●   NL text sources such as news articles, biographies, scientific
    pubs., full text Wiki articles must be brought into scope.
●   The existing ~80% accuracy is unacceptable for an ontological
    knowledge base.
●   Key Idea : To leverage existing ontology for its own growth :
    ●   use trusted facts as a basis for generating good patterns.
    ●   scrutinize the resulting hypotheses with regard to their
        consistency with the already known facts.
SOFIE
Contribution
●   3 problems;
    ●   Pattern selection
    ●   Entity disambiguation
    ●   Consistency checking
●   Proposed Approach :
    ●   A unified model for ontology-oriented IE solving all 3 issues
        simultaneously.
    ●   They cast known facts, hypotheses for new facts, word-to-
        entity mappings, gathered sets of patterns, a configurable
        set of semantic constraints into a unified framework of
        logical clauses.
●   All 3 problems can be seen as a Weighted MAX-SAT problem,
    which is the task of identifying a maximal set of consistent clauses.
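Weighted MAX-SAT asks for a truth assignment maximizing the total weight of satisfied clauses; a minimal greedy sketch over clauses in (literal-list, weight) form is below (a toy heuristic for illustration only; SOFIE uses a far more elaborate algorithm):

```python
def greedy_weighted_maxsat(variables, clauses):
    """clauses: list of (literals, weight); a literal is (var, polarity).
    Fix each variable in turn to the value under which more total clause
    weight is satisfied, given the partial assignment so far."""
    assignment = {}
    for var in variables:
        def weight_if(value):
            total = 0.0
            for lits, w in clauses:
                for v, pol in lits:
                    if (v == var and pol == value) or assignment.get(v) == pol:
                        total += w   # clause satisfied by some literal
                        break
            return total
        assignment[var] = weight_if(True) >= weight_if(False)
    return assignment

# Toy instance: x is strongly supported; one clause then pushes y false.
clauses = [([("x", True)], 2.0),
           ([("x", False), ("y", False)], 1.5),
           ([("y", True)], 1.0)]
print(greedy_weighted_maxsat(["x", "y"], clauses))
```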
SOFIE
Contribution
    Implementation :
●   A new model for consistent growth of a large ontology.
●   A unified model for
    ●   Pattern selection
    ●   Entity disambiguation
    ●   Consistency checking
    ●   Identification of the best hypotheses for new facts
●   An efficient algorithm for the resulting Weighted MAX-SAT
    problem.
●   Experiments with a variety of real-life textual & semi-structured
    sources to demonstrate the scalability & high-accuracy of the
    approach.
SOFIE
Model
    Assumption :
●   An existing ontology
●   A given corpus of documents
    Statements :
●   A word in context (wic) is a pair of word & context.
●   The context of a word is simply the document in which the word
    appears. Thus, a wic is a pair of word & document. The notation
    being used is word@document.
●   Wics are entities as well.
●   Each statement can have an associated truth value of 1 or 0.
SOFIE
    Model
     Statements :
●    A prefix notation is used for statements, as SOFIE deals with relations of
     arbitrary arity.
         e.g. bornIn(Albert Einstein, Ulm)[1]
     A fact is a statement with truth value 1.
     A statement with an unknown truth value is called a hypothesis.
●    The ontology is considered as a set of facts.
         e.g. bornOnDate(AlbertEinstein, 1879-03-14)[1]
●    SOFIE will extract textual information from the corpus.
     Textual information takes the form of facts.
     One type of fact makes assertions about the number of times that a pattern
     occurred with two wics.

         e.g. patternOcc(“X went to school in Y”, Einstein@29, Germany@29)[1]
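Extracting patternOcc facts can be sketched as matching a pattern template against a document and recording the filler wics (word@document pairs); the slot-filling regex below is a deliberately crude assumption:

```python
import re

def pattern_occurrences(pattern, doc_id, text):
    """Instantiate the X and Y slots of a pattern against a document and
    emit patternOcc facts whose arguments are wics (word@document)."""
    regex = re.escape(pattern).replace("X", r"(\w+)", 1).replace("Y", r"(\w+)", 1)
    facts = []
    for m in re.finditer(regex, text):
        x_wic = f"{m.group(1)}@{doc_id}"
        y_wic = f"{m.group(2)}@{doc_id}"
        facts.append(("patternOcc", pattern, x_wic, y_wic))
    return facts

print(pattern_occurrences("X went to school in Y", "29",
                          "We know Einstein went to school in Germany."))
```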
SOFIE
    Model
     Statements :
●    Another type of fact states how likely it is, from a linguistic point of view, that a wic refers to a certain
     entity. This is called a “disambiguation prior”.

          e.g. disambiguation prior of the wic Elvis@29
               disambPrior(Elvis@29, Elvis Presley, 0.8)[1]
               disambPrior(Elvis@29, Elvis Costello, 0.2)[1]
●    Hypotheses that are formed by SOFIE
      ●   Can concern the disambiguation of wics,
          e.g. disambiguateAs(Java@DS, JavaProgrammingLanguage)[?]
      ●   Can hypothesize about whether a certain pattern expresses a certain relation,
          e.g. expresses(“X was born in Y”, bornInLocation)[?]
      ●   Can hypothesize about potential new facts
          e.g. developed(Microsoft, JavaProgrammingLanguage)[?]
●    Unifying Framework :
      ●   SOFIE unifies the domains of ontology & information extraction.
      ●   For SOFIE, there exist only statements.
      ●   SOFIE will try to figure out which hypotheses are likely to be true.

Weitere ähnliche Inhalte

Was ist angesagt?

Ontology mapping for the semantic web
Ontology mapping for the semantic webOntology mapping for the semantic web
Ontology mapping for the semantic webWorawith Sangkatip
 
Ontology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyOntology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyDebashisnaskar
 
ONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSKishan Patel
 
Ontology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreOntology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreAdriel Café
 
Knowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuseKnowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuseAndrea Nuzzolese
 
The Standardization of Semantic Web Ontology
The Standardization of Semantic Web OntologyThe Standardization of Semantic Web Ontology
The Standardization of Semantic Web OntologyMyungjin Lee
 
NCBO SPARQL Endpoint
NCBO SPARQL EndpointNCBO SPARQL Endpoint
NCBO SPARQL EndpointTrish Whetzel
 
A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications IJwest
 
Semantic Technologies in ST&DL
Semantic Technologies in ST&DLSemantic Technologies in ST&DL
Semantic Technologies in ST&DLAndrea Nuzzolese
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionKent State University
 
Ontology Engineering for Big Data
Ontology Engineering for Big DataOntology Engineering for Big Data
Ontology Engineering for Big DataKouji Kozaki
 
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...María Poveda Villalón
 
Ontology and its various aspects
Ontology and its various aspectsOntology and its various aspects
Ontology and its various aspectssamhati27
 

Was ist angesagt? (19)

Ontology mapping for the semantic web
Ontology mapping for the semantic webOntology mapping for the semantic web
Ontology mapping for the semantic web
 
Ontologies
OntologiesOntologies
Ontologies
 
Ontology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical StudyOntology and Ontology Libraries: a Critical Study
Ontology and Ontology Libraries: a Critical Study
 
Ontology matching
Ontology matchingOntology matching
Ontology matching
 
Ontology
OntologyOntology
Ontology
 
ONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESSONTOLOGY BASED DATA ACCESS
ONTOLOGY BASED DATA ACCESS
 
Ontology
Ontology Ontology
Ontology
 
Ontology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and moreOntology integration - Heterogeneity, Techniques and more
Ontology integration - Heterogeneity, Techniques and more
 
Knowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuseKnowledge Patterns for the Web: extraction, transformation, and reuse
Knowledge Patterns for the Web: extraction, transformation, and reuse
 
The Standardization of Semantic Web Ontology
The Standardization of Semantic Web OntologyThe Standardization of Semantic Web Ontology
The Standardization of Semantic Web Ontology
 
Oke
OkeOke
Oke
 
NCBO SPARQL Endpoint
NCBO SPARQL EndpointNCBO SPARQL Endpoint
NCBO SPARQL Endpoint
 
A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications
 
Semantic Technologies in ST&DL
Semantic Technologies in ST&DLSemantic Technologies in ST&DL
Semantic Technologies in ST&DL
 
Prolog final
Prolog finalProlog final
Prolog final
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
 
Ontology Engineering for Big Data
Ontology Engineering for Big DataOntology Engineering for Big Data
Ontology Engineering for Big Data
 
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
A Reuse-based Lightweight Method for Developing Linked Data Ontologies and Vo...
 
Ontology and its various aspects
Ontology and its various aspectsOntology and its various aspects
Ontology and its various aspects
 

Andere mochten auch

Diadem 0.1
Diadem 0.1Diadem 0.1
Diadem 0.1timfu
 
Web Data Extraction Como2010
Web Data Extraction Como2010Web Data Extraction Como2010
Web Data Extraction Como2010Giorgio Orsi
 
AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)Giorgio Orsi
 
Joint Repairs for Web Wrappers
Joint Repairs for Web WrappersJoint Repairs for Web Wrappers
Joint Repairs for Web WrappersGiorgio Orsi
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark StreamingP. Taylor Goetz
 

Andere mochten auch (8)

Diadem 0.1
Diadem 0.1Diadem 0.1
Diadem 0.1
 
Table Recognition
Table RecognitionTable Recognition
Table Recognition
 
Web Data Extraction Como2010
Web Data Extraction Como2010Web Data Extraction Como2010
Web Data Extraction Como2010
 
AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)AMBER WWW 2012 (Demonstration)
AMBER WWW 2012 (Demonstration)
 
diadem-vldb-2015
diadem-vldb-2015diadem-vldb-2015
diadem-vldb-2015
 
Joint Repairs for Web Wrappers
Joint Repairs for Web WrappersJoint Repairs for Web Wrappers
Joint Repairs for Web Wrappers
 
DIADEM WWW 2012
DIADEM WWW 2012DIADEM WWW 2012
DIADEM WWW 2012
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 

Ähnlich wie NLP in Web Data Extraction (Omer Gunes)

Building OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web toolsBuilding OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web toolsMelanie Courtot
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic webStanley Wang
 
How to model digital objects within the semantic web
How to model digital objects within the semantic webHow to model digital objects within the semantic web
How to model digital objects within the semantic webAngelica Lo Duca
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Databaseijbuiiir1
 
NIFSTD and NeuroLex: A Comprehensive Ontology Development Based on Multiple B...
NIFSTD and NeuroLex: A Comprehensive Ontology Development Based on Multiple B...NIFSTD and NeuroLex: A Comprehensive Ontology Development Based on Multiple B...
NIFSTD and NeuroLex: A Comprehensive Ontology Development Based on Multiple B...Neuroscience Information Framework
 
Representation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object modelRepresentation of ontology by Classified Interrelated object model
Representation of ontology by Classified Interrelated object modelMihika Shah
 
A Framework for Ontology Usage Analysis
A Framework for Ontology Usage AnalysisA Framework for Ontology Usage Analysis
A Framework for Ontology Usage AnalysisJamshaid Ashraf
 
SWSN UNIT-3.pptx we can information about swsn professional
SWSN UNIT-3.pptx we can information about swsn professionalSWSN UNIT-3.pptx we can information about swsn professional
SWSN UNIT-3.pptx we can information about swsn professionalgowthamnaidu0986
 
Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1iotest
 
A Comparative Study of Ontology building Tools in Semantic Web Applications
A Comparative Study of Ontology building Tools in Semantic Web Applications A Comparative Study of Ontology building Tools in Semantic Web Applications
A Comparative Study of Ontology building Tools in Semantic Web Applications dannyijwest
 
A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications A Comparative Study Ontology Building Tools for Semantic Web Applications
A Comparative Study Ontology Building Tools for Semantic Web Applications dannyijwest
 
OODBMS Concepts - National University of Singapore.pdf
OODBMS Concepts - National University of Singapore.pdfOODBMS Concepts - National University of Singapore.pdf
OODBMS Concepts - National University of Singapore.pdfssuserd5e338
 
Automatically Generating Wikipedia Articles: A Structure-Aware Approach
Automatically Generating Wikipedia Articles:  A Structure-Aware ApproachAutomatically Generating Wikipedia Articles:  A Structure-Aware Approach
Automatically Generating Wikipedia Articles: A Structure-Aware ApproachGeorge Ang
 
Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Integrating a Domain Ontology Development Environment and an Ontology Search ...
Integrating a Domain Ontology Development Environment and an Ontology Search ...Takeshi Morita
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISrathnaarul
 
Building a Semantic search Engine in a library
Building a Semantic search Engine in a libraryBuilding a Semantic search Engine in a library
Building a Semantic search Engine in a librarySEECS NUST
 
Generating Lexical Information for Terminology in a Bioinformatics Ontology
Generating Lexical Information for Terminologyin a Bioinformatics OntologyGenerating Lexical Information for Terminologyin a Bioinformatics Ontology
Generating Lexical Information for Terminology in a Bioinformatics OntologyHammad Afzal
 
Lecture: Semantic Word Clouds
Lecture: Semantic Word CloudsLecture: Semantic Word Clouds
Lecture: Semantic Word CloudsMarina Santini
 

Ähnlich wie NLP in Web Data Extraction (Omer Gunes) (20)

Building OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web toolsBuilding OBO Foundry ontology using semantic web tools
Building OBO Foundry ontology using semantic web tools
 
Ontologies and semantic web
Ontologies and semantic webOntologies and semantic web
Ontologies and semantic web
 
How to model digital objects within the semantic web
How to model digital objects within the semantic webHow to model digital objects within the semantic web
How to model digital objects within the semantic web
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Database
 
NIFSTD and NeuroLex: A Comprehensive Ontology Development Based on Multiple B...
NLP in Web Data Extraction (Omer Gunes)

  • 1. Reading Course
    Omer Gunes, Somerville College
    Supervisor: Professor Georg Gottlob
    Article 1: “Ontology Based Information Extraction”
    Article 2: “YAGO: Yet Another Great Ontology”
    Article 3: “SOFIE: A Self-Organizing Framework for Information Extraction”
    D.Phil. in Computer Science
    Computing Laboratory, University of Oxford
  • 2. Definitions: Information Extraction
    Russell and Norvig:
    ● Information Extraction
      ● is automatically retrieving certain types of information from NL texts
      ● aims at
        – processing natural language texts
        – retrieving occurrences of
          ● a particular class of objects
          ● relationships between these objects
      ● lies between
        – information retrieval systems
        – text understanding systems
    Riloff:
    ● Information Extraction is a form of NLP in which certain types of information must be
      ● recognized
      ● extracted from text
  • 3. Definitions: OBIE (Ontology-Based Information Extraction)
    ● OBIE has recently emerged as a subfield of IE.
    ● An OBIE system
      ● processes
        – unstructured
        – semi-structured
        NL text through a mechanism guided by ontologies
      ● extracts certain types of information
      ● presents the output using ontologies
    ● An OBIE system is a set of information extractors, each extracting
      ● individuals for a class
      ● property values for a property
  • 4. OBIE: Key Characteristics of OBIE Systems
    ● Process unstructured natural language text (e.g. text files) and
      semi-structured natural language text (e.g. web pages using a particular
      template, such as Wikipedia pages)
    ● Present the output using ontologies
      ● the use of a formal ontology as one of the system inputs and as the target output
      ● constructing the ontology to be used through the IE process
    ● Use an IE process guided by an ontology
      ● the IE process is guided by the ontology to extract things such as
        classes, properties and instances
  • 5. OBIE: Potential of OBIE Systems
    ● Automatically processing the information contained in natural language text
      ● a large fraction of the information on the WWW takes the form of natural language text
      ● around 80% of the data of a typical corporation are in natural language
    ● Creating semantic contents for the Semantic Web
      ● the success of the Semantic Web relies heavily on the existence of
        semantic contents that can be processed by software agents, yet the
        creation of such contents is quite slow
      ● automatic meta-data creation would be the snowball to unleash an
        avalanche of metadata through the web
  • 6. OBIE: Common Architecture of an OBIE System
    [Architecture diagram: a user interacts with the OBIE system through a
    query editor and an answer system; the system comprises an ontology
    generator, an information extraction module, a preprocessor and a semantic
    lexicon, operating over the text input and feeding an ontology/knowledge
    base and a database of extracted information.]
  • 7. Classification of OBIE Systems: Information Extraction Methods
    ● Linguistic rules
      ● represented by regular expressions, e.g. (watched|seen) <NP>
      ● regular expressions are combined with elements of ontologies such as
        classes and properties
    ● Gazetteer lists
      ● rely on finite-state automata
      ● recognize individual words or phrases instead of patterns
      ● consist of a set of words which identify individual entities of a
        particular category, e.g. the states of the US or the countries of the world
      ● two conditions to construct a gazetteer list:
        ● specify exactly what is being identified by the gazetteer
        ● specify where the information for the gazetteer list was obtained from
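To make the two rule-based methods above concrete, here is a minimal Python sketch. It is entirely our own toy, with made-up gazetteer entries and a crude approximation of <NP> as a run of capitalized words; it is not code from any actual OBIE system.

```python
import re

# Hypothetical gazetteer: surface strings mapped to an ontology class.
GAZETTEER = {
    "texas": "USState",
    "california": "USState",
    "germany": "Country",
}

# Linguistic rule in the spirit of (watched|seen) <NP>, with <NP>
# approximated by one or more capitalized words.
RULE = re.compile(r"\b(?:watched|seen)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

def gazetteer_lookup(text):
    """Recognize individual words (not patterns) listed in the gazetteer."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        cls = GAZETTEER.get(token.lower())
        if cls:
            hits.append((token, cls))
    return hits

def rule_match(text):
    """Apply the regular-expression rule and return the captured NPs."""
    return RULE.findall(text)

text = "We have seen Casablanca while travelling through Texas and Germany."
print(gazetteer_lookup(text))  # [('Texas', 'USState'), ('Germany', 'Country')]
print(rule_match(text))        # ['Casablanca']
```

Note how the two methods differ exactly as the slide says: the gazetteer matches fixed words regardless of context, while the rule matches a pattern around the entity.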
  • 8. Classification of OBIE Systems: Information Extraction Methods
    ● Supervised classification techniques for Information Extraction
      ● Support Vector Machines
      ● Maximum Entropy Models
      ● Decision Trees
      ● Hidden Markov Models
      ● Conditional Random Fields
    ● Classifiers are trained to identify different components of an ontology,
      such as instances and property values, e.g. in Kylin, a tool developed at
      the University of Washington, they use
      ● a CRF model to identify attributes within a sentence
      ● an ME model to predict which attribute values are present in a sentence
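Kylin's actual CRF and ME models are beyond a slide-sized example, but the general idea of training a classifier to spot sentences that carry a given attribute can be sketched with a tiny perceptron over bag-of-words features. This is our own toy with invented training sentences, standing in for the supervised techniques listed above.

```python
def featurize(sentence):
    """Bag-of-words feature set: the lowercased tokens of the sentence."""
    return {w.lower() for w in sentence.split()}

def train(examples, epochs=10):
    """Classic perceptron updates: bump weights on misclassified sentences."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            feats = featurize(sentence)
            score = sum(weights.get(f, 0.0) for f in feats)
            pred = 1 if score > 0 else 0
            if pred != label:
                delta = 1.0 if label == 1 else -1.0
                for f in feats:
                    weights[f] = weights.get(f, 0.0) + delta
    return weights

def predict(weights, sentence):
    return 1 if sum(weights.get(f, 0.0) for f in featurize(sentence)) > 0 else 0

# Invented examples: does the sentence express a "birth date" attribute?
examples = [
    ("Einstein was born on 14 March 1879", 1),
    ("He was born in 1879", 1),
    ("He developed the theory of relativity", 0),
    ("The theory was published in 1915", 0),
]
w = train(examples)
print(predict(w, "Bohr was born on 7 October 1885"))  # 1
```

Real systems replace the bag-of-words features with richer linguistic features and the perceptron with CRFs or maximum-entropy models, but the train-then-label workflow is the same.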
  • 9. Classification of OBIE Systems: Analyzing HTML/XML Tags
    ● Some OBIE systems use HTML or XML pages as input; they can extract
      certain types of information from tables present in HTML pages,
      e.g. the SOBA system, developed at the Karlsruhe Institute of Technology,
      extracts information from HTML tables into a knowledge base that uses F-Logic.
  • 10. Classification of OBIE Systems: Ontology Construction and Update
    Two approaches:
    ● Considering the ontology as an input to the system: the ontology can be
      constructed manually, or an “off-the-shelf” ontology can be used.
      Systems adopting this approach:
      ● SOBA
      ● KIM
      ● PANKOW
    ● The paradigm of open information extraction, which constructs an ontology
      as a part of the OBIE process. Systems adopting this approach:
      ● Text-to-Onto
      ● Kylin
  • 11. Classification of OBIE Systems
    [Table: summary of the classification of OBIE systems]
  • 12. Classification of OBIE Systems: Tools Used
    Shallow NLP tools
    ● GATE
    ● SProUT
    ● Stanford NLP Group tools
    ● tools from the Center for Intelligent Information Retrieval (CIIR) at the
      University of Massachusetts
    ● Saarbrücken Message Extraction System (SMES)
    Semantic lexicons
    ● WordNet
    ● GermaNet for German
    ● Hindi WordNet for Hindi
    Ontology editors
    ● Protégé
    ● OntoEdit
  • 13. YAGO: Introduction
    ● A lightweight & extensible ontology with high coverage & quality
    ● 1 million entities and 5 million facts, automatically extracted from
      Wikipedia and unified with WordNet
    ● A carefully designed combination of heuristic & rule-based methods
    ● Contributions:
      ● a major step beyond WordNet, both in quality and in quantity
      ● empirical evaluation puts fact correctness at 95%
    ● Based on a clean model which is decidable, extensible and compatible with RDFS
  • 14. YAGO: Background & Motivation
    The need for a huge ontology
    ● with knowledge from several sources
    ● of high quality, with accuracy close to 100 percent
    ● comprising concepts not only in the style of WordNet but also named entities, like
      ● people
      ● organizations
      ● books
      ● songs
      ● products
    An ontology has to be
    ● extensible
    ● easily re-usable
    ● application independent
  • 15. YAGO: Yet Another Great Ontology
    ● YAGO is based on Wikipedia's category pages.
      Drawback: Wikipedia provides barely any hierarchy of categories.
      In contrast: WordNet provides a clean & carefully assembled hierarchy of
      thousands of concepts.
      Problem: there is no mapping between Wikipedia and WordNet concepts.
      Proposal: new techniques to link them with near-perfect accuracy.
    ● Contributions:
      ● accomplishing the unification between WordNet and facts from Wikipedia
        with an accuracy of 97%
      ● a data model based on entities & binary relations, in which general
        properties of relations and relations between relations can also be expressed
      ● designed to be extendable by other resources:
        ● other high-quality resources
        ● domain-specific extensions
        ● data gathered through Information Extraction
  • 16. YAGO: Model
    The state-of-the-art formalisms in knowledge representation:
    ● OWL-Full: can express properties of relations, but is not decidable
    ● OWL-Lite & OWL-DL: cannot express relations between facts
    ● RDFS: can express relations between facts, but provides only very
      primitive semantics (e.g. does not know transitivity)
    ● The YAGO model is an extension of RDFS:
      ● all objects are represented as entities, e.g. AlbertEinstein
      ● two entities can stand in a relation,
        e.g. AlbertEinstein HASWONPRIZE NobelPrize
      ● numbers, dates, strings and other literals are represented as entities
        as well, e.g. AlbertEinstein BORNINYEAR 1879
  • 17. YAGO: Model
    ● Words are entities as well,
      e.g. “Einstein” MEANS AlbertEinstein
           “Einstein” MEANS AlfredEinstein
    ● Similar entities are grouped into classes. Each entity is an instance of
      at least one class; this is expressed by the TYPE relation,
      e.g. AlbertEinstein TYPE physicist
    ● Classes are also entities. Each class is an instance of a class called
      class. Classes are arranged in a taxonomic hierarchy, expressed by the
      subClassOf relation, e.g. physicist subClassOf scientist
  • 18. YAGO: Model
    ● Relations are entities as well, so the properties of relations can be
      represented within the model, e.g. subClassOf TYPE transitiveRelation
    ● A fact is a triple of
      ● an entity
      ● a relation
      ● an entity
    ● The two entities are called the arguments of the fact. Each fact is given
      a fact identifier, which is also an entity.
      Assumption: if #1 is the identifier of the fact
      (AlbertEinstein, BORNINYEAR, 1879), then
      #1 FOUNDIN http://wikipedia.org/Einstein
    ● Common entities are entities which are neither facts nor relations.
      Common entities that are not classes are called individuals.
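The model described on the last three slides can be sketched in a few lines of Python (our own simplification; helper names like `add` and `superclasses` are ours): facts are identified triples, a fact identifier is itself an entity that may appear as an argument of another fact, and subClassOf is evaluated transitively.

```python
facts = {}  # fact id -> (arg1, relation, arg2)

def add(arg1, rel, arg2):
    """Store a fact and return its fact identifier (itself an entity)."""
    fid = "#%d" % (len(facts) + 1)
    facts[fid] = (arg1, rel, arg2)
    return fid

f1 = add("AlbertEinstein", "BORNINYEAR", 1879)
add(f1, "FOUNDIN", "http://wikipedia.org/Einstein")  # a fact about a fact
add("physicist", "subClassOf", "scientist")
add("scientist", "subClassOf", "person")
add("AlbertEinstein", "TYPE", "physicist")

def superclasses(cls):
    """Transitive closure of subClassOf (subClassOf TYPE transitiveRelation)."""
    out, frontier = set(), {cls}
    while frontier:
        nxt = {o for (s, r, o) in facts.values()
               if r == "subClassOf" and s in frontier}
        frontier = nxt - out
        out |= nxt
    return out

def instance_of(entity, cls):
    """TYPE membership, lifted through the taxonomic hierarchy."""
    direct = {o for (s, r, o) in facts.values() if r == "TYPE" and s == entity}
    return cls in direct or any(cls in superclasses(d) for d in direct)

print(instance_of("AlbertEinstein", "person"))  # True
```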
  • 19. YAGO: Ontology
    ● Let C be a finite set of common entities, R a finite set of relation
      names, and I a finite set of fact identifiers. A YAGO ontology y is an
      injective & total function
      y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)
    ● The set of relation names R for any YAGO ontology must contain at least:
      ● type
      ● subClassOf
      ● domain
      ● range
      ● subRelationOf
  • 20. YAGO: Ontology
    The set of class names C for any YAGO ontology must contain at least:
    ● Entity
    ● Class
    ● Relation
    ● aCyclicTransitiveRelation
    ● classes for all literals
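The definition above can be checked mechanically. Here is a minimal sketch (our own helper, not YAGO code; for brevity it checks only the five required relation names, omitting the required class names and literal classes): y must be total on I, injective, and map every fact identifier to a triple over (I ∪ C ∪ R) × R × (I ∪ C ∪ R).

```python
REQUIRED_RELATIONS = {"type", "subClassOf", "domain", "range", "subRelationOf"}

def is_yago_ontology(y, I, C, R):
    """Check y : I -> (I ∪ C ∪ R) × R × (I ∪ C ∪ R), injective and total."""
    if not REQUIRED_RELATIONS <= R:
        return False
    universe = I | C | R
    if set(y) != I:                      # y is total on I
        return False
    if len(set(y.values())) != len(y):   # y is injective
        return False
    return all(a in universe and r in R and b in universe
               for (a, r, b) in y.values())

# Tiny example ontology (class-name requirements elided for brevity).
I = {"#1", "#2"}
C = {"AlbertEinstein", "physicist", "Entity"}
R = set(REQUIRED_RELATIONS)
y = {
    "#1": ("AlbertEinstein", "type", "physicist"),
    "#2": ("physicist", "subClassOf", "Entity"),
}
print(is_yago_ontology(y, I, C, R))  # True
```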
  • 21. YAGO: Knowledge Extraction – The TYPE Relation
    ● Each Wikipedia page title is a candidate to become an individual in YAGO
    ● The Wikipedia category system distinguishes different types of categories:
      ● conceptual categories, e.g. Naturalized citizens of the United States
      ● administrative categories, e.g. Articles with unsourced statements
      ● relational categories, e.g. 1879 births
      ● thematic vicinity categories, e.g. Physics
  • 22. YAGO: Knowledge Extraction – The MEANS Relation
    ● WordNet reveals the meaning of words by its synsets:
      “urban center” & “metropolis” belong to the “city” synset, so
      (“metropolis”, MEANS, city)
    ● The Wikipedia redirect system gives alternative names for entities:
      (“Einstein, Albert”, MEANS, AlbertEinstein)
    ● The relations GivenNameOf & FamilyNameOf are sub-relations of MEANS.
      A name parser is used to identify and decompose person names:
      (“Einstein”, FamilyNameOf, AlbertEinstein)
  • 24. YAGO Evaluation – Size of YAGO (facts)
  • 25. YAGO Evaluation – Size of YAGO (entities) and Size of Other Ontologies
  • 26. YAGO Evaluation – Sample facts on YAGO
  • 27. YAGO Evaluation – Sample queries on YAGO
  • 28. SOFIE: Background & Motivation
    ● Three systems – YAGO, Kylin/KOG, DBpedia –
      ● used IE methods for constructing large ontologies
      ● leveraged high-quality hand-crafted sources with latent knowledge for
        collecting individual entities & facts
      ● combined their results with a taxonomic hierarchy such as WordNet & SUMO
    ● Empirical assessment has shown that these approaches have an accuracy
      above 95%; they are close to the best hand-crafted rules.
    ● The resulting ontologies
      ● contain
        – millions of entities
        – tens of millions of facts
      ● are organized in a consistent manner by a transitive & acyclic subclass relation
  • 29. SOFIE: Background & Motivation
    ● Next stage: expanding & maintaining automatically compiled ontologies as
      knowledge keeps evolving.
    ● Wikipedia's semi-structured knowledge is huge but limited.
    ● NL text sources such as news articles, biographies, scientific
      publications and full-text Wikipedia articles must be brought into scope.
    ● The existing 80% accuracy is unacceptable for an ontological knowledge base.
    ● Key idea: leverage the existing ontology for its own growth:
      ● use trusted facts as a basis for generating good patterns
      ● scrutinize the resulting hypotheses with regard to their consistency
        with the already known facts
  • 31. SOFIE: Contribution
    ● Three problems:
      ● pattern selection
      ● entity disambiguation
      ● consistency checking
    ● Proposed approach:
      ● a unified model for ontology-oriented IE, solving all three issues
        simultaneously
      ● known facts, hypotheses for new facts, word-to-entity mappings, gathered
        sets of patterns and a configurable set of semantic constraints are cast
        into a unified framework of logical clauses
      ● all three problems can then be seen as a Weighted MAX-SAT problem,
        i.e. the task of identifying a maximal consistent set of clauses
  • 32. SOFIE: Contribution
    Implementation:
    ● a new model for consistent growth of a large ontology
    ● a unified model for
      ● pattern selection
      ● entity disambiguation
      ● consistency checking
      ● identification of the best hypotheses for new facts
    ● an efficient algorithm for the resulting Weighted MAX-SAT problem
    ● experiments with a variety of real-life textual & semi-structured sources
      to demonstrate the scalability & high accuracy of the approach
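SOFIE solves the resulting Weighted MAX-SAT problem with an efficient approximation algorithm; to illustrate what the problem itself looks like, here is a brute-force toy solver over a hypothetical two-variable instance (the variable names, clauses and weights are all invented for illustration).

```python
from itertools import product

def solve(variables, clauses):
    """Exhaustively find the assignment maximizing the total satisfied weight.

    Each clause is ([(variable, polarity), ...], weight); a clause is
    satisfied when at least one of its literals matches the assignment.
    """
    best, best_w = None, -1.0
    for values in product([False, True], repeat=len(variables)):
        assign = dict(zip(variables, values))
        w = sum(weight for lits, weight in clauses
                if any(assign[v] == pol for v, pol in lits))
        if w > best_w:
            best, best_w = assign, w
    return best, best_w

# Invented instance: p = "pattern expresses bornIn", f = "candidate fact holds"
variables = ["p", "f"]
clauses = [
    ([("p", True)], 2.0),                # pattern co-occurs with known facts
    ([("p", False), ("f", True)], 1.5),  # p implies f
    ([("f", False)], 1.0),               # f conflicts with an existing fact
]
assignment, weight = solve(variables, clauses)
print(assignment, weight)  # {'p': True, 'f': True} 3.5
```

Weighted MAX-SAT is NP-hard, so exhaustive search only works for toy instances; the point here is just that accepting a pattern, a disambiguation or a new fact amounts to choosing a truth assignment that satisfies the heaviest set of clauses.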
  • 34. SOFIE: Model
    Assumptions:
    ● an existing ontology
    ● a given corpus of documents
    Statements:
    ● A word in context (wic) is a pair of a word & a context. The context of a
      word is simply the document in which the word appears; thus, a wic is a
      pair of a word & a document. The notation used is word@document.
    ● Wics are entities as well.
    ● Each statement can have an associated truth value of 1 or 0.
  • 35. SOFIE: Model
    Statements:
    ● A prefix notation is used for statements, as SOFIE deals with relations of
      arbitrary arity, e.g. bornIn(AlbertEinstein, Ulm)[1]
    ● A fact is a statement with truth value 1. A statement with an unknown
      truth value is called a hypothesis.
    ● The ontology is considered a set of facts,
      e.g. bornOnDate(AlbertEinstein, 1879-03-14)[1]
    ● SOFIE extracts textual information from the corpus; textual information
      takes the form of facts. One type of fact makes assertions about the
      number of times that a pattern occurred with two wics,
      e.g. patternOcc(“X went to school in Y”, Einstein@29, Germany@29)[1]
  • 36. SOFIE: Model
    Statements:
    ● Another type of fact expresses how likely it is, from a linguistic point
      of view, that a wic refers to a certain entity; this is called a
      “disambiguation prior”, e.g. for the wic Elvis@29:
      disambPrior(Elvis@29, Elvis Presley, 0.8)[1]
      disambPrior(Elvis@29, Elvis Costello, 0.2)[1]
    ● Hypotheses formed by SOFIE
      ● can concern the disambiguation of wics,
        e.g. disambiguateAs(Java@DS, JavaProgrammingLanguage)[?]
      ● can hypothesize about whether a certain pattern expresses a certain
        relation, e.g. expresses(“X was born in Y”, bornInLocation)[?]
      ● can hypothesize about potential new facts,
        e.g. developed(Microsoft, JavaProgrammingLanguage)[?]
    ● Unifying framework:
      ● SOFIE unifies the domains of ontology & information extraction
      ● for SOFIE, there exist only statements
      ● SOFIE tries to figure out which hypotheses are likely to be true
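The disambiguation prior above can be illustrated with a toy lookup (our own sketch; SOFIE actually weighs the prior against pattern and consistency clauses in the MAX-SAT framework rather than simply taking the maximum).

```python
# disambPrior statements, keyed by (wic, entity), from the slide's example.
disamb_prior = {
    ("Elvis@29", "ElvisPresley"): 0.8,
    ("Elvis@29", "ElvisCostello"): 0.2,
}

def disambiguate(wic):
    """Pick the candidate entity with the highest disambiguation prior."""
    candidates = {e: p for (w, e), p in disamb_prior.items() if w == wic}
    return max(candidates, key=candidates.get)

print(disambiguate("Elvis@29"))  # ElvisPresley
```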