1. Reading Course
Omer Gunes, Somerville College
Supervisor : Professor Georg Gottlob
Article 1: “Ontology Based Information Extraction”
Article 2: “YAGO : Yet Another Great Ontology”
Article 3: “SOFIE : A Self-Organizing Framework for Information
Extraction”
D.Phil. in Computer Science
Computing Laboratory, University of Oxford
2. Definitions
Information Extraction
Russell and Norvig:
● Information Extraction
● is automatically retrieving certain types of information from NL texts
● aims
– processing natural language texts
– retrieving occurrences of
● a particular class of objects
● relationships between these objects
● lies between
– Information retrieval systems
– Text understanding systems
Riloff:
● Information Extraction is a form of NLP in which certain types of information must be
● recognized
● extracted from text
3. Definitions
OBIE : Ontology Based Information Extraction
● OBIE has recently emerged as a subfield of IE.
● An OBIE system
● Processes
– Unstructured
– Semi-structured
NL text through a mechanism guided by ontologies
● Extracts certain types of information
● Presents the output using ontologies
● An OBIE system is a set of information extractors each
extracting
● Individuals for a class
● Property values for a property
4. OBIE
Key Characteristics of OBIE Systems
● Process unstructured natural language text
e.g. text files
● Process semi-structured natural language text
e.g. web pages using a particular template, such as Wikipedia
pages
● Present the output using ontologies
● The use of a formal ontology as one of the system inputs
and the target output
● Constructing the ontology to be used through the IE process
● Use an IE process guided by an ontology
● The IE process guided by the ontology to extract things such
as classes, properties and instances
5. OBIE
Potential of OBIE Systems
● Automatically processing the information contained in natural
language text
● A large fraction of the information contained in the WWW
takes the form of natural language text
● Around 80% of the data of a typical corporation is in
natural language
● Creating semantic content for the Semantic Web
● The success of the Semantic Web relies heavily on the existence
of semantic content that can be processed by software
agents, but the creation of such content is quite slow.
● Automatic meta-data creation would be the snowball to
unleash an avalanche of metadata through the web.
6. OBIE
Common Architecture of an OBIE System
[Flattened architecture diagram. Recoverable components: the user queries
the OBIE system through a query answering system; an ontology editor and
an ontology generator produce the ontology; a preprocessor feeds the text
input to the information extraction module, which is guided by the
ontology and a semantic lexicon and stores the extracted information in a
knowledge base / database.]
7. Classification of OBIE Systems
Information Extraction Methods
● Linguistic Rules
● They are represented by regular expressions.
e.g. (watched|seen) <NP>
● Regular expressions are combined with the elements of ontologies such
as classes and properties.
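A rule of this kind can be sketched in Python. The regular expression and the target class “Film” are illustrative assumptions, not taken from any cited system; the noun phrase is crudely approximated by a run of capitalized words:

```python
import re

# Hypothetical linguistic rule "(watched|seen) <NP>", with the noun
# phrase approximated by a sequence of capitalized words.
RULE = re.compile(r"\b(?:watched|seen)\s+((?:[A-Z]\w*\s?)+)")

def extract_film_candidates(text):
    """Return NP candidates for an assumed ontology class 'Film'."""
    return [m.group(1).strip() for m in RULE.finditer(text)]

print(extract_film_candidates("I have seen Citizen Kane twice."))
# ['Citizen Kane']
```

A real OBIE system would then attach the matched string to the ontology class as an instance, possibly after disambiguation.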
● Gazetteer Lists
● Relies on finite-state automata
● Recognizes individual words or phrases instead of patterns
● Consists of a set of words that identify individual entities of a
particular category
e.g. the states of US, the countries of the world
Two conditions for constructing a gazetteer list:
● Specify exactly what is being identified by the gazetteer
● Specify where the information for the gazetteer list was obtained from
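A minimal gazetteer lookup can be sketched as follows. The three-state list and the category label are illustrative; a real gazetteer for US states would list all fifty and document its source, per the two conditions above:

```python
# Minimal gazetteer sketch: a finite list of known names mapped to an
# ontology category (US states here; the list is truncated on purpose).
US_STATES = {"Texas", "Ohio", "Oregon"}

def gazetteer_lookup(tokens, gazetteer, category):
    """Tag tokens that appear in the gazetteer with the given category."""
    return [(tok, category) for tok in tokens if tok in gazetteer]

hits = gazetteer_lookup("He moved from Ohio to Texas".split(),
                        US_STATES, "USState")
print(hits)  # [('Ohio', 'USState'), ('Texas', 'USState')]
```

Unlike the regex rules above, the gazetteer recognizes fixed words or phrases rather than patterns.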
8. Classification of OBIE Systems
Information Extraction Methods
● Supervised classification techniques for Information Extraction
● Support Vector Machines
● Maximum Entropy Models
● Decision Trees
● Hidden Markov Models
● Conditional Random Fields
● Classifiers are trained to identify different components of an ontology such
as instances and property values
e.g. Kylin, a tool developed at the University of Washington, uses
● a CRF model to identify attributes within a sentence
● an ME model to predict which attribute values are present in a sentence.
9. Classification of OBIE Systems
Analyzing HTML/XML tags
● Some OBIE systems use HTML or XML pages as input.
They can extract certain types of information from tables present in HTML
pages.
e.g. The SOBA system, developed at the Karlsruhe Institute of Technology,
extracts information from HTML tables into a knowledge base that uses
F-Logic.
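Table extraction in this spirit can be sketched with Python's standard html.parser. The input table is invented, and a real system such as SOBA maps the cells into an ontology rather than into plain lists:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text content of <td> cells, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False
    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

p = TableExtractor()
p.feed("<table><tr><td>FC Bayern</td><td>3</td></tr></table>")
print(p.rows)  # [['FC Bayern', '3']]
```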
10. Classification of OBIE Systems
Ontology Construction and Update
Two approaches:
● Considering the ontology as an input to the system.
Ontology can be constructed manually.
An “off-the-shelf” ontology can also be used.
Systems adopting this approach :
● SOBA
● KIM
● PANKOW
● The paradigm of open information extraction, in which an ontology is
constructed as part of the OBIE process.
Systems adopting this approach :
● Text-to-Onto
● Kylin
12. Classification of OBIE Systems
Tools used
Shallow NLP tools
● GATE
● SproUT
● Stanford NLP Group
● Center For Intelligent Information Retrieval (CIIR) at the Univ. of Massachusetts
● Saarbrücken Message Extracting System
Semantic lexicons
● WordNet
● GermaNet for German
● Hindi WordNet for Hindi
Ontology Editors
● Protégé
● OntoEdit
13. YAGO
Introduction
● A lightweight & extensible ontology with high coverage & quality.
● 1 million entities and
5 million facts, automatically extracted from Wikipedia and unified with WordNet
● A carefully designed combination of heuristic & rule-based methods
● Contributions:
A major step beyond WordNet both in quality and in quantity
Empirical evaluation shows fact correctness of 95%.
● It is based on a clean model which is decidable, extensible and compatible
with RDFS.
14. YAGO
Background & Motivation
The need for
a huge ontology with knowledge from several sources
an ontology of high quality, with accuracy close to 100 percent
an ontology that comprises not only concepts in the style of WordNet but also
named entities like
● People
● Organizations
● Books
● Songs
● Products
The ontology has to be
● Extensible
● Easily re-usable
● Application independent
15. YAGO
Yet Another Great Ontology
● YAGO is based on Wikipedia's category pages.
Drawback : Wikipedia's category system provides hardly any hierarchy
Conversely : WordNet provides a clean & carefully assembled hierarchy of
thousands of concepts
Problem : There is no mapping between Wikipedia and WordNet concepts
Proposal : New techniques to link them with near-perfect accuracy
● Contribution : Accomplishing the unification between WordNet and facts from
Wikipedia with an accuracy of 97%.
A data model which is based on entities & binary relations.
● General properties of relations and relations between relations can also be
expressed.
● It is designed to be extendable by other resources
● Other high quality resources
● Domain specific extensions
● Data gathered through Information Extraction
16. YAGO
Model
The state-of-the-art formalisms in knowledge representation :
● OWL-Full : It can express properties of relations, but it is undecidable.
● OWL-Lite & OWL-DL : They cannot express relations between facts.
● RDFS : It can express relations between facts, but it provides only very primitive
semantics (e.g. it does not know transitivity)
● YAGO Model : It is an extension of RDFS
All objects are represented as entities.
e.g. AlbertEinstein
Two entities can stand in a relation.
e.g. AlbertEinstein HASWONPRIZE NobelPrize
Numbers, dates, strings and other literals are represented as entities as well.
e.g. AlbertEinstein BORNINYEAR 1879
17. YAGO
Model
● Words are entities as well.
e.g. “Einstein” MEANS AlbertEinstein
“Einstein” MEANS AlfredEinstein
● Similar entities are grouped into classes.
Each entity is an instance of at least one class. This is expressed by the TYPE
relation.
e.g. AlbertEinstein TYPE physicist
● Classes are also entities
Each class is an instance of a class called class
Classes are arranged in a taxonomic hierarchy, expressed by the subClassOf
relation.
e.g. physicist subClassOf scientist
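The TYPE relation combined with the transitive subClassOf hierarchy can be sketched as follows. The storage as sets of pairs is an illustrative choice, not YAGO's actual implementation; the entity and class names follow the slides:

```python
# YAGO-style TYPE / subClassOf facts, stored as sets of pairs.
type_facts = {("AlbertEinstein", "physicist")}
subclass_of = {("physicist", "scientist"), ("scientist", "person")}

def is_instance_of(entity, cls):
    """TYPE combined with the transitive closure of subClassOf."""
    known = {c for (e, c) in type_facts if e == entity}
    frontier = set(known)
    while frontier:
        nxt = {sup for (sub, sup) in subclass_of if sub in frontier}
        if not (nxt - known):
            break
        known |= nxt
        frontier = nxt
    return cls in known

print(is_instance_of("AlbertEinstein", "person"))  # True
```

This mirrors why the model declares subClassOf a transitive relation: physicist subClassOf scientist and scientist subClassOf person together imply physicist subClassOf person.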
18. YAGO
Model
● Relations are entities as well.
The properties of relations can be represented within the model.
e.g. subClassOf TYPE transitiveRelation
● A fact is a triple of
● an entity
● a relation
● an entity
● The two entities are called the arguments of the fact.
Each fact is given a fact identifier, which is also an entity.
Assumption : #1 is the identifier of the fact (AlbertEinstein, BORNINYEAR, 1879);
then #1 FOUNDIN http://wikipedia.org/Einstein
● Common entities are entities which are neither facts nor relations.
Common entities that are not classes are called individuals.
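Fact identifiers can be sketched as a mapping from identifiers to triples, so that a fact can itself appear as the argument of another fact (reification). The data follows the slide's example; the dict-based store is an illustrative choice:

```python
# Reification sketch: each fact gets an identifier that is an entity,
# so other facts can refer to it.
facts = {}

def add_fact(fid, subj, rel, obj):
    facts[fid] = (subj, rel, obj)
    return fid

f1 = add_fact("#1", "AlbertEinstein", "BORNINYEAR", "1879")
# The identifier now serves as the first argument of another fact:
add_fact("#2", f1, "FOUNDIN", "http://wikipedia.org/Einstein")
print(facts["#2"])  # ('#1', 'FOUNDIN', 'http://wikipedia.org/Einstein')
```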
19. YAGO
Ontology
● C : a finite set of common entities
R : a finite set of relation names
I : a finite set of fact identifiers
A YAGO ontology y is an injective & total function
y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R)
● The set of relation names R for any YAGO ontology must contain at least :
● Type
● subClassOf
● Domain
● Range
● subRelationOf
20. YAGO
Ontology
The set of class names C for any YAGO ontology must contain at least :
● Entity
● Class
● Relation
● acyclicTransitiveRelation
● Classes for all literals
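A YAGO ontology is an injective, total function y : I → (I ∪ C ∪ R) × R × (I ∪ C ∪ R). Those properties can be checked mechanically for a toy ontology; the sets below are invented, and the point is only to make totality, well-typedness and injectivity concrete:

```python
# Toy YAGO ontology: fact identifiers I, common entities C, relations R.
I = {"#1", "#2"}
C = {"AlbertEinstein", "1879"}
R = {"BORNINYEAR", "FOUNDIN"}

y = {
    "#1": ("AlbertEinstein", "BORNINYEAR", "1879"),
    "#2": ("#1", "FOUNDIN", "AlbertEinstein"),
}

args = I | C | R
total = set(y) == I                         # every identifier names a fact
well_typed = all(s in args and r in R and o in args
                 for s, r, o in y.values()) # arguments drawn from I ∪ C ∪ R
injective = len(set(y.values())) == len(y)  # no two ids for the same fact
print(total, well_typed, injective)  # True True True
```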
21. YAGO
Knowledge Extraction – The TYPE Relation
● Each Wiki page title is a candidate to become an individual in YAGO
● The Wikipedia Category System :
Different types of categories
● Conceptual categories
e.g. Naturalized citizens of the United States
● Administrative categories
e.g. Articles with unsourced statements
● Relational categories
e.g. 1879 births
● Thematic vicinity categories
e.g. Physics
22. YAGO
Knowledge Extraction – The MEANS Relation
● WordNet reveals the meaning of words by its synsets.
“urban center” & “metropolis” belong to the “city” synset
(“metropolis”, MEANS, city)
● Wikipedia redirect system gives alternative names for the entities.
(“Einstein, Albert”, MEANS, Albert Einstein)
● The relations GivenNameOf & FamilyNameOf are sub-relations of MEANS
A name parser is used to identify and to decompose person names.
(“Einstein”, FamilyNameOf, AlbertEinstein)
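A toy version of such a name parser could look as follows; the real YAGO parser is considerably more elaborate (handling titles, middle names, and so on), so treat this as a sketch of the idea only:

```python
# Hypothetical name parser producing GivenNameOf / FamilyNameOf facts,
# both sub-relations of MEANS.
def decompose_person_name(full_name):
    parts = full_name.split()
    if len(parts) < 2:
        return []  # cannot decompose a single-token name
    given, family = parts[0], parts[-1]
    return [(given, "GivenNameOf", full_name),
            (family, "FamilyNameOf", full_name)]

print(decompose_person_name("Albert Einstein"))
# [('Albert', 'GivenNameOf', 'Albert Einstein'),
#  ('Einstein', 'FamilyNameOf', 'Albert Einstein')]
```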
28. SOFIE
Background & Motivation
● Three systems (YAGO, Kylin/KOG, DBpedia)
● used IE methods for constructing large ontologies.
● leveraged high-quality hand-crafted sources with latent knowledge for
collecting individual entities & facts.
● combined their results with a taxonomic hierarchy such as WordNet &
SUMO.
● Empirical assessment has shown that these approaches have an accuracy > 95%.
● They are close to the best hand-crafted rules.
● The resulting ontologies
● contain
– millions of entities
– tens of millions of facts
● are organized in a consistent manner by a transitive & acyclic subclass
relation.
29. SOFIE
Background & Motivation
● Next Stage : Expanding & maintaining automatically compiled
ontologies as knowledge keeps evolving.
● Wikipedia's semi-structured knowledge is huge but limited.
● NL text sources such as news articles, biographies, scientific
publications and full-text Wiki articles must be brought into scope.
● The existing 80% accuracy is unacceptable for an ontological
knowledge base.
● Key Idea : Leverage the existing ontology for its own growth :
● use trusted facts as a basis for generating good patterns.
● scrutinize the resulting hypotheses with regard to their
consistency with the already known facts.
31. SOFIE
Contribution
● 3 problems :
● Pattern selection
● Entity disambiguation
● Consistency checking
● Proposed Approach :
● A unified model for ontology-oriented IE solving all 3 issues
simultaneously.
● Known facts, hypotheses for new facts, word-to-entity
mappings, gathered sets of patterns and a configurable
set of semantic constraints are cast into a unified framework of
logical clauses.
● All 3 problems can then be seen as one Weighted MAX-SAT problem:
the task of finding an assignment that satisfies a maximum-weight
set of clauses.
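The Weighted MAX-SAT formulation can be illustrated with a brute-force solver. SOFIE itself uses an efficient approximation algorithm, not enumeration, and the variables, clauses and weights below are invented; the second clause plays the role of a semantic constraint (a person is born in at most one place):

```python
import itertools

# Weighted clauses: lists of (variable, polarity) literals plus a weight.
clauses = [
    ([("bornIn_Ulm", True)], 2.0),                              # pattern evidence
    ([("bornIn_Ulm", False), ("bornIn_Munich", False)], 5.0),   # constraint: not both
    ([("bornIn_Munich", True)], 1.0),                           # weaker evidence
]
variables = sorted({v for lits, _ in clauses for v, _ in lits})

def weight(assignment):
    """Total weight of clauses satisfied by the assignment."""
    return sum(w for lits, w in clauses
               if any(assignment[v] == pol for v, pol in lits))

best = max((dict(zip(variables, vals))
            for vals in itertools.product([True, False],
                                          repeat=len(variables))),
           key=weight)
print(best)  # {'bornIn_Munich': False, 'bornIn_Ulm': True}
```

The constraint outweighs the weaker evidence, so the solver keeps bornIn_Ulm and rejects bornIn_Munich, which is exactly the consistency-checking behavior the slides describe.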
32. SOFIE
Contribution
Implementation :
● A new model for consistent growth of a large ontology.
● A unified model for
● Pattern selection
● Entity disambiguation
● Consistency checking
● Identification of the best hypotheses for new facts
● An efficient algorithm for the resulting Weighted MAX-SAT
problem.
● Experiments with a variety of real-life textual & semi-structured
sources to demonstrate the scalability & high accuracy of the
approach.
34. SOFIE
Model
Assumption :
● An existing ontology
● A given corpus of documents
Statements :
● A word in context (wic) is a pair of word & context.
● The context of a word is simply the document in which the word
appears. Thus, a wic is a pair of word & document. The notation
being used is word@document.
● Wics are entities as well.
● Each statement can have an associated truth value of 1 or 0.
35. SOFIE
Model
Statements :
● A prefix notation is used for statements, as SOFIE deals with relations of
arbitrary arity.
e.g. bornIn(Albert Einstein, Ulm)[1]
A fact is a statement with truth value 1.
A statement with an unknown truth value is called a hypothesis.
● The ontology is considered as a set of facts.
e.g. bornOnDate(AlbertEinstein, 1879-03-14)[1]
● SOFIE will extract textual information from the corpus.
Textual information takes the form of facts.
One type of fact asserts the number of times that a pattern
occurred with two wics.
e.g. patternOcc(“X went to school in Y”, Einstein@29, Germany@29)[1]
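Counting patternOcc facts can be sketched as follows. The word@document notation follows the slides, while the regex-based matcher is an illustrative simplification of real pattern matching:

```python
import re
from collections import Counter

def count_pattern(doc_id, text, pattern):
    """Count 'X <pattern> Y' matches; X and Y become wics word@doc."""
    regex = re.compile(r"(\w+) " + re.escape(pattern) + r" (\w+)")
    occ = Counter()
    for x, y in regex.findall(text):
        occ[(pattern, f"{x}@{doc_id}", f"{y}@{doc_id}")] += 1
    return occ

occ = count_pattern(29, "Einstein went to school in Germany",
                    "went to school in")
print(occ)
```

The resulting counter entry corresponds to the fact patternOcc(“X went to school in Y”, Einstein@29, Germany@29)[1] from the slide.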
36. SOFIE
Model
Statements :
● Another type of fact states how likely it is, from a linguistic point of view, that a wic
refers to a certain entity. This likelihood is called a “disambiguation prior”.
e.g. disambiguation prior of the wic Elvis@29
disambPrior(Elvis@29, Elvis Presley, 0.8)[1]
disambPrior(Elvis@29, Elvis Costello, 0.2)[1]
● Hypotheses that are formed by SOFIE
● can concern the disambiguation of wics,
e.g. disambiguateAs(Java@DS, JavaProgrammingLanguage)[?]
● can hypothesize about whether a certain pattern expresses a certain relation,
e.g. expresses(“X was born in Y”, bornInLocation)[?]
● can hypothesize about potential new facts,
e.g. developed(Microsoft, JavaProgrammingLanguage)[?]
● Unifying Framework :
● SOFIE unifies the domains of ontology & information extraction.
● For SOFIE, there exist only statements.
● SOFIE will try to figure out which hypotheses are likely to be true.
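The disambiguation prior from slide 36 can be sketched as relative frequencies. The counts below are invented, and deriving them from link statistics is an assumption for illustration, not a statement of how SOFIE actually estimates the prior:

```python
# Hypothetical counts of how often the name (anchor text) "Elvis"
# refers to each candidate entity.
link_counts = {("Elvis", "Elvis Presley"): 80,
               ("Elvis", "Elvis Costello"): 20}

def disamb_prior(word, entity):
    """Fraction of occurrences of `word` that refer to `entity`."""
    total = sum(c for (w, _), c in link_counts.items() if w == word)
    return link_counts.get((word, entity), 0) / total if total else 0.0

print(disamb_prior("Elvis", "Elvis Presley"))  # 0.8
```

These priors enter the Weighted MAX-SAT problem as soft evidence for the disambiguateAs hypotheses, alongside pattern statistics and semantic constraints.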