The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Appositions, and Adjectives
1. The Triplex Approach for Recognizing Semantic Relations from Noun Phrases, Appositions, and Adjectives
Iman Mirrezaei (1), Bruno Martins (2), and Isabel F. Cruz (1)
(1) ADVIS Lab, Department of Computer Science, University of Illinois at Chicago, USA
(2) Instituto Superior Tecnico, Universidade de Lisboa, Portugal
2. Motivation
How to extract useful knowledge from textual resources?
How to identify relations between entities?
Microsoft is an American corporation headquartered in Redmond, Washington
Michelle Obama (born January 17, 1964), an American lawyer and writer, is the wife of the ...
3. Triples
Each triple represents an atomic fact by stating a subject, a predicate (property), and an object (value)
◦ e.g., “The sky has the color blue.” → <the sky; has; the color blue>
Triples can be expressed by verbs, or by particular noun phrases, in textual resources
◦ Verb-mediated formats
◦ Noun-mediated formats
An information extractor converts an input text into a set of triples
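The triple format above can be made concrete with a minimal data structure. The `Triple` class below is an illustrative sketch, not part of any of the systems discussed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """An atomic fact: <subject; predicate; object>."""
    subject: str
    predicate: str
    object: str

    def __str__(self) -> str:
        return f"<{self.subject}; {self.predicate}; {self.object}>"

# The slide's example sentence expressed as a triple.
fact = Triple("the sky", "has", "the color blue")
print(fact)  # <the sky; has; the color blue>
```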
4. Information extractors
Verb-mediated triple extractors
◦ TextRunner [Banko et al. 2007], WOE [Wu and Weld 2010], ReVerb [Fader et al. 2011], and OLLIE [Mausam et al. 2012]
◦ e.g., “Obama will be elected President of the United States” → <Obama; will be elected; President of the United States>
Noun-mediated triple extractors
◦ OLLIE: the first noun-mediated triple extractor
◦ OLLIE has patterns to extract noun-mediated triples if they can also be expressed in a verb-mediated format
◦ e.g., “Microsoft co-founder Bill Gates spoke at the conference” → <Bill Gates; be co-founder of; Microsoft>
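The verb-mediated idea can be illustrated with a toy extractor. This is not the actual TextRunner/ReVerb/OLLIE machinery; it is a minimal sketch, assuming coarsely pre-tagged input, that reads a leading noun phrase as subject, the following verb chain as predicate, and the next noun phrase as object. Real systems also attach prepositional material (e.g., “President of the United States”), which this sketch truncates:

```python
# Toy verb-mediated extraction over (token, coarse tag) pairs,
# with tags in {'N': noun phrase, 'V': verb chain, 'O': other}.
# The function name and tag scheme are illustrative assumptions.
from typing import List, Tuple

def extract_verb_mediated(tagged: List[Tuple[str, str]]) -> Tuple[str, str, str]:
    def span_end(kind: str, start: int) -> int:
        # Advance while tokens keep the requested coarse tag.
        i = start
        while i < len(tagged) and tagged[i][1] == kind:
            i += 1
        return i

    subj_end = span_end('N', 0)          # leading noun phrase = subject
    pred_end = span_end('V', subj_end)   # verb chain = predicate
    obj_end = span_end('N', pred_end)    # trailing noun phrase = object
    join = lambda a, b: " ".join(tok for tok, _ in tagged[a:b])
    return (join(0, subj_end), join(subj_end, pred_end), join(pred_end, obj_end))

tagged = [("Obama", "N"), ("will", "V"), ("be", "V"), ("elected", "V"),
          ("President", "N")]
print(extract_verb_mediated(tagged))  # ('Obama', 'will be elected', 'President')
```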
6. Noun-mediated triples
Noun-mediated triples can be expressed through noun phrases with adjectives, compound nouns, and appositions
How to extract noun-mediated triples that are not expressed via verb-mediated formats?
How to extract templates automatically from text to generate noun-mediated triples?
8. The bootstrapping process
A sentence of a wiki page is extracted if it contains an infobox value (object) and a synset member (subject)
◦ The sentence is kept if there is a dependency path between the object and the subject (noun, adjective, or apposition dependencies)
◦ Tokens in the dependency path between the subject and the object are annotated with POS tags, lexical constraints, WordNet synsets, and named entity tags
Annotated paths are used as extraction templates
A constraint is imposed on the length of the dependency path
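The path-finding step above can be sketched as a breadth-first search over the dependency graph, assuming the sentence has already been parsed. The edge list below hand-codes part of the parse from the Microsoft example on the next slide (Stanford-style labels); the function name and `max_len` cutoff are illustrative assumptions:

```python
# Find the dependency path between the synset member (subject) and the
# infobox value (object); discard it if it exceeds the length constraint.
from collections import deque

edges = [("corporation", "vmod", "headquartered"),
         ("headquartered", "prep_in", "Washington"),
         ("Washington", "nn", "Redmond")]

def dependency_path(start, goal, edges, max_len=4):
    """BFS over the (undirected) dependency graph; returns the label
    sequence from start to goal, or None if none short enough exists."""
    adj = {}
    for head, label, dep in edges:
        adj.setdefault(head, []).append((label, dep))
        adj.setdefault(dep, []).append((label, head))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path if len(path) <= max_len else None
        for label, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))
    return None

print(dependency_path("corporation", "Redmond", edges))
```

With the default cutoff this finds the three-edge path `vmod, prep_in, nn`; tightening `max_len` to 2 rejects it, mirroring the length constraint on templates.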
9. Example
Microsoft Corporation is an American multinational software corporation headquartered in Redmond, Washington that develops….
◦ vmod(corporation-8, headquartered-9)
◦ prep(headquartered-9, in-10)
◦ nn(Washington-13, Redmond-11)
10. Example
“Microsoft is an American corporation headquartered in Redmond, Washington”

Token:           Microsoft  is   an  American  corporation  headquartered  in  Redmond  ,  Washington
POS tags:        NNP        VBZ  DT  JJ        NN           VBN            IN  NNP      ,  NNP
Named entities:  ORG        O    O   MISC      O            O              O   LOC      O  LOC
WordNet synsets: O          O    O   O         ORG          O              O   O        O  O

Occurrences of subject and object: subject = “corporation”, object = “Redmond, Washington”
Dependencies: vmod(corporation, headquartered), prep-in, nn(Washington, Redmond); “Microsoft” corefers with “corporation”
Infobox name: Headquarters; Infobox value: Redmond, Washington
Range of headquarters: Location
Synset member: Corporation; Synset member type: Organization
Lexical constraint: Headquarter in
(O: no label; PER: person; NUM: number; ORG: organization; LOC: location)
11. Templates
Templates capture how a class of triples is expressed in a sentence.
◦ Deep syntactic features: dependencies
◦ Shallow syntactic features: POS tags, noun phrases
◦ Lexical features
◦ Named entity types and WordNet synsets
◦ Property ranges (Person, Organization, Location, or unknown)
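One way to picture a learned template is as a bundle of the features listed above. The `Template` class and `matches` function below are illustrative assumptions about the representation, not Triplex's actual internals; the example values come from the Microsoft sentence:

```python
# Sketch of a template as a feature bundle, and of matching a candidate
# extraction against it. Field names are hypothetical.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Template:
    dep_path: Tuple[str, ...]     # deep syntactic feature (dependency labels)
    subject_type: str             # named entity type / WordNet synset
    property_range: str           # Person, Organization, Location, or unknown
    lexical_constraint: str = ""  # e.g. "headquartered in"

def matches(template, dep_path, subject_type, object_type, words):
    return (template.dep_path == tuple(dep_path)
            and template.subject_type == subject_type
            and template.property_range in (object_type, "unknown")
            and (not template.lexical_constraint
                 or template.lexical_constraint in words))

hq = Template(("vmod", "prep_in", "nn"), "Organization", "Location",
              "headquartered in")
candidate = dict(
    dep_path=["vmod", "prep_in", "nn"],
    subject_type="Organization",
    object_type="Location",
    words="Microsoft is an American corporation headquartered in Redmond, Washington")
print(matches(hq, **candidate))  # True
```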
12. Triplex
Confidence score for triples
◦ A logistic regression classifier
◦ Features: frequency of the extraction template, existence of lexical words, range of properties, semantic object type
Template matching
◦ Candidate subjects are recognized by NER types and WordNet synsets
◦ The dependency paths between the subject and all potential objects are annotated
◦ Annotated paths are matched against the templates
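The confidence score can be sketched as a hand-rolled logistic regression over the features listed above. The weights and bias here are made-up numbers for illustration; in Triplex they would be learned from training data:

```python
# Logistic regression sketch: map triple features to a confidence in (0, 1).
import math

def confidence(template_frequency, has_lexical_words, range_matches,
               object_type_matches):
    features = [math.log1p(template_frequency),   # damp raw frequency counts
                float(has_lexical_words),
                float(range_matches),
                float(object_type_matches)]
    weights = [0.6, 0.8, 1.2, 0.9]  # illustrative values, not learned
    bias = -2.0
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

# A frequent template with all supporting evidence scores high...
high = confidence(50, True, True, True)
# ...while a rare template with no supporting evidence scores low.
low = confidence(1, False, False, False)
print(round(high, 2), round(low, 2))
```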
13. Evaluation
Automatic evaluation according to the procedure suggested by Bronzi et al. [2012]
◦ 1000 random sentences from Wikipedia
◦ A gold standard is created using PMI, DBpedia, and Freebase
Manual evaluation
◦ 50 random sentences from Wikipedia
◦ The agreement between the automatic and manual evaluation is about 0.71
14. The gold standard
A fact is a triple <subject, property, object>
All possible entities are recognized by NER types and WordNet synsets
All verbs (predicates) are detected by Stanford CoreNLP, and the set of predicates is expanded with DBpedia and Freebase properties
All facts extracted from the sentences are verified against
◦ DBpedia
◦ Freebase
16. Error analysis
Missed extractions
◦ 10% No semantic types
◦ 12% Dependency parser problems
◦ 7% Coreference errors
◦ 6% Over-generalized templates
◦ 65% Verb-mediated triples (outside the scope of Triplex)
17. Correctly extracted triples
Distribution by triple category:
Noun-mediated
◦ 12% Conjunctions, adjectives, and noun phrases
◦ 9% Appositions and parenthetical phrases
◦ 6% Titles or professions
◦ 8% Templates with lexicon
Verb-mediated
◦ 65% Verb-mediated triples
18. Conclusion
Triplex generates noun-mediated triples from compound nouns, adjectives, and appositions
Triplex complements the output of verb-mediated triple extractors
IE systems like Triplex can assist authors in annotating Wikipedia pages (recognizing missing infobox values)
19. Future work
Improve results for triples involving numerical values with different units (e.g., square meters, meters)
Enrich the bootstrapping process by using a probabilistic knowledge base (e.g., Probase [Wu et al. 2012])
20. References
M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni: Open Information Extraction from the Web. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 2670–2676 (2007)
A. Fader, S. Soderland, O. Etzioni: Identifying Relations for Open Information Extraction. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1535–1545 (2011)
Mausam, M. Schmitz, R. Bart, S. Soderland, O. Etzioni: Open Language Learning for Information Extraction. In: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 523–534 (2012)
F. Wu, D.S. Weld: Open Information Extraction Using Wikipedia. In: Annual Meeting of the Association for Computational Linguistics (ACL), pp. 118–127 (2010)
M. Bronzi, Z. Guo, F. Mesquita, D. Barbosa, P. Merialdo: Automatic Evaluation of Relation Extraction Systems on Large-scale. In: Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX), pp. 19–24 (2012)
W. Wu, H. Li, H. Wang, K.Q. Zhu: Probase: A Probabilistic Taxonomy for Text Understanding. In: ACM SIGMOD International Conference on Management of Data, pp. 481–492 (2012)