A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction
1. A CROSS-LINGUAL ANNOTATION PROJECTION-BASED SELF-SUPERVISION APPROACH FOR OPEN INFORMATION EXTRACTION
The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
November 10th, 2011, Chiang Mai
Seokhwan Kim (POSTECH)
Minwoo Jeong (Microsoft Bing)
Jonghoon Lee (POSTECH)
Gary Geunbae Lee (POSTECH)
2. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
4. Information Extraction
• Goal
To generate structured information from natural language
documents
• Representing semantic relationships among a set of arguments
Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.
(Birthday and Birthplace relations marked in the sentence)
Person: Barack Obama
Birthday: August 4, 1961
Birthplace: Honolulu
5. Previous Approaches
• Many supervised machine-learning approaches have been
successfully applied to the Relation Detection and
Characterization (RDC) task
(Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta
and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)
However, large amounts of training data are required
• Weakly-supervised techniques have been sought
(Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
To learn the IE system without significant annotation effort
• Open Information Extraction
(Banko et al., 2007; Wu and Weld, 2010)
6. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
7. Open Information Extraction
• An alternative weakly-supervised IE paradigm
(Banko et al., 2007)
• Problem Definition
O : d → { <e_i, r_ij, e_j> | 1 ≤ i, j ≤ n }
Binary relation extraction between entities e_i and e_j
Considering relationships explicitly represented by r_ij
• Goal
Large-scale IE
• Domain-independent
• Relation-independent
Without hand-crafted rules or hand-annotated training examples
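In code, an Open IE extraction can be represented as a plain triple; a minimal sketch (the `Extraction` type and its field names are illustrative, not from the paper):

```python
from typing import List, NamedTuple

class Extraction(NamedTuple):
    """One Open IE triple <e_i, r_ij, e_j> (type name is illustrative)."""
    arg1: str  # e_i, the first argument entity
    rel: str   # r_ij, the relation phrase linking the arguments
    arg2: str  # e_j, the second argument entity

# The running birthplace example expressed as a triple:
triples: List[Extraction] = [Extraction("Barack Obama", "was born in", "Honolulu")]
print(triples[0])
```

Because the relation is taken verbatim from the sentence, no predefined relation inventory is needed, which is what makes the paradigm relation-independent.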
8. How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
Using automatically obtained training examples
• From external knowledge
• Previous Systems
TextRunner (Banko et al., 2007)
• Penn Treebank
• A small set of heuristics about syntactic structural constraints
WoE (Wu and Weld, 2010)
• Wikipedia articles
• Wikipedia Infoboxes
9. What’s the Problem?
• Previous approaches mainly depend on language-specific
knowledge for English
Heuristic-based Approach
• Syntactic treebank for the target language
• Heuristics designed for the target language
Wikipedia-based Approach
• Wikipedia articles and infoboxes are available for languages other than English
• But the amount of available resources differs greatly across languages
English Wikipedia: 3,500,000 articles
Korean Wikipedia: 150,000 articles
10. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
11. Cross-lingual Annotation Projection
• Goal
To obtain training examples for the target language LT
• Method
To leverage parallel corpora to project the annotations on the
source language LS to the target language LT
The premise is that parallel corpora between LS and LT are much
easier to obtain than the task-specific training dataset for LT
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
Barack Obama was born in Honolulu , Hawaii .
버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
(beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
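The projection in this example can be sketched with a toy word alignment; the alignment indices below are hypothetical, chosen to mirror the slide:

```python
def project_span(span, alignment):
    """Map a set of source-token indices to their aligned target indices (sketch)."""
    return sorted({t for s in span for t in alignment.get(s, [])})

src = "Barack Obama was born in Honolulu , Hawaii .".split()
tgt = "버락 오바마 는 하와이 의 호놀룰루 에서 태어났다".split()

# Hypothetical word alignment: source token index -> target token indices
alignment = {0: [0], 1: [1], 2: [7], 3: [7], 4: [6], 5: [5], 7: [3]}

# Project the relation phrase "was born in" (source tokens 2..4):
rel_ko = [tgt[i] for i in project_span({2, 3, 4}, alignment)]
print(rel_ko)  # the projected Korean relation context (e-seo tae-eo-nat-da)
```

The projected span yields the Korean relation context even though the two languages order their words differently, since only the alignment links matter.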
12. Cross-lingual Annotation Projection
• Previous Work
Part-of-speech tagging (Yarowsky and Ngai, 2001)
Named-entity tagging (Yarowsky et al., 2001)
Verb classification (Merlo et al., 2002)
Dependency parsing (Hwa et al., 2005)
Mention detection (Zitouni and Florian, 2008)
Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no work has reported on the
Open IE task
13. Annotation
• To obtain annotations for the sentences in LS
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, extraction is performed
16. Annotation
• To obtain annotations for the sentences in LS
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, extraction is performed
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
Barack Obama was born in Honolulu , Hawaii .
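The first two steps of the procedure can be sketched as candidate-instance generation over entity pairs (helper name is illustrative):

```python
from itertools import combinations

def make_instances(entities):
    """Build candidate instances as pairs of identified entities (sketch)."""
    return list(combinations(entities, 2))

entities = ["Barack Obama", "Honolulu", "Hawaii"]
instances = make_instances(entities)
print(instances)  # 3 candidate pairs; extraction is then run on each
```

Only one of the three pairs, (Barack Obama, Honolulu), carries the explicit relation in the example sentence; the others become negative instances.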
17. Projection
• To project the annotations from the sentences in LS onto
the sentences in LT using word alignment information
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, the existence of relationship is determined
If the instance is positive, the contextual subtext is projected
21. Projection
• To project the annotations from the sentences in LS onto
the sentences in LT using word alignment information
• Procedure
A set of entities in the given sentence is identified
Each instance is composed of a pair of entities
For each instance, the existence of relationship is determined
If the instance is positive, the contextual subtext is projected
<e1, r12, e2> = <Barack Obama, was born in, Honolulu>
Barack Obama was born in Honolulu , Hawaii .
버락 오바마 는 하와이 의 호놀룰루 에서 태어났다
(beo-rak-o-ba-ma) (neun) (ha-wa-i) (ui) (ho-nol-rul-ru) (e-seo) (tae-eo-nat-da)
<e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
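The positivity decision in step 3 can be sketched as a lookup through the entity alignment (all names and the alignment table are illustrative):

```python
def is_positive(tgt_pair, entity_alignment, src_positive_pairs):
    """A target instance is positive when its entities align to a
    positive source-side pair (sketch)."""
    src_pair = (entity_alignment.get(tgt_pair[0]), entity_alignment.get(tgt_pair[1]))
    return src_pair in src_positive_pairs

# Toy entity alignment for the running example:
entity_alignment = {"beo-rak-o-ba-ma": "Barack Obama",
                    "ho-nol-rul-ru": "Honolulu",
                    "ha-wa-i": "Hawaii"}
src_positive = {("Barack Obama", "Honolulu")}

print(is_positive(("beo-rak-o-ba-ma", "ho-nol-rul-ru"), entity_alignment, src_positive))
print(is_positive(("beo-rak-o-ba-ma", "ha-wa-i"), entity_alignment, src_positive))
```

Only positive instances then get their source-side relation context projected, as in step 4.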
22. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
23. Overall Architecture
English-Korean Parallel Corpus
→ (Self-Supervision) → Korean Annotated Corpus
→ (Learning) → Korean Open IE Model
→ (Extraction, applied to Korean Raw Text) → Extracted Results
24. Cross-lingual Annotation Projection-based Self-Supervision
Annotation Projection pipeline:
Parallel Corpus → English Sentences and Korean Sentences
English Sentences → English Preprocessors → English Open IE System → English Annotated Corpus
Korean Sentences → Korean Preprocessors
Preprocessed English and Korean sentences → Word Alignment
English Annotated Corpus + Word Alignment → Projection → Korean Annotated Corpus
25. Cross-lingual Annotation Projection-based Self-Supervision
• Dataset
English-Korean Parallel Corpus
• 266,892 bi-sentence pairs in English and Korean
• Preprocessors
English
• OpenNLP toolkit
Korean
• Espresso toolkit
26. Cross-lingual Annotation Projection-based Self-Supervision
• English Open IE
Our own implementation of Banko et al.'s method
• Dataset
The WSJ part of Penn Treebank
By applying a series of heuristics (Banko, 2009)
1,028,361 instances from 49,208 sentences (9.0% were positive)
• Model
Conditional Random Fields (CRF)
• With Lexical and POS tag features
• CRF++ toolkit
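Training examples for such a CRF can be laid out in CRF++'s column format (one token per line, feature columns, label last, blank line between sequences); the POS and BIO tags below are illustrative:

```python
# Each row: token, POS tag, BIO label marking the relation phrase
# between the two arguments (tags are illustrative).
tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu"]
pos    = ["NNP",    "NNP",   "VBD", "VBN",  "IN", "NNP"]
labels = ["O",      "O",     "B-REL", "I-REL", "I-REL", "O"]

rows = ["\t".join(cols) for cols in zip(tokens, pos, labels)]
print("\n".join(rows))  # a blank line would end the sequence in CRF++
```

The lexical and POS columns supply the features; the label column is what the CRF learns to predict for unseen sentences.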
27. Cross-lingual Annotation Projection-based Self-Supervision
• Word Alignment
Aligned by GIZA++ toolkit
• In the standard configuration in both directions
• The bi-directional alignments were joined using the grow-diag-final
algorithm
Chunk-based Reorganization
• To reduce the word alignment errors
• Generating alignments between pairs of base phrase chunks
• Using a simple greedy algorithm
Based on the overlap score of aligned words between base phrase chunks
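The greedy chunk-alignment step described above can be sketched as follows; the chunk boundaries and alignment points are toy values, and the scoring simply counts word alignments whose endpoints fall inside both chunks:

```python
def align_chunks(src_chunks, tgt_chunks, word_align):
    """Greedily pair chunks by overlap score: the number of word-alignment
    links whose endpoints fall inside both chunks (sketch)."""
    scored = []
    for i, sc in enumerate(src_chunks):
        for j, tc in enumerate(tgt_chunks):
            overlap = sum(1 for s, t in word_align if s in sc and t in tc)
            if overlap > 0:
                scored.append((overlap, i, j))
    pairs, used_src, used_tgt = [], set(), set()
    for _, i, j in sorted(scored, reverse=True):  # best-scoring pairs first
        if i not in used_src and j not in used_tgt:
            pairs.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return pairs

# Toy chunks (token-index sets) and a toy word alignment:
src_chunks = [{0, 1}, {2, 3, 4}, {5}]
tgt_chunks = [{0, 1}, {5}, {6, 7}]
word_align = {(0, 0), (1, 1), (2, 7), (3, 7), (4, 6), (5, 5)}
pairs = align_chunks(src_chunks, tgt_chunks, word_align)
print(pairs)
```

Because a stray one-word misalignment rarely outvotes the chunk-level majority, aligning whole base-phrase chunks smooths over individual word-alignment errors.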
29. Learning & Extraction
• Extractor for Korean Open IE
Maximum Entropy (ME) model
• To detect whether or not each given instance is positive
• Features
Lexical, POS Tag
On the dependency path
• Maximum Entropy Modeling toolkit
Conditional Random Fields (CRF) model
• To identify the contextual subtext indicating the semantic relationship
• Features
Lexical, POS Tag
On the dependency path
• CRF++ toolkit
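The two-stage extractor can be sketched as a pipeline where the trained ME and CRF models are passed in as callables (the stand-in lambdas below are illustrative, not the actual models):

```python
def extract(instances, detect, tag_context):
    """Two-stage sketch: a binary classifier (ME-style) filters instances,
    then a sequence tagger (CRF-style) marks the relation context."""
    results = []
    for inst in instances:
        if detect(inst):                               # stage 1: positive or not?
            results.append((inst, tag_context(inst)))  # stage 2: context span
    return results

# Toy stand-ins for the trained models:
detect = lambda inst: inst == ("beo-rak-o-ba-ma", "ho-nol-rul-ru")
tag_context = lambda inst: "e-seo tae-eo-nat-da"

instances = [("beo-rak-o-ba-ma", "ho-nol-rul-ru"),
             ("beo-rak-o-ba-ma", "ha-wa-i")]
extractions = extract(instances, detect, tag_context)
print(extractions)
```

Splitting detection from context tagging means the expensive sequence model only runs on instances the classifier already accepted.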
30. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
31. Evaluation #1
• Dataset
250 sentences from Korean Wikipedia articles
With a manually annotated gold standard
• 1,434 instances
• 308 positive instances
• Baseline
Heuristic-based System
• Sejong treebank corpus (Korean)
• The set of heuristics used for the English Open IE system,
excluding the language-specific rules
32. Evaluation #1
• Comparison of performances
Model                    P     R     F
Heuristic               47.7  20.1  28.3
Projection              33.6  49.0  39.8
Heuristic + Projection  41.9  46.4  44.1
(P = precision, R = recall, F = F-measure, in %)
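As a sanity check, the F column is the harmonic mean of P and R (last-digit differences against the table are slide rounding):

```python
def f_score(p, r):
    """F-measure: the harmonic mean of precision and recall (values in %)."""
    return 2 * p * r / (p + r)

for model, p, r in [("Heuristic", 47.7, 20.1),
                    ("Projection", 33.6, 49.0),
                    ("Heuristic + Projection", 41.9, 46.4)]:
    print(f"{model}: F = {f_score(p, r):.1f}")
```

The numbers show the trade-off clearly: the heuristic baseline is more precise but recall-starved, while projection trades precision for much higher recall, and combining the two training sources gives the best F.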
36. Evaluation #2
• Datasets
Korean Newswire
• 302,276 documents
• 2,565,487 sentences
Korean Wikipedia
• 123,000 articles
• 1,342,003 sentences
• Manual Evaluation
For four relation types
• BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
37. Evaluation #2
• Evaluation results for four relation types
Type         Newswire                  Wikipedia
             precision  # extractions  precision  # extractions
Birth Place  65.2       256            69.1       971
Won Award    57.4       824            63.3       286
Acquisition  67.0       1,112          50.3       143
Invent Of    53.1       32             47.6       103
In total, 3,727 extractions with a precision of 63.7% across the four relation types
38. Evaluation #2
• Distribution of the errors
Error Type # of errors
Chunking Error 364 (26.9%)
Dependency Parsing Error 461 (34.1%)
Extracting Error 527 (39.0%)
39. Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions
40. Conclusions
• Summary
A Cross-lingual Annotation Projection Approach for Open IE
Korean Open IE system developed using an English Open IE
system and an English-Korean parallel corpus
Our system outperformed the heuristic-based baseline
Our system achieved 63.7% precision in a large-scale evaluation
• Ongoing Work
Reducing sensitivity to errors introduced by the preprocessors
Investigating hybrid approaches that consider various external
knowledge sources