A CROSS-LINGUAL ANNOTATION PROJECTION-
   BASED SELF-SUPERVISION APPROACH
   FOR OPEN INFORMATION EXTRACTION

  The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011)
                             November 10th, 2011, Chiang Mai

                          Seokhwan Kim (POSTECH)
                          Minwoo Jeong (Microsoft Bing)
                            Jonghoon Lee (POSTECH)
                          Gary Geunbae Lee (POSTECH)
Contents
• Introduction
• Open Information Extraction
• Cross-Lingual Annotation Projection
• Implementation
• Evaluation
• Conclusions




                                        2
Information Extraction
• Goal
   To generate structured information from natural language
    documents
      • Representing semantic relationships among a set of arguments


 Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.
     Birthday:    Barack Obama → August 4, 1961
     Birthplace:  Barack Obama → Honolulu

                     Person          Barack Obama
                     Birthday        August 4, 1961
                     Birthplace      Honolulu
                                                                       4
Previous Approaches
• Many supervised machine learning approaches have been
  successfully applied to the relation detection and classification (RDC) task
    (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta
     and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al.,
     2006)
    Large amounts of training data are required
• Weakly-supervised techniques have been sought
    (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
    To learn the IE system without significant annotation effort
• Open Information Extraction
    (Banko et al., 2007; Wu and Weld, 2010)

                                                                          5
Open Information Extraction
• An alternative weakly-supervised IE paradigm
    (Banko et al., 2007)
• Problem Definition
    $f: D \rightarrow \{\langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N\}$
    Binary relation extraction between ei and ej
    Considering relationships explicitly represented by ri,j
• Goal
    Large-scale IE
       • Domain-independent
       • Relation-independent
    Without hand-crafted rules or hand-annotated training examples

                                                                        7
How to Eliminate Human Supervision
• Self-supervised Learning for Open IE
    Using automatically obtained training examples
      • From external knowledge

• Previous Systems
    TextRunner (Banko et al., 2007)
      • Penn Treebank
      • A small set of heuristics about syntactic structural constraints
    WoE (Wu and Weld, 2010)
      • Wikipedia articles
      • Wikipedia Infoboxes




                                                                           8
What’s the Problem?
• Previous approaches mainly depend on language-specific
  knowledge for English
    Heuristic-based Approach
      • Syntactic treebank for the target language
      • Heuristics designed for the target language
    Wikipedia-based Approach
      • Wikipedia articles and infoboxes are available not only for English
      • Differences among languages in the amount of available resources
           English Wikipedia: 3,500,000 articles
           Korean Wikipedia: 150,000 articles




                                                                              9
Cross-lingual Annotation Projection
• Goal
   To obtain training examples for the target language LT
• Method
   To leverage parallel corpora to project the annotations on the
    source language LS to the target language LT
   The premise is that parallel corpora between LS and LT are much
    easier to obtain than the task-specific training dataset for LT

          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
      Barack Obama was born in Honolulu , Hawaii .


   버락 오바마               는       하와이         의      호놀룰루              에서        태어났다
   (beo-rak-o-ba-ma)   (neun)   (ha-wa-i)   (ui)   (ho-nol-rul-ru)   (e-seo)   (tae-eo-nat-da)


   <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
                                                                                                 11
Cross-lingual Annotation Projection
• Previous Work
    Part-of-speech tagging (Yarowsky and Ngai, 2001)
    Named-entity tagging (Yarowsky et al., 2001)
    Verb classification (Merlo et al., 2002)
    Dependency parsing (Hwa et al., 2005)
    Mention detection (Zitouni and Florian, 2008)
    Semantic role labeling (Pado and Lapata, 2009)
• To the best of our knowledge, no work has reported on the
  Open IE task



                                                         12
Annotation
• To obtain annotations for the sentences in LS
• Procedure
    A set of entities in the given sentence is identified
    Each instance is composed of a pair of entities
    For each instance, extraction is performed


         <e1, r12, e2> = <Barack Obama, was born in, Honolulu>

      Barack Obama was born in Honolulu , Hawaii .




                                                                 16
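The annotation procedure above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes entities are already given as token spans and that each instance pairs an entity with every later entity in the sentence. The names `Entity` and `make_instances` are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    text: str
    start: int  # index of the first token
    end: int    # index one past the last token


def make_instances(entities):
    """Pair every entity with each later entity in the sentence (assumed pairing rule)."""
    ordered = sorted(entities, key=lambda e: e.start)
    return list(combinations(ordered, 2))


tokens = "Barack Obama was born in Honolulu , Hawaii .".split()
entities = [
    Entity("Barack Obama", 0, 2),
    Entity("Honolulu", 5, 6),
    Entity("Hawaii", 7, 8),
]
instances = make_instances(entities)
for first, second in instances:
    print(first.text, "|", second.text)
```

Each resulting pair would then be passed to the English Open IE system, which decides whether a relation holds and, if so, which subtext expresses it.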
Projection
• To project the annotations from the sentences in LS onto
  the sentences in LT using word alignment information
• Procedure
    A set of entities in the given sentence is identified
    Each instance is composed of a pair of entities
    For each instance, the existence of a relationship is determined
    If the instance is positive, the contextual subtext is projected
           <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
       Barack Obama was born in Honolulu , Hawaii .


    버락 오바마               는       하와이         의      호놀룰루              에서        태어났다
    (beo-rak-o-ba-ma)   (neun)   (ha-wa-i)   (ui)   (ho-nol-rul-ru)   (e-seo)   (tae-eo-nat-da)


    <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
                                                                                                  21
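The projection step on the slide above can be sketched as follows. This is a simplified illustration under stated assumptions: the word alignment is a hand-written, hypothetical mapping from English token indices to Korean token indices (in the actual system it would come from the aligner), and `project_span` and `project_triple` are hypothetical helper names. Korean tokens are written in the romanization used on the slide.

```python
def project_span(span, alignment):
    """Map a source token span [start, end) to the sorted target indices aligned to it."""
    start, end = span
    targets = set()
    for s in range(start, end):
        targets.update(alignment.get(s, ()))
    return sorted(targets)


def project_triple(triple, alignment, target_tokens):
    """Project the (e1, relation, e2) spans; give up if any part has no aligned target word."""
    projected = []
    for span in triple:
        indices = project_span(span, alignment)
        if not indices:
            return None
        projected.append(" ".join(target_tokens[i] for i in indices))
    return tuple(projected)


source = "Barack Obama was born in Honolulu , Hawaii .".split()
target = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
          "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]

# Hypothetical alignment (source index -> target indices), written by hand for illustration.
alignment = {0: [0], 1: [0], 2: [6], 3: [6], 4: [5], 5: [4], 7: [2]}

# Token spans for <Barack Obama, was born in, Honolulu>.
triple = ((0, 2), (2, 5), (5, 6))
result = project_triple(triple, alignment, target)
print(result)
```

With this alignment the projected triple matches the Korean example on the slide. Real alignments are noisy, which is why the system reorganizes them at the chunk level before projecting.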
Overall Architecture

[Figure: overall architecture]
• Self-Supervision: English-Korean Parallel Corpus → Korean Annotated Corpus
• Learning: Korean Annotated Corpus → Korean Open IE Model
• Extraction: Korean Raw Text + Korean Open IE Model → Extracted Results

                                              23
Cross-lingual Annotation Projection-
      based Self-Supervision
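[Figure: self-supervision pipeline over the Parallel Corpus]
• Annotation
    English Sentences → English Preprocessors → English Open IE System
     → English Annotated Corpus
• Projection
    Korean Sentences → Korean Preprocessors → Word Alignment → Projection
     → Korean Annotated Corpus
    Projection combines the English Annotated Corpus with the word alignment

                                                                   24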
Cross-lingual Annotation Projection-
       based Self-Supervision
• Dataset
    English-Korean Parallel Corpus
      • 266,892 bi-sentence pairs in English and Korean

• Preprocessors
    English
      • OpenNLP toolkit
    Korean
      • Espresso toolkit




                                                          25
Cross-lingual Annotation Projection-
       based Self-Supervision
• English Open IE
    Our own implementation of Banko's method
      • Dataset
           The WSJ part of Penn Treebank
           By applying a series of heuristics (Banko, 2009)
           1,028,361 instances from 49,208 sentences (9.0% were positive)
      • Model
           Conditional Random Fields (CRF)
                • With Lexical and POS tag features
                • CRF++ toolkit




                                                                             26
Cross-lingual Annotation Projection-
       based Self-Supervision
• Word Alignment
   Aligned with the GIZA++ toolkit
     • Run in the standard configuration in both directions
     • The bi-directional alignments were joined using the grow-diag-final
       algorithm
   Chunk-based Reorganization
     • To reduce the word alignment errors
     • Generating alignments between pairs of base phrase chunks
     • Using a simple greedy algorithm
          Based on the overlap score of aligned words between base phrase chunks




                                                                               27
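The chunk-based reorganization can be illustrated with a small sketch. The slides do not define the overlap score or the exact greedy procedure, so the following is an assumption: the score is the number of word links between two chunks divided by the combined chunk length, and each chunk is aligned at most once. `chunk_of` and `align_chunks` are hypothetical names.

```python
def chunk_of(index, chunks):
    """Return the position of the chunk [start, end) containing a token index, or None."""
    for k, (start, end) in enumerate(chunks):
        if start <= index < end:
            return k
    return None


def align_chunks(word_links, src_chunks, tgt_chunks):
    """Greedily pick one-to-one chunk pairs in order of decreasing overlap score."""
    counts = {}
    for s, t in word_links:
        i, j = chunk_of(s, src_chunks), chunk_of(t, tgt_chunks)
        if i is not None and j is not None:
            counts[(i, j)] = counts.get((i, j), 0) + 1

    def score(pair):
        # Assumed score: shared word links normalized by the combined chunk length.
        i, j = pair
        size = (src_chunks[i][1] - src_chunks[i][0]) + (tgt_chunks[j][1] - tgt_chunks[j][0])
        return counts[pair] / size

    used_src, used_tgt, aligned = set(), set(), []
    for i, j in sorted(counts, key=lambda p: (-score(p), p)):
        if i not in used_src and j not in used_tgt:
            aligned.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return sorted(aligned)


# English chunks: [Barack Obama] [was born] [in Honolulu] [Hawaii]
src_chunks = [(0, 2), (2, 4), (4, 6), (7, 8)]
# Korean chunks (romanized): [beo-rak-o-ba-ma neun] [ha-wa-i ui] [ho-nol-rul-ru e-seo] [tae-eo-nat-da]
tgt_chunks = [(0, 2), (2, 4), (4, 6), (6, 7)]
# Hypothetical word links; (3, 2) is a deliberately wrong link ("born" -> "ha-wa-i").
word_links = [(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2), (3, 2)]

chunk_alignment = align_chunks(word_links, src_chunks, tgt_chunks)
print(chunk_alignment)
```

In this toy case the spurious word link is outvoted by the correct links inside the same chunk, which is the intended effect of moving from word-level to chunk-level alignments.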
Cross-lingual Annotation Projection-
       based Self-Supervision
• Annotated Dataset
    English
    598,115 instances
      • 169,771 positive instances

• Projected Dataset
    Korean
    278,730 instances
      • 89,743 positive instances




                                     28
Learning & Extraction
• Extractor for Korean Open IE
    Maximum Entropy (ME) model
      • To detect whether or not each given instance is positive
      • Features
           Lexical, POS Tag
           On the dependency path
      • Maximum Entropy Modeling toolkit
    Conditional Random Fields (CRF) model
      • To identify the contextual subtext indicating the semantic relationship
      • Features
           Lexical, POS Tag
           On the dependency path
      • CRF++ toolkit


                                                                              29
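One way to picture the CRF stage is as token-level sequence labeling over instances that the ME model has already judged positive. The label set below (`ENT`, `B-REL`, `I-REL`, `O`) is an assumption for illustration; the slides do not specify the actual encoding, and `encode_labels` is a hypothetical helper.

```python
def encode_labels(n_tokens, e1, e2, relation_indices):
    """Build one label per token: entity tokens, relation subtext tokens, and everything else."""
    labels = ["O"] * n_tokens
    for start, end in (e1, e2):
        for i in range(start, end):
            labels[i] = "ENT"
    previous = None
    for i in sorted(relation_indices):
        # A relation token continues a segment only if it directly follows another one.
        labels[i] = "I-REL" if previous is not None and i == previous + 1 else "B-REL"
        previous = i
    return labels


target = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
          "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]
# Projected example: e1 = beo-rak-o-ba-ma, e2 = ho-nol-rul-ru, relation = e-seo tae-eo-nat-da
labels = encode_labels(len(target), (0, 1), (4, 5), [5, 6])
for token, label in zip(target, labels):
    print(token, label)
```

Sequences like this, paired with lexical and POS features, are the kind of input a CRF toolkit trains on.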
Evaluation #1
• Dataset
    250 sentences from Korean Wikipedia articles
    With manually annotated gold standard
      • 1,434 instances
      • 308 positive instances

• Baseline
    Heuristic-based System
      • Sejong treebank corpus (Korean)
      • The heuristics used for the English Open IE system, excluding
        language-specific rules




                                                                             31
Evaluation #1
• Comparison of performances



                Model              P    R    F
               Heuristic          47.7 20.1 28.3
              Projection          33.6 49.0 39.8
         Heuristic + Projection   41.9 46.4 44.1




                                                   32
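As a quick sanity check on the table above, the F column can be recomputed from P and R. This assumes F is the balanced F1 score (the harmonic mean of precision and recall), which the slides do not state explicitly. Because P and R are themselves rounded, the recomputed values can differ from the reported ones in the last digit.

```python
def f1(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) as given in the table above.
rows = {
    "Heuristic": (47.7, 20.1, 28.3),
    "Projection": (33.6, 49.0, 39.8),
    "Heuristic + Projection": (41.9, 46.4, 44.1),
}

recomputed = {name: f1(p, r) for name, (p, r, _) in rows.items()}
for name, (p, r, reported) in rows.items():
    print(f"{name}: reported {reported}, recomputed {recomputed[name]:.2f}")
```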
Evaluation #2
• Datasets
    Korean Newswire
       • 302,276 documents
       • 2,565,487 sentences
    Korean Wikipedia
       • 123,000 articles
       • 1,342,003 sentences

• Manual Evaluation
    For four relation types
       • BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF




                                                          36
Evaluation #2
• Evaluation results for four relation types

                      Newswire                       Wikipedia
  Type           Precision (%)   Extractions    Precision (%)   Extractions
  Birth Place        65.2            256            69.1            971
  Won Award          57.4            824            63.3            286
  Acquisition        67.0          1,112            50.3            143
  Invent Of          53.1             32            47.6            103




       3,727 extractions with a precision of 63.7% for four relation types



                                                                                 37
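The overall figure on this slide is consistent with an extraction-weighted average of the per-cell precisions. The short check below assumes that this is how the aggregate was formed; the variable names are illustrative.

```python
# (precision in %, number of extractions) for each corpus and relation type, from the table above.
results = {
    ("Newswire", "Birth Place"): (65.2, 256),
    ("Newswire", "Won Award"): (57.4, 824),
    ("Newswire", "Acquisition"): (67.0, 1112),
    ("Newswire", "Invent Of"): (53.1, 32),
    ("Wikipedia", "Birth Place"): (69.1, 971),
    ("Wikipedia", "Won Award"): (63.3, 286),
    ("Wikipedia", "Acquisition"): (50.3, 143),
    ("Wikipedia", "Invent Of"): (47.6, 103),
}

total_extractions = sum(n for _, n in results.values())
# Weight each precision by its number of extractions.
overall_precision = sum(p * n for p, n in results.values()) / total_extractions

print(total_extractions, round(overall_precision, 1))
```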
Evaluation #2
• Distribution of the errors



             Error Type                 # of errors
             Chunking Error             364 (26.9%)
             Dependency Parsing Error   461 (34.1%)
             Extracting Error           527 (39.0%)




                                                      38
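The percentages in the error table follow directly from the counts, and the total number of errors lines up with the reported precision if one assumes the errors were counted over the same 3,727 extractions (an assumption; the slide does not say so explicitly).

```python
errors = {
    "Chunking Error": 364,
    "Dependency Parsing Error": 461,
    "Extracting Error": 527,
}

total_errors = sum(errors.values())
# Share of each error type among all errors, in percent.
shares = {name: 100 * count / total_errors for name, count in errors.items()}

# Assumed denominator: the 3,727 extractions from the previous slide.
total_extractions = 3727
implied_precision = 100 * (1 - total_errors / total_extractions)

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"implied precision: {implied_precision:.1f}%")
```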
Conclusions
• Summary
    A cross-lingual annotation projection approach for Open IE
    A Korean Open IE system was developed using an English Open IE
     system and an English-Korean parallel corpus
    Our system outperformed the heuristic-based system
    Our system achieved 63.7% precision in a large-scale
     evaluation
• Ongoing Work
    Reducing sensitivity to errors made by the preprocessors
    Investigating hybrid approaches that consider various external
     knowledge sources


                                                                    40
Q&A

Sequential Labeling for Tracking Dynamic Dialog States
 
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
A Graph-based Cross-lingual Projection Approach for Spoken Language Understan...
 
MMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognitionMMR-based active machine learning for Bio named entity recognition
MMR-based active machine learning for Bio named entity recognition
 
A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...A semi-supervised method for efficient construction of statistical spoken lan...
A semi-supervised method for efficient construction of statistical spoken lan...
 
A spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information accessA spoken dialog system for electronic program guide information access
A spoken dialog system for electronic program guide information access
 
An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...An alignment-based approach to semi-supervised relation extraction including ...
An alignment-based approach to semi-supervised relation extraction including ...
 
An Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information ExtractionAn Alignment-based Pattern Representation Model for Information Extraction
An Alignment-based Pattern Representation Model for Information Extraction
 

Kürzlich hochgeladen

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Kürzlich hochgeladen (20)

Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction

  • 1. A Cross-lingual Annotation Projection-based Self-supervision Approach for Open Information Extraction
      The 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 10th, 2011, Chiang Mai
      Seokhwan Kim (POSTECH), Minwoo Jeong (Microsoft Bing), Jonghoon Lee (POSTECH), Gary Geunbae Lee (POSTECH)
  • 2. Contents
      • Introduction
      • Open Information Extraction
      • Cross-Lingual Annotation Projection
      • Implementation
      • Evaluation
      • Conclusions
  • 4. Information Extraction
      • Goal
          - To generate structured information from natural language documents
              · Representing semantic relationships among a set of arguments
      • Example: "Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."
          Person      Barack Obama
          Birthday    August 4, 1961
          Birthplace  Honolulu
  • 5. Previous Approaches
      • Many supervised machine learning approaches have been successfully applied to the RDC task
          - (Kambhatla, 2004; Zhou et al., 2005; Zelenko et al., 2003; Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Zhang et al., 2006)
          - Large amounts of training data are required
      • Weakly-supervised techniques have been sought
          - (Zhang, 2004; Chen et al., 2006; Zhou et al., 2009)
          - To learn the IE system without significant annotation effort
      • Open Information Extraction
          - (Banko et al., 2007; Wu and Weld, 2010)
  • 7. Open Information Extraction
      • An alternative weakly-supervised IE paradigm
          - (Banko et al., 2007)
      • Problem Definition
          $f: D \rightarrow \{\langle e_i, r_{i,j}, e_j \rangle \mid 1 \le i, j \le N\}$
          - Binary relation extraction between $e_i$ and $e_j$
          - Considering relationships explicitly represented by $r_{i,j}$
      • Goal
          - Large-scale IE
              · Domain-independent
              · Relation-independent
          - Without hand-crafted rules or hand-annotated training examples
  • 8. How to Eliminate Human Supervision
      • Self-supervised Learning for Open IE
          - Using automatically obtained training examples
              · From external knowledge
      • Previous Systems
          - TextRunner (Banko et al., 2007)
              · Penn Treebank
              · A small set of heuristics about syntactic structural constraints
          - WoE (Wu and Weld, 2010)
              · Wikipedia articles
              · Wikipedia Infoboxes
  • 9. What's the Problem?
      • Previous approaches mainly depend on language-specific knowledge for English
          - Heuristic-based Approach
              · Syntactic treebank for the target language
              · Heuristics designed for the target language
          - Wikipedia-based Approach
              · Wikipedia articles and infoboxes are available not only for English
              · Differences among languages in the amount of available resources
                  English Wikipedia: 3,500,000 articles
                  Korean Wikipedia: 150,000 articles
  • 11. Cross-lingual Annotation Projection
      • Goal
          - To obtain training examples for the target language LT
      • Method
          - To leverage parallel corpora to project the annotations on the source language LS to the target language LT
          - The premise is that parallel corpora between LS and LT are much easier to obtain than the task-specific training dataset for LT
      • Example
          English: Barack Obama was born in Honolulu , Hawaii .
              <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
          Korean: 버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
              <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
  • 12. Cross-lingual Annotation Projection
      • Previous Work
          - Part-of-speech tagging (Yarowsky and Ngai, 2001)
          - Named-entity tagging (Yarowsky et al., 2001)
          - Verb classification (Merlo et al., 2002)
          - Dependency parsing (Hwa et al., 2005)
          - Mention detection (Zitouni and Florian, 2008)
          - Semantic role labeling (Pado and Lapata, 2009)
      • To the best of our knowledge, no work has reported on the Open IE task
  • 16. Annotation
      • To obtain annotations for the sentences in LS
      • Procedure
          - A set of entities in the given sentence is identified
          - Each instance is composed of a pair of entities
          - For each instance, extraction is performed
      • Example
          Barack Obama was born in Honolulu , Hawaii .
          <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
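The second step of the procedure (one instance per pair of entities) can be sketched as follows. This is only an illustration: the function name `make_instances` is hypothetical, and treating pairs as unordered and kept in sentence order is an assumption that the slides do not confirm.

```python
from itertools import combinations


def make_instances(entities):
    """Return one candidate instance per pair of entities.

    Assumption: pairs are unordered and follow sentence order.
    """
    return list(combinations(entities, 2))


# Entities identified in "Barack Obama was born in Honolulu , Hawaii ."
entities = ["Barack Obama", "Honolulu", "Hawaii"]
instances = make_instances(entities)

for arg1, arg2 in instances:
    print(arg1, "|", arg2)
```

Each resulting pair is then passed to the extractor, which either returns a relation phrase (a positive instance) or nothing.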
  • 21. Projection
      • To project the annotations from the sentences in LS onto the sentences in LT using word alignment information
      • Procedure
          - A set of entities in the given sentence is identified
          - Each instance is composed of a pair of entities
          - For each instance, the existence of relationship is determined
          - If the instance is positive, the contextual subtext is projected
      • Example
          English: Barack Obama was born in Honolulu , Hawaii .
              <e1, r12, e2> = <Barack Obama, was born in, Honolulu>
          Korean: 버락 오바마 (beo-rak-o-ba-ma) 는 (neun) 하와이 (ha-wa-i) 의 (ui) 호놀룰루 (ho-nol-rul-ru) 에서 (e-seo) 태어났다 (tae-eo-nat-da)
              <e1, r13, e3> = <beo-rak-o-ba-ma, e-seo tae-eo-nat-da, ho-nol-rul-ru>
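A minimal sketch of the projection step on the example sentence, assuming word alignments are given as (source index, target index) pairs. The alignment links below are written by hand for illustration, not produced by an aligner, and the helper names `project_span` and `project_triple` are hypothetical.

```python
def project_span(src_indices, alignment):
    """Map source token indices to the sorted target indices they align to."""
    src = set(src_indices)
    return sorted({t for s, t in alignment if s in src})


def project_triple(triple, alignment, tgt_tokens):
    """Project (arg1, relation, arg2), each given as a list of source indices."""
    projected = []
    for span in triple:
        idx = project_span(span, alignment)
        if not idx:
            return None  # one part has no aligned target token
        projected.append(" ".join(tgt_tokens[i] for i in idx))
    return tuple(projected)


en = ["Barack", "Obama", "was", "born", "in", "Honolulu", ",", "Hawaii", "."]
# Romanized Korean units, as shown on the slide.
ko = ["beo-rak-o-ba-ma", "neun", "ha-wa-i", "ui",
      "ho-nol-rul-ru", "e-seo", "tae-eo-nat-da"]

# Hand-written alignment links (assumption).
alignment = [(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2)]

# <Barack Obama, was born in, Honolulu> as source token indices.
source_triple = ([0, 1], [2, 3, 4], [5])
projected = project_triple(source_triple, alignment, ko)
print(projected)
```

With these links the projected triple matches the one on the slide; real alignments are noisier, which is why the implementation adds a chunk-based reorganization step.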
  • 23. Overall Architecture
      [Figure: architecture diagram. Components: English-Korean Parallel Corpus, Raw Korean Text, Self-Supervision, Learning, Extraction, Korean Annotated Corpus, Korean Open IE Model, Extracted Results]
  • 24. Cross-lingual Annotation Projection-based Self-Supervision
      [Figure: self-supervision diagram with an Annotation stage and a Projection stage. Components: Parallel Corpus, English Sentences, Korean Sentences, English Preprocessors, Korean Preprocessors, Word Alignment, English Open IE System, Projection, English Annotated Corpus, Korean Annotated Corpus]
  • 25. Cross-lingual Annotation Projection-based Self-Supervision
      • Dataset
          - English-Korean Parallel Corpus
              · 266,892 bi-sentence pairs in English and Korean
      • Preprocessors
          - English
              · OpenNLP toolkit
          - Korean
              · Espresso toolkit
  • 26. Cross-lingual Annotation Projection-based Self-Supervision
      • English Open IE
          - Our own implementation of Banko's method
      • Dataset
          - The WSJ part of Penn Treebank
          - By applying a series of heuristics (Banko, 2009)
          - 1,028,361 instances from 49,208 sentences (9.0% were positive)
      • Model
          - Conditional Random Fields (CRF)
              · With lexical and POS tag features
              · CRF++ toolkit
  • 27. Cross-lingual Annotation Projection-based Self-Supervision
      • Word Alignment
          - Aligned by GIZA++ toolkit
              · In the standard configuration in both directions
              · The bi-directional alignments were joined using the grow-diag-final algorithm
          - Chunk-based Reorganization
              · To reduce the word alignment errors
              · Generating alignments between pairs of base phrase chunks
              · Using a simple greedy algorithm based on the overlap score of aligned words between base phrase chunks
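The chunk-based reorganization could look roughly like the sketch below. The slides do not define the overlap score or the exact greedy procedure, so both are assumptions here: the score is the number of word links between two chunks divided by their combined length, and each chunk is aligned at most once. The function name `align_chunks` and the example links are hypothetical.

```python
def align_chunks(src_chunks, tgt_chunks, word_links):
    """Greedy one-to-one alignment of base phrase chunks.

    src_chunks, tgt_chunks: lists of lists of token indices.
    word_links: iterable of (src_index, tgt_index) word alignments.
    """
    src_of = {i: c for c, chunk in enumerate(src_chunks) for i in chunk}
    tgt_of = {j: c for c, chunk in enumerate(tgt_chunks) for j in chunk}

    # Count word links falling between each pair of chunks.
    counts = {}
    for s, t in word_links:
        if s in src_of and t in tgt_of:
            key = (src_of[s], tgt_of[t])
            counts[key] = counts.get(key, 0) + 1

    # Assumed overlap score: links / (source chunk length + target chunk length).
    scored = sorted(
        ((n / (len(src_chunks[a]) + len(tgt_chunks[b])), a, b)
         for (a, b), n in counts.items()),
        key=lambda x: (-x[0], x[1], x[2]),
    )

    used_src, used_tgt, pairs = set(), set(), []
    for _, a, b in scored:  # best-scoring pairs first
        if a not in used_src and b not in used_tgt:
            pairs.append((a, b))
            used_src.add(a)
            used_tgt.add(b)
    return sorted(pairs)


# English chunks: [Barack Obama] [was born] [in Honolulu] [Hawaii]
src_chunks = [[0, 1], [2, 3], [4, 5], [7]]
# Korean chunks: [beo-rak-o-ba-ma neun] [ha-wa-i ui] [ho-nol-rul-ru e-seo] [tae-eo-nat-da]
tgt_chunks = [[0, 1], [2, 3], [4, 5], [6]]
# Word links including one noisy link (3, 4): "born" -> "ho-nol-rul-ru".
word_links = [(0, 0), (1, 0), (2, 6), (3, 6), (4, 5), (5, 4), (7, 2), (3, 4)]

chunk_pairs = align_chunks(src_chunks, tgt_chunks, word_links)
print(chunk_pairs)
```

In this toy case the noisy word link is discarded because its source chunk is already paired with a higher-scoring target chunk, which is the kind of error reduction the slide describes.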
  • 28. Cross-lingual Annotation Projection-based Self-Supervision
      • Annotated Dataset
          - English
          - 598,115 instances
              · 169,771 positive instances
      • Projected Dataset
          - Korean
          - 278,730 instances
              · 89,743 positive instances
  • 29. Learning & Extraction
      • Extractor for Korean Open IE
          - Maximum Entropy (ME) model
              · To detect whether or not each given instance is positive
              · Features: lexical and POS tag; on the dependency path
              · Maximum Entropy Modeling toolkit
          - Conditional Random Fields (CRF) model
              · To identify the contextual subtext indicating the semantic relationship
              · Features: lexical and POS tag; on the dependency path
              · CRF++ toolkit
  • 31. Evaluation #1
      • Dataset
          - 250 sentences from Korean Wikipedia articles
          - With manually annotated gold standard
              · 1,434 instances
              · 308 positive instances
      • Baseline
          - Heuristic-based System
              · Sejong treebank corpus (Korean)
              · A set of heuristics utilized for the English Open IE system except language-specific rules
  • 32. Evaluation #1
      • Comparison of performances
          Model                   P      R      F
          Heuristic               47.7   20.1   28.3
          Projection              33.6   49.0   39.8
          Heuristic + Projection  41.9   46.4   44.1
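As a quick sanity check on the table, the F column can be recomputed from P and R, assuming F is the balanced F1 score (the slides do not state the F variant). Because the reported P and R are already rounded, the recomputed values agree with the reported ones only to within about 0.1.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (P, R, reported F) copied from the comparison table.
reported = {
    "Heuristic": (47.7, 20.1, 28.3),
    "Projection": (33.6, 49.0, 39.8),
    "Heuristic + Projection": (41.9, 46.4, 44.1),
}

for model, (p, r, f) in reported.items():
    print(f"{model}: recomputed F = {f1(p, r):.2f}, reported F = {f}")
```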
  • 36. Evaluation #2
      • Datasets
          - Korean Newswire
              · 302,276 documents
              · 2,565,487 sentences
          - Korean Wikipedia
              · 123,000 articles
              · 1,342,003 sentences
      • Manual Evaluation
          - For four relation types
              · BIRTH_PLACE, WON_AWARD, ACQUISITION, INVENT_OF
  • 37. Evaluation #2
      • Evaluation results for four relation types
                        Newswire                        Wikipedia
          Type          Precision (%)  # extractions    Precision (%)  # extractions
          Birth Place   65.2           256              69.1           971
          Won Award     57.4           824              63.3           286
          Acquisition   67.0           1,112            50.3           143
          Invent Of     53.1           32               47.6           103
      • 3,727 extractions with a precision of 63.7% for four relation types
  • 38. Evaluation #2
      • Distribution of the errors
          Error Type                # of errors
          Chunking Error            364 (26.9%)
          Dependency Parsing Error  461 (34.1%)
          Extracting Error          527 (39.0%)
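The overall figures on this slide can be reproduced from the per-type rows by weighting each precision by its number of extractions. The variable names below are illustrative only.

```python
# (precision in %, number of extractions) for each relation type and corpus.
rows = [
    (65.2, 256), (69.1, 971),    # Birth Place: Newswire, Wikipedia
    (57.4, 824), (63.3, 286),    # Won Award
    (67.0, 1112), (50.3, 143),   # Acquisition
    (53.1, 32), (47.6, 103),     # Invent Of
]

total = sum(n for _, n in rows)
# Extraction-weighted (micro-averaged) precision over all rows.
overall = sum(p * n for p, n in rows) / total

print(total, round(overall, 1))
```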
  • 40. Conclusions
      • Summary
          - A Cross-lingual Annotation Projection Approach for Open IE
          - Korean Open IE system developed using an English Open IE system and an English-Korean parallel corpus
          - Our system outperformed the heuristic-based system
          - Our system achieved a precision of 63.7% in a large-scale evaluation
      • Ongoing Work
          - Reducing sensitivity to the errors committed by preprocessors
          - Investigating hybrid approaches considering various external knowledge sources
  • 41. Q&A