Introduction
                              Corpus
                        Identification
                          Conclusions




To Be or Not To Be a Zero Pronoun?
  A Machine Learning Approach For Romanian


Claudiu Mihăilă (1), Iustina Ilisei (2), Diana Inkpen (3)

  (1) Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iași
  (2) Research Institute in Information and Language Processing, University of Wolverhampton
  (3) School of Information Technology and Engineering, University of Ottawa


PROMISE, 29 March 2010, Iași, Romania

              Mihăilă, Ilisei & Inkpen          Identifying Romanian Zero Pronouns

Outline

  1   Introduction
         Motivation
         Zero Subjects vs. Zero Pronouns
         Previous Work
  2   Corpus
        Annotation
        Statistics
  3   Identification
        Features
        Algorithms
        Results
  4   Conclusions


Motivation

  The problem
       Invisible anaphors
       Lack of morphological information

  Utility
       Information extraction/retrieval
       Automatic summarisation
       Machine translation
       Multiple-choice test item generation
       etc.



Zero Subjects vs. Zero Pronouns


  Zero subjects
      The verb does not need a subject
          Plouă. ("It is raining.")
          Îmi pare rău de voi. ("I feel sorry for you.")
          Azi nu-mi arde de glumă. ("Today I am in no mood for jokes.")

  Zero pronouns
      Lexically retrievable from the inflection of the verb
      Coreferring with an overt noun, noun phrase, or clause
          zp[Eu] Merg la școală. ("[I] go to school.")
          Cine a auzit s-a întors și zp[acela] a plecat. ("Whoever heard came back and [that one] left.")




Previous Work


  For other languages
      Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
      Chinese: Converse (2006), Zhao & Ng (2007)
      Japanese, Korean, Portuguese, etc.

  For Romanian
      Harabagiu & Maiorano (2000)
      Pavel et al. (2006)





Annotation

  Empty XML tag with attributes
      id
      antecedent – the reference id, 'non-nominal', or 'elliptic'
      dependent verb – the reference id
      clause type – main, coordinated, juxtaposed, or subordinated
      annotator confidence – regarding the position, high or low
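
To make the attribute list above concrete, here is a minimal sketch that reads one empty ZP tag with Python's standard `xml.etree.ElementTree`. The slides name the attributes but not the markup itself, so the element name `zp`, the attribute spellings (`depend_verb`, `clause_type`, `confidence`), the word ids, and the attribute values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, attribute names and ids are illustrative,
# not taken from the corpus itself.
SAMPLE = (
    '<s>'
    '<zp id="zp1" antecedent="elliptic" depend_verb="w2" '
    'clause_type="main" confidence="high"/>'
    '<w id="w2">Merg</w> <w id="w3">la</w> <w id="w4">școală</w>.'
    '</s>'
)


def read_zero_pronouns(xml_text):
    """Return one attribute dictionary per empty <zp/> element."""
    root = ET.fromstring(xml_text)
    return [dict(zp.attrib) for zp in root.iter("zp")]


zps = read_zero_pronouns(SAMPLE)
for zp in zps:
    print(zp["id"], "depends on verb", zp["depend_verb"], "in a", zp["clause_type"], "clause")
```

Because the tag is empty, it marks a position in the sentence without adding any surface text; everything the annotators recorded lives in the attributes.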

  Inter-annotator agreement
      Agreement on ZP’s dependent verb: ≈ 98%
           Cohen’s Kappa Coefficient: κ ≈ 90%
      Agreement on ZP’s position in text: ≈ 90%
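
For readers unfamiliar with the kappa figure above, the following sketch computes Cohen's kappa for two annotators from its textbook definition, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The label lists are invented toy data, not the corpus annotations, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (invented): does a given verb carry a zero pronoun?
annotator_a = ["zp", "zp", "none", "none", "zp", "none", "zp", "none", "none", "zp"]
annotator_b = ["zp", "zp", "none", "none", "zp", "none", "none", "none", "none", "zp"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

In this toy case the raw agreement is 90% while kappa is lower, which is why the slide reports both figures.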


Statistics



  Corpus size
     Overview                NT                ET         LT          ST             Overall
     No. of tokens          18690             12963     13739        3391            48783
     No. of sentences        816               574       790          253             2433
     No. of ZPs              245               172       113          251             781
     Avg. tokens/sent.      22.90             22.58     17.39        13.40           20.05
     Avg. ZP/sent.           0.30              0.30      0.14        0.99             0.32
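
The averages and the "Overall" column in the table can be rederived from the raw counts. The sketch below does so; the counts are copied from the table, while the dictionary layout and function name are assumptions made for this example.

```python
# Raw counts copied from the table; the four column labels are kept as given.
CORPUS = {
    "NT": {"tokens": 18690, "sentences": 816, "zps": 245},
    "ET": {"tokens": 12963, "sentences": 574, "zps": 172},
    "LT": {"tokens": 13739, "sentences": 790, "zps": 113},
    "ST": {"tokens": 3391, "sentences": 253, "zps": 251},
}


def summarise(counts):
    """Derive the two per-sentence averages from raw counts."""
    return {
        "tokens_per_sentence": round(counts["tokens"] / counts["sentences"], 2),
        "zps_per_sentence": round(counts["zps"] / counts["sentences"], 2),
    }


# The overall column is the sum over the four parts.
overall = {
    key: sum(part[key] for part in CORPUS.values())
    for key in ("tokens", "sentences", "zps")
}

for name, counts in list(CORPUS.items()) + [("Overall", overall)]:
    print(name, counts, summarise(counts))
```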





Features

  10 features
      From RACAI's parser
          type – main, auxiliary, copulative, or modal
          mood – indicative, subjunctive, etc.
          tense – present, imperfect, past, or pluperfect
          person – first, second, or third
          number – singular or plural
          gender – masculine, feminine, or neuter
          clitic – whether clitic form or not
      Dynamically computed
          impersonality – whether strictly impersonal or not
          'se' – verb preceded by reflexive pronoun 'se'
      The verb class from the manual annotation
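
One possible way to hold such an instance in code is sketched below. The field names, the value spellings, and the `"unknown"` placeholder for gender are assumptions; the slides list the features but not how they were encoded for the classifiers.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VerbInstance:
    """One verb described by the features listed above (names are illustrative)."""
    verb_type: str        # main, auxiliary, copulative, or modal
    mood: str             # indicative, subjunctive, ...
    tense: str            # present, imperfect, past, or pluperfect
    person: str           # first, second, or third
    number: str           # singular or plural
    gender: str           # masculine, feminine, or neuter
    clitic: bool          # whether clitic form or not
    impersonal: bool      # whether strictly impersonal or not
    preceded_by_se: bool  # verb preceded by reflexive pronoun 'se'
    has_zp: bool          # class label from the manual annotation


def to_nominal_row(instance):
    """Flatten an instance into lower-case nominal values, class label last."""
    return [str(value).lower() for value in asdict(instance).values()]


# "Merg la școală": first person singular, present indicative, with a zero pronoun.
merg = VerbInstance(
    verb_type="main", mood="indicative", tense="present", person="first",
    number="singular", gender="unknown",  # placeholder value, an assumption
    clitic=False, impersonal=False, preceded_by_se=False, has_zp=True,
)
row = to_nominal_row(merg)
print(row)
```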


Algorithms


  Weka classifiers
      SMO – implementation of SVM
      JRip – implementation of decision rules
      J48 – implementation of decision trees
      Vote – majority-voting meta-classifier on previous three

  Data set
      781 verbs with a ZP
      781 randomly selected verbs without a ZP
      10-fold cross validation
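
The experimental setup above can be sketched in plain Python, without Weka: draw as many negative verbs as there are positive ones, split the result into 10 folds, and combine three base predictions by majority vote. This is a simplified stand-in, not the Weka implementation; the function names, the seed, and the size of the negative pool (3000) are assumptions, and the folds here are shuffled but not stratified.

```python
import random
from collections import Counter


def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equal number of negatives at random."""
    rng = random.Random(seed)
    return list(positives) + rng.sample(list(negatives), len(positives))


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


def majority_vote(predictions):
    """Return the label predicted by most base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# 781 verbs with a ZP; the pool of 3000 verbs without one is a made-up size.
positives = [("has_zp", i) for i in range(781)]
negatives = [("not_zp", i) for i in range(3000)]
dataset = balanced_sample(positives, negatives)
folds = list(k_fold_indices(len(dataset), k=10))

print(len(dataset), [len(test) for _, test in folds])
print(majority_vote(["has_zp", "has_zp", "not_zp"]))
```

With three voters and two classes a tie cannot occur, so the vote is always well defined.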



Results



  Classifier results
                                  has ZP                       not ZP
    Class.    Acc.       P        R        F1         P        R        F1
    SMO      0.739     0.684    0.889    0.773      0.841    0.590    0.694
    JRip     0.733     0.709    0.793    0.748      0.765    0.675    0.717
    J48      0.720     0.698    0.777    0.735      0.749    0.663    0.703
    Vote     0.733     0.705    0.802    0.750      0.770    0.665    0.713
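
The columns of this table are related in two ways that can be checked directly. F1 is the harmonic mean of precision and recall, and because the data set holds the same number of verbs in each class, accuracy equals the mean of the two recalls. The sketch below recomputes both from the rounded table values, so small differences in the third decimal are expected; the function and variable names are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy_on_balanced_data(recall_pos, recall_neg):
    """With equal class sizes, accuracy is the mean of the two recalls."""
    return (recall_pos + recall_neg) / 2


# (precision, recall) pairs copied from the table.
RESULTS = {
    "SMO": {"has_zp": (0.684, 0.889), "not_zp": (0.841, 0.590)},
    "JRip": {"has_zp": (0.709, 0.793), "not_zp": (0.765, 0.675)},
    "J48": {"has_zp": (0.698, 0.777), "not_zp": (0.749, 0.663)},
    "Vote": {"has_zp": (0.705, 0.802), "not_zp": (0.770, 0.665)},
}

for name, row in RESULTS.items():
    f1_pos = f1_score(*row["has_zp"])
    f1_neg = f1_score(*row["not_zp"])
    acc = accuracy_on_balanced_data(row["has_zp"][1], row["not_zp"][1])
    print(f"{name}: acc={acc:.4f} F1(has ZP)={f1_pos:.4f} F1(not ZP)={f1_neg:.4f}")
```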





Results

  Attribute evaluation
                 Attribute                ChiSquare         InfoGain
                 Mood                       402.546            0.206
                 'Se'                        25.719            0.012
                 Person                      21.217            0.010
                 Impersonality               12.092            0.007
                 Tense                        9.371            0.004
                 Type                         2.577            0.001
                 Number                       0.354             1E-4
                 Gender                        7E-4             3E-7
                 Clitic                           0                0
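
As a reminder of what the InfoGain column measures, here is a minimal pure-Python sketch of information gain: the entropy of the class labels minus the weighted entropy that remains after splitting on one attribute. It follows the textbook definition rather than any particular Weka evaluator, and the four toy instances are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def info_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    remainder = sum(
        len(subset) / total * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - remainder


# Toy data (invented): four verbs, two with a zero pronoun.
labels = ["has_zp", "has_zp", "not_zp", "not_zp"]
mood = ["indicative", "indicative", "infinitive", "infinitive"]
clitic = ["no", "no", "no", "no"]

print(info_gain(mood, labels))    # attribute separates the classes perfectly
print(info_gain(clitic, labels))  # constant attribute carries no information
```

An attribute that takes the same value for every instance, like the toy `clitic` column here, necessarily scores zero; the slides do not say whether that is the reason for the Clitic row in the table.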




Results



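  Error analysis
      Ambiguity:
          E greu fără bani. ("It is hard without money.")
          E greu de scris o carte. ("It is hard to write a book.")
          Se întunecă la ora cinci. ("It gets dark at five o'clock.")
          El se întunecă la față. ("His face darkens.")
      Parser errors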





Conclusions


  Summary
      RoZP, a corpus with manually annotated ZPs
      Identification of over 70% of ZPs using ML methods

  Outlook
      Improve the identification accuracy
          other features – no. of verbs in sentence
          syntactic information?
      Resolve the identified ZPs




Thank you!
Questions?




"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

To Be or Not to be a Zero Pronoun: A Machine Learning Approach for Romanian

  • 1. To Be or Not To Be a Zero Pronoun? A Machine Learning Approach For Romanian
      Claudiu Mihăilă (1), Iustina Ilisei (2), Diana Inkpen (3)
      (1) Faculty of Computer Science, "Alexandru Ioan Cuza" University of Iași
      (2) Research Institute in Information and Language Processing, University of Wolverhampton
      (3) School of Information Technology and Engineering, University of Ottawa
      PROMISE, 29 March 2010, Iași, Romania
      Short title: Identifying Romanian Zero Pronouns (Mihăilă, Ilisei & Inkpen)
  • 2. Outline
      1 Introduction: Motivation; Zero Subjects vs. Zero Pronouns; Previous Work
      2 Corpus: Annotation; Statistics
      3 Identification: Features; Algorithms; Results
      4 Conclusions
  • 3. Motivation
      The problem: invisible anaphors; lack of morphological information
      Utility: information extraction/retrieval; automatic summarisation; machine translation; multiple-choice test item generation; etc.
  • 5. Zero Subjects vs. Zero Pronouns
      Zero subjects: the verb does not need a subject
        Plouă. ("It is raining.")
        Îmi pare rău de voi. ("I feel sorry for you.")
        Azi nu-mi arde de glumă. ("Today I am in no mood for jokes.")
      Zero pronouns: lexically retrievable from the inflection of the verb; coreferring with an overt noun, noun phrase, or clause
        zp[Eu] Merg la școală. ("[I] go to school.")
        Cine a auzit s-a întors și zp[acela] a plecat. ("Whoever heard turned around and [that one] left.")
  • 7. Previous Work
      For other languages
        Spanish: Ferrández & Peral (2000), Rello & Ilisei (2009)
        Chinese: Converse (2006), Zhao & Ng (2007)
        Japanese, Korean, Portuguese, etc.
      For Romanian
        Harabagiu & Maiorano (2000)
        Pavel et al. (2006)
  • 9. Annotation
      Empty XML tag with attributes
        id
        antecedent: the reference id, 'non-nominal', or 'elliptic'
        dependent verb: the reference id
        clause type: main, coordinated, juxtaposed, or subordinated
        annotator confidence: regarding the position, high or low
      Inter-annotator agreement
        Agreement on ZP's dependent verb: ≈ 98%
        Cohen's kappa coefficient: κ ≈ 90%
        Agreement on ZP's position in text: ≈ 90%
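The annotation slide reports Cohen's kappa without showing how it is obtained. The following is a minimal sketch of the standard two-annotator formula, κ = (p_o − p_e) / (1 − p_e), in plain Python. The label names and the toy annotations are illustrative assumptions and are not taken from the RoZP corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): ten verbs, each marked as having a ZP or not.
annotator_1 = ["zp", "zp", "none", "none", "zp", "none", "zp", "none", "none", "zp"]
annotator_2 = ["zp", "zp", "none", "none", "zp", "none", "none", "none", "none", "zp"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Raw agreement and kappa differ because kappa discounts the agreement expected by chance, which is why the slide lists both kinds of figure.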
  • 11. Statistics: corpus size overview
                              NT      ET      LT      ST   Overall
      No. of tokens        18690   12963   13739    3391     48783
      No. of sentences       816     574     790     253      2433
      No. of ZPs             245     172     113     251       781
      Avg. tokens/sent.    22.90   22.58   17.39   13.40     20.05
      Avg. ZP/sent.         0.30    0.30    0.14    0.99      0.32
  • 12. Features (10 features)
      From RACAI's parser
        type: main, auxiliary, copulative, or modal
        mood: indicative, subjunctive, etc.
        tense: present, imperfect, past, or pluperfect
        person: first, second, or third
        number: singular or plural
        gender: masculine, feminine, or neuter
        clitic: whether clitic form or not
      Dynamically computed
        impersonality: whether strictly impersonal or not
        'se': verb preceded by reflexive pronoun 'se'
      The verb class from the manual annotation
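To make the feature list concrete, here is a rough sketch of how one verb instance and the dynamically computed 'se' feature might be represented. The field names, the `preceded_by_se` helper, and the rule "the token immediately before the verb is 'se'" are assumptions for illustration; the slides do not specify the authors' implementation, and reading "verb class" as the has-ZP label is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class VerbFeatures:
    """One instance per verb; hypothetical field names mirroring the slide."""
    verb_type: str        # main, auxiliary, copulative, or modal
    mood: str             # indicative, subjunctive, etc.
    tense: str            # present, imperfect, past, or pluperfect
    person: str           # first, second, or third
    number: str           # singular or plural
    gender: str           # masculine, feminine, or neuter
    clitic: bool          # whether the verb is in clitic form
    impersonal: bool      # whether the verb is strictly impersonal
    preceded_by_se: bool  # whether reflexive 'se' precedes the verb
    has_zp: bool          # assumed reading of "verb class" (manual annotation)


def preceded_by_se(tokens, verb_index):
    """Return True if the token right before the verb is the pronoun 'se'.

    Assumption: only the immediately preceding token is checked.
    """
    if verb_index <= 0 or verb_index >= len(tokens):
        return False
    return tokens[verb_index - 1].lower() == "se"


# Tokenised sentences; the verb positions are given by hand here.
se_flag = preceded_by_se(["Se", "întunecă", "la", "ora", "cinci", "."], 1)
no_se_flag = preceded_by_se(["Merg", "la", "școală", "."], 0)
```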
  • 15. Algorithms
      Weka classifiers
        SMO: implementation of SVM
        JRip: implementation of decision rules
        J48: implementation of decision trees
        Vote: majority-voting meta-classifier on the previous three
      Data set
        781 verbs with a ZP
        781 randomly selected verbs without a ZP
        10-fold cross-validation
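The experimental setup above (a balanced sample, 10-fold cross-validation, and a majority vote over three classifiers) can be sketched in plain Python as follows. This is not Weka's API: the function names, the seeds, and the way folds are dealt out are illustrative assumptions, and the sketch does not perform stratification.

```python
import random
from collections import Counter


def balanced_sample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random sample of negatives."""
    rng = random.Random(seed)
    return list(positives) + rng.sample(list(negatives), len(positives))


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices into k folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def majority_vote(predictions):
    """Return the label predicted by most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# 781 verbs with a ZP plus 781 without give 1562 instances in total.
splits = list(k_fold_indices(781 + 781, k=10))
combined = majority_vote(["zp", "no_zp", "zp"])  # e.g. three base classifiers
```

With three voters and two classes a strict majority always exists, so no tie-breaking rule is needed in this setting.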
  • 17. Results: classifier results
                             has ZP                  not ZP
      Classifier   Acc.      P      R      F1        P      R      F1
      SMO          0.739   0.684  0.889  0.773     0.841  0.590  0.694
      JRip         0.733   0.709  0.793  0.748     0.765  0.675  0.717
      J48          0.720   0.698  0.777  0.735     0.749  0.663  0.703
      Vote         0.733   0.705  0.802  0.750     0.770  0.665  0.713
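As a reading aid for the results table, the sketch below recomputes F1 from the reported precision and recall, and recomputes accuracy as the mean of the two per-class recalls. The second step holds only because the data set is balanced (781 verbs per class). The dictionary layout and function names are illustrative assumptions; the numbers are copied from the table, and small differences are expected because the table values are rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recall_pos, recall_neg):
    """Accuracy when both classes have the same number of instances."""
    return (recall_pos + recall_neg) / 2


# classifier -> (accuracy, (P, R, F1) for "has ZP", (P, R, F1) for "not ZP")
reported = {
    "SMO": (0.739, (0.684, 0.889, 0.773), (0.841, 0.590, 0.694)),
    "JRip": (0.733, (0.709, 0.793, 0.748), (0.765, 0.675, 0.717)),
    "J48": (0.720, (0.698, 0.777, 0.735), (0.749, 0.663, 0.703)),
    "Vote": (0.733, (0.705, 0.802, 0.750), (0.770, 0.665, 0.713)),
}

# Largest gap between recomputed and reported values, per metric.
max_f1_gap = 0.0
max_acc_gap = 0.0
for name, (acc, has_zp, not_zp) in reported.items():
    for precision, recall, f1 in (has_zp, not_zp):
        max_f1_gap = max(max_f1_gap, abs(f1_score(precision, recall) - f1))
    max_acc_gap = max(max_acc_gap, abs(balanced_accuracy(has_zp[1], not_zp[1]) - acc))
```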
  • 18. Results: attribute evaluation
      Attribute        ChiSquare   InfoGain
      Mood               402.546      0.206
      'Se'                25.719      0.012
      Person              21.217      0.010
      Impersonality       12.092      0.007
      Tense                9.371      0.004
      Type                 2.577      0.001
      Number               0.354       1E-4
      Gender                7E-4       3E-7
      Clitic                   0          0
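The two scores in the attribute evaluation table can be illustrated with a small, self-contained sketch: information gain (in bits) and the chi-square statistic of a nominal feature against the class label. The functions and the toy data are illustrative assumptions and do not reproduce Weka's evaluators or the corpus figures.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """Reduction in class entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def chi_square(feature_values, labels):
    """Pearson chi-square statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    value_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            expected = value_total * label_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): eight verbs, one informative and one constant feature.
labels = ["zp", "zp", "zp", "no", "no", "no", "no", "zp"]
mood = ["ind", "ind", "ind", "ind", "subj", "subj", "subj", "subj"]
constant = ["no"] * 8

mood_scores = (chi_square(mood, labels), info_gain(mood, labels))
constant_scores = (chi_square(constant, labels), info_gain(constant, labels))
```

A feature that takes the same value for every instance scores zero on both measures, since it cannot separate the classes.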
  • 19. Results: error analysis
      Ambiguity:
        E greu fără bani. ("It is hard without money.")
        E greu de scris o carte. ("It is hard to write a book.")
        Se întunecă la ora cinci. ("It gets dark at five o'clock.")
        El se întunecă la față. ("His face darkens.")
      Parser errors
  • 21. Conclusions
      Summary
        RoZP, a corpus with manually annotated ZPs
        Identification of over 70% of ZPs using ML methods
      Outlook
        Improve the identification accuracy
          other features, e.g. number of verbs in the sentence
          syntactic information?
        Resolve the identified ZPs
  • 23. Thank you! Questions?