Reuse of Free Resources in Machine Translation between Nynorsk and Bokmål

Kevin Unhammer (1), Trond Trosterud (2)

(1) Department of Linguistics, University of Bergen, Bergen, Norway
    kun041@student.uib.no
(2) Department of Linguistics, University of Tromsø, Tromsø, Norway
    trond.trosterud@uit.no

2nd November 2009

Outline of talk

 Introduction
     Nynorsk and Bokmål
     Norwegian language resources

 The Apertium architecture and nn-nb pipeline
     Constraint Grammar

 Developing apertium-nn-nb
     Disambiguation and CG conversion
     Translation dictionary
     Structural transfer

 Evaluation
     Coverage
     WER and BLEU

 Future work

The Norwegian language(s)

    A lot of dialectal variation
    Two written variants:
         Bokmål
               Based on Danish and the Dano-Norwegian koiné of the major cities in the 1800s
         Nynorsk
               Based on the spoken dialects of Norway, standardised by linguist Ivar Aasen in the late 1800s
    Nynorsk used by around 12% of the population
    “Language-friendly” politics: Both standards are officially recognised and both are taught in school from age 12 and up
    Both Nynorsk and Bokmål allow quite a lot of variation, with some choices being considered more “radical” or “conservative” than others

Free, Open Source Norwegian language resources

    Norsk Ordbank
         full form dictionaries for Nynorsk and Bokmål; 106,789 and 142,899 lemmas, respectively
    The Oslo–Bergen tagger
         Constraint Grammar morphological disambiguation
         Constraint Grammar syntactic dependency parser
         Various other modules (compounding, NER, ...)
    No freely available bilingual dictionary between Nynorsk and Bokmål, until now...

The apertium-nn-nb pipeline

    Morphological analysis
         lttoolbox: XML format, compiles to very fast FSTs
         one XML dictionary gives both analysis and generation
    CG pre-disambiguation
    Statistical disambiguation (HMM)
    Bilingual dictionary for lexical transfer
    Shallow syntactic transfer rules
         Local re-ordering (det noun → noun det)
         Insertions, deletions and substitutions of lexical units (and chunks, but we don’t use them yet)
    Morphological generation (again with lttoolbox)
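
The point that one dictionary serves both analysis and generation can be illustrated with a small sketch. This is not how lttoolbox works internally (it compiles XML dictionaries to finite-state transducers); plain Python dicts stand in for the transducers, and the entries and the dotted tag strings are made up for illustration.

```python
from collections import defaultdict

# Toy full-form entries: (surface form, lemma, tags).
# The entries and the dotted tag notation are illustrative assumptions.
ENTRIES = [
    ("bok", "bok", "n.f.sg.ind"),
    ("boka", "bok", "n.f.sg.def"),
    ("les", "lese", "vblex.pres"),
    ("les", "lese", "vblex.imp"),
]


def build_tables(entries):
    """Build an analysis table and a generation table from the same entries."""
    analyse = defaultdict(list)  # surface form -> list of (lemma, tags)
    generate = {}                # (lemma, tags) -> surface form
    for surface, lemma, tags in entries:
        analyse[surface].append((lemma, tags))
        generate[(lemma, tags)] = surface
    return analyse, generate


analyse, generate = build_tables(ENTRIES)
print(analyse["les"])                    # ambiguous: two analyses
print(generate[("bok", "n.f.sg.def")])   # a single surface form
```

Analysis is one-to-many (the ambiguity the later disambiguation steps have to resolve), while generation maps each lemma and tag combination back to one form.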

Constraint Grammar

    Rules work on ambiguous input and may SELECT one analysis over all others, or REMOVE one analysis from the set of analyses, or ADD a new tag, etc.
    Often thousands of short, hand-written rules
    Rules apply based on “context conditions”:
         (-1* noun) means “there must be a word with a noun analysis somewhere to the left”
         (1C* verb) means “there must be a word disambiguated to a verb somewhere to the right”
         (1* verb LINK 2 noun) means “there must be a verb-analysis to the right, and a noun-analysis two positions to the right of that”
         (1* verb BARRIER noun) means “there must be a verb-analysis to the right, and no noun-analyses before that”
         There are many other possibilities...

Example of a CG rule

If input contains the word ‘walks’ analysed as either verb 3sg present or noun pl, the following rule

SELECT (verb 3sg present) IF
   (-1*C 3sg BARRIER verb)
   (NOT -1 det);

would choose the verb analysis if there is a disambiguated word, analysed as third singular, to the left, with no verb between the two; and there is no determiner immediately to the left.
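
To make the rule's behaviour concrete, here is a minimal Python sketch that simulates this one rule on toy input. It is not a CG implementation: the data layout (a sentence as a list of cohorts, each a list of tag sets) is an assumption, "disambiguated" is approximated as "every reading carries the tag", and the scan checks the target before the barrier at each position, which is a simplification.

```python
TARGET = frozenset({"verb", "3sg", "present"})


def careful_3sg_to_left(cohorts, i):
    """Approximates (-1*C 3sg BARRIER verb): scan leftwards from position i-1."""
    for j in range(i - 1, -1, -1):
        readings = cohorts[j]
        if all("3sg" in r for r in readings):   # careful: every reading is 3sg
            return True
        if any("verb" in r for r in readings):  # barrier: stop the scan
            return False
    return False


def no_det_immediately_left(cohorts, i):
    """Approximates (NOT -1 det); holds trivially at the start of the sentence."""
    return i == 0 or not any("det" in r for r in cohorts[i - 1])


def apply_select(cohorts):
    """Keep only the verb-3sg-present readings wherever both conditions hold."""
    result = []
    for i, readings in enumerate(cohorts):
        matching = [r for r in readings if TARGET <= r]
        if (matching and len(matching) < len(readings)
                and careful_3sg_to_left(cohorts, i)
                and no_det_immediately_left(cohorts, i)):
            result.append(matching)
        else:
            result.append(list(readings))
    return result


WALKS = [frozenset({"verb", "3sg", "present"}), frozenset({"noun", "pl"})]
she_walks = [[frozenset({"pron", "3sg"})], WALKS]
the_walks = [[frozenset({"det"})], WALKS]

print(apply_select(she_walks)[1])  # only the verb reading remains
print(apply_select(the_walks)[1])  # both readings remain
```

With ‘she walks’ the pronoun is unambiguously third singular and is not a determiner, so the noun reading of ‘walks’ is discarded; with ‘the walks’ neither condition holds and the ambiguity is left for later rules or the statistical tagger.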

Development of apertium-nn-nb

    Most of the work done within 12 weeks (Google Summer of Code 2009)
    Helped by high quality free resources
        Monolingual dictionaries: Norsk Ordbank converted from full form listing to lttoolbox format
        CG: Oslo–Bergen tagger converted to use Apertium tag scheme

Disambiguation and CG conversion

    Bigram HMMs trained on Wikipedia text (Baum-Welch, 8 iterations)
    Conversion of CG tag set mostly done within a few days
    Errors fixed in CG reported back to Oslo–Bergen tagger team, win-win.
    However: the Oslo–Bergen tagger was designed for corpus annotation and lexicography
        For the linguist, recall is more important than precision
        For (our) MT, only one analysis matters
        So we need to take more chances with our rules
        Also, we get some MT-specific rules (like CG-based lexical selection)

Finding word translations semi-automatically

    Method 1: Exact matches where the morphology is the same
        If lemma and morphological possibilities are the same, assume we have a translation
             ‘snøvle’, verb, pres/pass/imp/pret/inf... exists in both monolingual dictionaries; add it as a translation
        36,000 entries (although quite a lot are low-frequency / loan-words)
        Risk of “radical forms”
    Method 2: Predictable substring-translations
        find Bokmål entries without translations
        run string replacements for typical differences (-hjem-→-heim-, -lig→-leg, ...)
        check if the altered entries are in the Nynorsk analyser
        ... and vice versa
        Main run gave 2500 good entries
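
Methods 1 and 2 can be sketched roughly as follows. This is an illustration under assumptions, not the scripts actually used: each monolingual dictionary is modelled as a dict from (lemma, part of speech) to a set of morphological tags, the lexicon contents are toy data, and only the two replacements named on the slide are included.

```python
import re

# Toy monolingual lexicons: (lemma, pos) -> set of morphological possibilities.
NB_LEX = {
    ("snøvle", "vblex"): frozenset({"pres", "pass", "imp", "pret", "inf"}),
    ("hjemland", "n"): frozenset({"sg", "pl"}),
    ("vanlig", "adj"): frozenset({"pst"}),
}
NN_LEX = {
    ("snøvle", "vblex"): frozenset({"pres", "pass", "imp", "pret", "inf"}),
    ("heimland", "n"): frozenset({"sg", "pl"}),
    ("vanleg", "adj"): frozenset({"pst"}),
}

# Typical Bokmål -> Nynorsk substring differences (the two from the slide).
REPLACEMENTS = [(r"hjem", "heim"), (r"lig$", "leg")]


def exact_matches(nb_lex, nn_lex):
    """Method 1: same lemma, same pos, same morphological possibilities."""
    shared = nb_lex.keys() & nn_lex.keys()
    return sorted(key for key in shared if nb_lex[key] == nn_lex[key])


def substring_candidates(nb_lex, nn_lex, translated):
    """Method 2: rewrite untranslated Bokmål lemmas, keep hits in the nn lexicon."""
    pairs = []
    for lemma, pos in sorted(nb_lex):
        if (lemma, pos) in translated:
            continue
        candidate = lemma
        for pattern, replacement in REPLACEMENTS:
            candidate = re.sub(pattern, replacement, candidate)
        if candidate != lemma and (candidate, pos) in nn_lex:
            pairs.append((lemma, candidate, pos))
    return pairs


method1 = exact_matches(NB_LEX, NN_LEX)
method2 = substring_candidates(NB_LEX, NN_LEX, set(method1))
print(method1)
print(method2)
```

Both functions only propose candidates; as the slide notes, identical entries may be “radical forms”, so the output still needs manual checking. The reverse direction would use Nynorsk to Bokmål replacements in the same way.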

Expanding the translational dictionary using alignments

    Method 3: Automatic word alignments
        Corpora:
             KDE4 software translations (400,000 words)
             government web pages (50,000 words, crawled with bitextor)
        po-terminology (only on KDE4)
             gave some hundreds of new terms
        morphological tagging → Giza++ → ReTraTos
             about 3500 entries
             Lots of cleaning needed
    Method 4: User-contributed entries (via Wikipedia)

Structural transfer

     Finite passive verbs

 (1) a. Bevilgning gis            oftest  ikke
        grant.IND  give.PRES.PASS usually not
     b. Løyve     blir oftast  ikkje gjeve
        grant.IND AUX  usually not   give.PART
        ‘Grants are usually not given’
     c. Om høsten   fylles         fjorden   med  sild
        In fall.DEF fill.PRES.PASS fjord.DEF with herring
     d. Om hausten  blir fjorden   fylt      med  sild
        In fall.DEF AUX  fjord.DEF fill.PART with herring
        ‘In fall, the fjord is filled with herring’

Structural transfer

      Genitive noun phrases

 (2) a. forfatterens   siste utgivelse
        author.DEF.GEN last  publication.IND
     b. den siste utgjevinga      til forfattaren
        the last  publication.DEF of  author.DEF
        ‘the author’s last publication’
     c. mitt nye luftputefartøy
        my   new hovercraft.IND
     d. det nye luftputefartøyet mitt
        the new hovercraft.DEF   mine
        ‘my new hovercraft’
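
The genitive rewrite in (2a) to (2b) could be sketched as a pattern over tagged lexical units. The following is a toy illustration, not an actual Apertium transfer rule: the (lemma, tags) representation, the tag names and the function name are assumptions, lexical transfer to Nynorsk lemmas is taken as already done, and surface forms are left to morphological generation.

```python
def transfer_genitive(units):
    """Rewrite [noun.def.gen adj noun.ind] as [det adj noun.def 'til' noun.def]."""
    out = []
    i = 0
    while i < len(units):
        window = units[i:i + 3]
        if (len(window) == 3
                and {"n", "def", "gen"} <= window[0][1]
                and "adj" in window[1][1]
                and {"n", "ind"} <= window[2][1]):
            possessor, adjective, head = window
            # The inserted determiner copies the head noun's gender tag.
            gender = head[1] & {"m", "f", "nt"}
            out += [
                ("den", frozenset({"det"}) | gender),
                adjective,
                (head[0], (head[1] - {"ind"}) | {"def"}),   # head becomes definite
                ("til", frozenset({"pr"})),
                (possessor[0], possessor[1] - {"gen"}),     # genitive is dropped
            ]
            i += 3
        else:
            out.append(units[i])
            i += 1
    return out


phrase = [
    ("forfattar", frozenset({"n", "m", "def", "gen"})),
    ("sist", frozenset({"adj"})),
    ("utgjeving", frozenset({"n", "f", "ind"})),
]
result = transfer_genitive(phrase)
print([lemma for lemma, _ in result])
```

The rule combines a local re-ordering with two insertions (the determiner and the preposition) and two tag substitutions, which are the operation types listed for shallow transfer on the pipeline slide.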

Evaluation

    Coverage
    WER
    BLEU

Coverage

   Naïve coverage on Nynorsk Wikipedia: 89.6%
   Naïve coverage on Bokmål Wikipedia: 88.2%
   Coverage seems to be the most important issue:
   Not only is every 10th word untranslated, but we get disambiguation problems and transfer problems in the rest of the sentence
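
As a rough illustration of the coverage figures, naïve coverage is usually taken to be the share of tokens that receive at least one analysis, whether or not that analysis is correct. The sketch below assumes that definition; the whitespace-style token list and the set of known forms are stand-ins for a real tokeniser and the compiled analyser.

```python
def naive_coverage(tokens, known_forms):
    """Fraction of tokens with at least one analysis (correct or not)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return known / len(tokens)


# Toy data: the set stands in for the forms the analyser recognises.
KNOWN_FORMS = {"eg", "les", "ei", "bok", "om"}
tokens = ["Eg", "les", "ei", "bok", "om", "luftputefartøy"]

coverage = naive_coverage(tokens, KNOWN_FORMS)
print(f"{coverage:.1%}")
```

Here five of the six tokens are known, so the unknown compound alone lowers coverage to about 83%.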
WER and BLEU scores in the nb→nn direction

     Word Error Rate, BLEU and Unknown Word Rate on text
     from government web pages

                  BLEU    WER_O          WER_W          UWR
     Apertium     0.74    32.5 (36.1)    17.7 (50.5)    9.5
     Nyno         0.85    29.1 (34.6)    13.3 (47.3)    0.8

Table: BLEU score (two reference translations) and WER (for the
Original and Wikipedia references). Numbers in parentheses give the
percentage of unknown words which were free-rides.

     WER on post-edited Apertium MT output on a Wikipedia
     article, however, was 10.71% (64.93% free-rides)
     Coverage seems to be the major difference.
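For readers unfamiliar with the metric, the sketch below shows one common way to compute Word Error Rate: the word-level edit distance between MT output and a reference, divided by the reference length. It is an illustration under that assumption, not the evaluation script behind the table; the function name and the example sentences (borrowed from the multi-word example in the Future work slide) are chosen only for demonstration.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # reference word missing from hypothesis
                cur[j - 1] + 1,                        # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


# Illustrative sentences only, not evaluation data.
mt_output = "Han anbefalte meg å gå heim"
reference = "Han rådte meg til å gå heim"
wer = word_error_rate(mt_output, reference)
print(f"WER: {wer:.1%}")
```

Here one substitution and one missing word against a seven-word reference give a WER of 2/7.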
Future work

    Compounding

    (3)   a.   bilkirkegård → bilkyrkjegard
               car.cemetery → car.cemetery
          b.   postordrelager     → #postordrelagar
               mail.order.storage → mail.order.creator

    Multi-word expressions

    (4)   a.   Han anbefalte   meg å   gå hjem
               he  recommended me  INF go home
          b.   Han rådte     meg til å   gå heim
               he  counseled me  to  INF go home
               ‘He recommended that I go home’

    Expanding the Scandinavian language group
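The compounding problem in (3) can be illustrated with a toy sketch: split a compound into known parts, translate each part, and join the results. This is not the Apertium implementation. The lexicon, the function names and the assumption that the intended Nynorsk form of `lager` is `lager` are all hypothetical. The point is only that an ambiguous last part produces both a good candidate and the bad one marked with # above.

```python
from itertools import product


def segmentations(word, lexicon):
    """Yield every way to split `word` into parts found in `lexicon`."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in lexicon:
            for rest in segmentations(word[end:], lexicon):
                yield [head] + rest


# Toy Bokmål -> Nynorsk entries built from the slide examples.
# "lager" -> "lager" is an assumption; "lagar" is the form marked wrong.
TOY_BIDIX = {
    "bil": ["bil"],
    "kirkegård": ["kyrkjegard"],
    "post": ["post"],
    "ordre": ["ordre"],
    "lager": ["lager", "lagar"],
}


def translate_compound(word, bidix):
    """Return all part-by-part translations of a compound, sorted."""
    candidates = set()
    for parts in segmentations(word, bidix):
        for choice in product(*(bidix[part] for part in parts)):
            candidates.add("".join(choice))
    return sorted(candidates)


simple = translate_compound("bilkirkegård", TOY_BIDIX)
ambiguous = translate_compound("postordrelager", TOY_BIDIX)
print(simple)
print(ambiguous)
```

The second call returns two candidates, so a real system would still need disambiguation or lexical selection to reject `postordrelagar`.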
Thanks for listening!
Licences

This presentation may be distributed under the terms of the
GNU GPL, GNU FDL and CC-BY-SA licences.

     GNU GPL v. 3.0
     http://www.gnu.org/licenses/gpl.html
     GNU FDL v. 1.2
     http://www.gnu.org/licenses/gfdl.html
     CC-BY-SA v. 3.0
     http://creativecommons.org/licenses/by-sa/3.0/

Reuse of Free Resources in Machine Translation between Norwegian Nynorsk and Bokmål

  • 1. Reuse of Free Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Reuse of Free Resources in Machine Trond Trosterud Translation between Nynorsk and Bokmål Introduction Nynorsk and Bokmål Norwegian language resources The Apertium architecture and Kevin Unhammer1 Trond Trosterud2 nn-nb pipeline Constraint Grammar 1 Developing Department of Linguistics apertium-nn-nb University of Bergen Disambiguation and CG conversion Bergen, Norway Translation dictionary kun041@student.uib.no Structural transfer 2 Evaluation Department of Linguistics Coverage University of Tromsø WER and B LEU Tromsø, Norway Future work trond.trosterud@uit.no 2nd November 2009
  • 2. Reuse of Free Outline of talk Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Introduction Trond Trosterud Nynorsk and Bokmål Introduction Norwegian language resources Nynorsk and Bokmål Norwegian language resources The Apertium The Apertium architecture and nn-nb pipeline architecture and nn-nb pipeline Constraint Grammar Constraint Grammar Developing Developing apertium-nn-nb apertium-nn-nb Disambiguation and CG Disambiguation and CG conversion conversion Translation dictionary Translation dictionary Structural transfer Structural transfer Evaluation Coverage WER and B LEU Evaluation Future work Coverage WER and B LEU Future work
  • 3. Reuse of Free The Norwegian language(s) Resources in Nynorsk↔Bokmål MT Kevin Unhammer, A lot of dialectal variation Trond Trosterud Two written variants: Introduction Nynorsk and Bokmål Bokmål Norwegian language resources Based on Danish and the Dano-Norwegian koiné of the The Apertium architecture and major cities in the 1800’s nn-nb pipeline Nynorsk Constraint Grammar Developing Based on the spoken dialects of Norway, standardised by apertium-nn-nb linguist Ivar Aasen in the late 1800’s Disambiguation and CG conversion Nynorsk used by around 12% of the population Translation dictionary Structural transfer “Language-friendly” politics: Both standards are officially Evaluation Coverage recognised and both are taught in school from age 12 and WER and B LEU up Future work Both Nynorsk and Bokmål allow quite a lot of variation, with some choices being considered more “radical” or “conservative” than others
  • 4. Reuse of Free Free, Open Source Norwegian language Resources in Nynorsk↔Bokmål MT resources Kevin Unhammer, Trond Trosterud Introduction Nynorsk and Bokmål Norwegian language resources Norsk Ordbank The Apertium architecture and full form dictionaries for Nynorsk and Bokmål; 106,789 and nn-nb pipeline Constraint Grammar 142,899 lemmas, respectively Developing The Oslo–Bergen tagger apertium-nn-nb Disambiguation and CG Constraint Grammar morphological disambiguation conversion Translation dictionary Constraint Grammar syntactic dependency parser Structural transfer Various other modules (compounding, NER, . . . ) Evaluation Coverage No freely available bilingual dictionary between Nynorsk WER and B LEU Future work and Bokmål, until now. . .
  • 5. Reuse of Free The apertium-nn-nb pipeline Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Trond Trosterud Morphological analysis Introduction Nynorsk and Bokmål lttoolbox: XML format, compiles to very fast FSTs Norwegian language resources one XML dictionary gives both analysis and generation The Apertium architecture and nn-nb pipeline CG pre-disambiguation Constraint Grammar Statistical disambiguation (HMM) Developing apertium-nn-nb Bilingual dictionary for lexical transfer Disambiguation and CG conversion Translation dictionary Shallow syntactic transfer rules Structural transfer Local re-ordering (det noun → noun det) Evaluation Coverage Insertions, deletions and substitutions of lexical units (and WER and B LEU chunks, but we don’t use them yet) Future work Morphological generation (again with lttoolbox)
  • 6. Reuse of Free Constraint Grammar Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Rules work on ambiguous input and may SELECT one Trond Trosterud analysis over all others, or REMOVE one analysis from the Introduction set of analyses, or ADD a new tag, etc. Nynorsk and Bokmål Norwegian language resources Often thousands of short, hand-written rules The Apertium architecture and Rules apply based on “context conditions”: nn-nb pipeline Constraint Grammar (-1* noun) means “there must be word with a noun Developing analysis somewhere to the left” apertium-nn-nb Disambiguation and CG (1C* verb) means “there must be a word disambiguated conversion Translation dictionary to a verb somewhere to the right” Structural transfer (1* verb LINK 2 noun) means “there must be a Evaluation verb-analysis to the right, and a noun-analysis two Coverage WER and B LEU positions to the right of that” Future work (1* verb BARRIER noun) means “there must be a verb-analysis to the right, and no noun-analyses before that” There are many other possibilities. . .
  • 7. Reuse of Free Example of a CG rule Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Trond Trosterud Introduction If input contains the word ‘walks’ analysed as either Nynorsk and Bokmål Norwegian language resources verb 3sg present or noun pl, the following rule The Apertium architecture and nn-nb pipeline SELECT (verb 3sg present) IF Constraint Grammar Developing (-1*C 3sg BARRIER verb) apertium-nn-nb Disambiguation and CG (NOT -1 det); conversion Translation dictionary Structural transfer would choose the verb analysis if there is a disambiguated Evaluation Coverage word, analysed as third singular, to the left, with no verb WER and B LEU between the two; and there is no determiner to the left Future work
  • 8. Reuse of Free Development of apertium-nn-nb Resources in Nynorsk↔Bokmål MT Kevin Unhammer, Trond Trosterud Introduction Nynorsk and Bokmål Norwegian language resources The Apertium Most of the work done within 12 weeks (Google Summer of architecture and nn-nb pipeline Code 2009) Constraint Grammar Helped by high quality free resources Developing apertium-nn-nb Monolingual dictionaries: Norsk Ordbank converted from Disambiguation and CG conversion full form listing to lttoolbox format Translation dictionary CG: Oslo–Bergen tagger converted to use Apertium tag Structural transfer Evaluation scheme Coverage WER and B LEU Future work
  • 9. Disambiguation and CG conversion
      • Bigram HMMs trained on Wikipedia text (Baum-Welch, 8 iterations)
      • Conversion of CG tag set mostly done within a few days
      • Errors fixed in CG reported back to Oslo–Bergen tagger team, win-win.
      • However: the Oslo–Bergen tagger was designed for corpus annotation and lexicography
          • For the linguist, recall is more important than precision
          • For (our) MT, only one analysis matters
          • So we need to take more chances with our rules
      • Also, we get some MT-specific rules (like CG-based lexical selection)
  • 11. Finding word translations semi-automatically
      • Method 1: Exact matches where the morphology is the same
          • If lemma and morphological possibilities are the same, assume we have a translation
          • ‘snøvle’, verb, pres/pass/imp/pret/inf... exists in both monolingual dictionaries; add it as a translation
          • 36,000 entries (although quite a lot are low-frequency / loan-words)
          • Risk of “radical forms”
      • Method 2: Predictable substring-translations
          • find Bokmål entries without translations
          • run string replacements for typical differences (-hjem-→-heim-, -lig→-leg, ...)
          • check if the altered entries are in the Nynorsk analyser
          • ...and vice versa
          • Main run gave 2500 good entries
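Methods 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicons, the tag names, the two replacement patterns, and the function names `exact_matches` and `substring_matches` are invented for the example and do not reflect the real Norsk Ordbank or lttoolbox data formats. Only the Bokmål to Nynorsk direction is shown.

```python
import re

# Hypothetical toy lexicons: (lemma, part of speech) -> set of inflection tags.
NB_LEXICON = {
    ("snøvle", "verb"): frozenset({"inf", "pres", "pret", "imp", "pass"}),
    ("hjemland", "noun"): frozenset({"sg", "pl"}),
    ("farlig", "adj"): frozenset({"pos", "comp", "sup"}),
}
NN_LEXICON = {
    ("snøvle", "verb"): frozenset({"inf", "pres", "pret", "imp", "pass"}),
    ("heimland", "noun"): frozenset({"sg", "pl"}),
    ("farleg", "adj"): frozenset({"pos", "comp", "sup"}),
}

# Typical Bokmål -> Nynorsk differences, written as regex replacements.
REPLACEMENTS = [(r"hjem", "heim"), (r"lig$", "leg")]


def exact_matches(nb, nn):
    """Method 1: same lemma, same part of speech, same morphology."""
    return sorted(
        (lemma, lemma, pos)
        for (lemma, pos), forms in nb.items()
        if nn.get((lemma, pos)) == forms
    )


def substring_matches(nb, nn, already_translated):
    """Method 2: rewrite untranslated Bokmål lemmas, keep hits found in nn."""
    pairs = []
    for (lemma, pos) in nb:
        if lemma in already_translated:
            continue
        candidate = lemma
        for pattern, replacement in REPLACEMENTS:
            candidate = re.sub(pattern, replacement, candidate)
        # Only accept the guess if the Nynorsk lexicon actually has it.
        if candidate != lemma and (candidate, pos) in nn:
            pairs.append((lemma, candidate, pos))
    return sorted(pairs)


method1 = exact_matches(NB_LEXICON, NN_LEXICON)
translated = {nb_lemma for nb_lemma, _, _ in method1}
method2 = substring_matches(NB_LEXICON, NN_LEXICON, translated)
```

Checking each rewritten candidate against the target-language lexicon is what keeps the string replacements from producing non-words.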
  • 13. Expanding the translational dictionary using alignments
      • Method 3: Automatic word alignments
          • Corpora:
              • KDE4 software translations (400,000 words)
              • government web pages (50,000 words, crawled with bitextor)
          • po-terminology (only on KDE4) gave some hundreds of new terms
          • morphological tagging → Giza++ → ReTraTos: about 3500 entries
          • Lots of cleaning needed
      • Method 4: User-contributed entries (via Wikipedia)
  • 14. Structural transfer: Finite passive verbs
      (1) a. Bevilgning gis oftest ikke
             grant.IND give.PRES.PASS usually not
          b. Løyve blir oftast ikkje gjeve
             grant.IND AUX usually not give.PART
             ‘Grants are usually not given’
          c. Om høsten fylles fjorden med sild
             In fall.DEF fill.PRES.PASS fjord.DEF with herring
          d. Om hausten blir fjorden fylt med sild
             In fall.DEF AUX fjord.DEF fill.PART with herring
             ‘In fall, the fjord is filled with herring’
  • 15. Structural transfer: Genitive noun phrases
      (2) a. forfatterens siste utgivelse
             author.DEF.GEN last publication.IND
          b. den siste utgjevinga til forfattaren
             the last publication.DEF of author.DEF
             ‘the author’s last publication’
          c. mitt nye luftputefartøy
             my new hovercraft.IND
          d. det nye luftputefartøyet mitt
             the new hovercraft.DEF mine
             ‘my new hovercraft’
  • 16. Evaluation
      • Coverage
      • WER
      • BLEU
  • 17. Coverage
      • Naïve coverage on Nynorsk Wikipedia: 89.6%
      • Naïve coverage on Bokmål Wikipedia: 88.2%
      • Coverage seems to be the most important issue: not only is every 10th word untranslated, but we get disambiguation problems and transfer problems in the rest of the sentence
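Naïve coverage is commonly taken to be the share of running tokens for which the analyser returns at least one analysis, whether or not that analysis is correct. A minimal Python sketch, assuming a plain set of known word forms in place of a real analyser (the `known_forms` set and the `naive_coverage` function are hypothetical):

```python
def naive_coverage(tokens, known_forms):
    """Fraction of tokens that have at least one analysis."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token.lower() in known_forms)
    return known / len(tokens)


# Toy analyser vocabulary that happens to lack "sild".
known_forms = {"om", "hausten", "blir", "fjorden", "fylt", "med"}
tokens = "Om hausten blir fjorden fylt med sild".split()

coverage = naive_coverage(tokens, known_forms)  # 6 of 7 tokens are known
```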
  • 18. WER and BLEU scores in the nb→nn direction
      • Word Error Rate, BLEU and Unknown Word Rate on text from government web pages

                      BLEU    WER_O          WER_W          UWR
            Apertium  0.74    32.5 (36.1)    17.7 (50.5)    9.5
            Nyno      0.85    29.1 (34.6)    13.3 (47.3)    0.8

        Table: BLEU score (two reference translations) and WER (for the Original and Wikipedia references). Numbers in parentheses give percentage of unknown words which were free-rides.
      • WER on post-edited Apertium MT output on a Wikipedia article, however, was 10.71% (64.93% free-rides)
      • Coverage seems like the major difference.
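For readers unfamiliar with the metric, word error rate is usually computed as the word-level edit distance between the MT output and a reference, divided by the reference length. The Python sketch below shows that standard definition; it is not necessarily the exact variant or tooling behind the figures in the table, and the example sentences are made up.

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                         # reference word missing from hypothesis
                cur[j - 1] + 1,                      # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "Om hausten blir fjorden fylt med sild"
hypothesis = "Om høsten blir fjorden fylt med sild"  # one untranslated word
score = wer(hypothesis, reference)  # one substitution over seven words
```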
  • 21. Future work
      • Compounding
          (3) a. bilkirkegård → bilkyrkjegard
                 car.cemetery → car.cemetery
              b. postordrelager → #postordrelagar
                 mail.order.storage → mail.order.creator
      • Multi-word expressions
          (4) a. Han anbefalte meg å gå hjem
                 he recommended me INF go home
              b. Han rådte meg til å gå heim
                 he counseled me to INF go home
                 ‘He recommended that I go home’
      • Expanding the Scandinavian language group
  • 22. Thanks for listening!
  • 23. Licences
      • This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.
          • GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html
          • GNU FDL v. 1.2: http://www.gnu.org/licenses/gfdl.html
          • CC-BY-SA v. 3.0: http://creativecommons.org/licenses/by-sa/3.0/