CG in Apertium

 Kevin Brubeck Unhammer
University of Bergen, Norway



      14th May 2009
What is Apertium?




      An Open Source Machine Translation platform
          both source code and data have Free / Open Source licences
      Modular
          stand-alone programs communicate through standard Unix pipes
          particular language pairs need not use all modules!
      Developed by universities, companies and independent
      (volunteer and paid) developers
History of Apertium




       Initially developed for closely related languages (Portuguese ↔
       Spanish ↔ Catalan) by the Transducens group at the Universitat
       d’Alacant
       Later extended to allow more distant language pairs
       Now also involves various companies in Spain, the universities of
       Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.
Language pairs



      “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔
      Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔
      Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔
      Spanish, English → Esperanto, Spanish ↔ Galician, French ↔
      Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ←
      Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician,
      Basque → Spanish
      Other pairs being developed (Spanish ↔ Asturian, Icelandic ↔
      English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, . . . )
[Figure; labels: Marginalised, Few free resources, Copious free resources]
Modules


     Morphological dictionaries
          lttoolbox: XML format, compiles to FSTs
                Fast (seems to perform 5x faster than SFST)
          one dictionary gives both analysis and generation
     CG pre-disambiguation
     Statistical disambiguation (HMM)
     Bilingual dictionary for lexical transfer
     Shallow syntactic transfer rules
          Local re-ordering (nom adj → adj nom)
          Chunking (adj adj nom → SN[adj adj nom])
          Insertions, deletions and substitutions of lexical units and chunks
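
The two transfer operations named above can be illustrated with a small Python sketch. It is only an illustration on bare tag sequences: real Apertium transfer rules are written in their own rule format, and the function names here are hypothetical. The chunking function also generalises the slide's pattern to "zero or more adj followed by nom", which is an assumption.

```python
def reorder_nom_adj(tags):
    """Local re-ordering: swap every 'nom adj' pair into 'adj nom'."""
    out = list(tags)
    i = 0
    while i < len(out) - 1:
        if out[i] == "nom" and out[i + 1] == "adj":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return out


def chunk_noun_phrases(tags):
    """Chunking: wrap a run of adjectives plus a noun into one SN chunk."""
    out, i = [], 0
    while i < len(tags):
        j = i
        while j < len(tags) and tags[j] == "adj":
            j += 1
        if j < len(tags) and tags[j] == "nom":
            out.append("SN[" + " ".join(tags[i:j + 1]) + "]")
            i = j + 1
        else:
            out.append(tags[i])  # not part of a noun phrase, pass through
            i += 1
    return out


print(reorder_nom_adj(["det", "nom", "adj"]))
print(chunk_noun_phrases(["adj", "adj", "nom", "vblex"]))
```

Operating on tag sequences rather than full lexical units keeps the sketch short; a real rule would also carry lemmas and agreement tags along.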
A sketch of the architecture
The Apertium Stream Format

      Simple example from Norwegian Bokmål
          “lese en” (‘read a/one’)
          Morphological analysis gives:
          ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$
          After CG:
          ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$
      Formatting information (like HTML tags) is saved in superblanks
      making document and web translation easy
          original:
          Kva er det du <em>seier</em>?
          deformatted:
          Kva er det du[ <em>]seier[</em>]?
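
As a rough illustration of the stream format, the following Python sketch splits a stream into lexical units and wraps HTML tags in superblanks. It is a simplification under stated assumptions: it ignores the escaping of special characters that the real tools handle, and the names `parse_stream` and `deformat_html` are hypothetical, not part of Apertium.

```python
import re
from typing import List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tags: List[str]


class LexicalUnit(NamedTuple):
    surface: str
    analyses: List[Analysis]


def parse_stream(text: str) -> List[LexicalUnit]:
    """Split '^surface/analysis/analysis$' units into structured records."""
    units = []
    for body in re.findall(r"\^(.*?)\$", text):
        surface, *readings = body.split("/")
        analyses = [
            Analysis(r.split("<", 1)[0], re.findall(r"<([^>]+)>", r))
            for r in readings
        ]
        units.append(LexicalUnit(surface, analyses))
    return units


def deformat_html(text: str) -> str:
    """Wrap each tag, plus adjacent whitespace, in [ ] as a superblank."""
    return re.sub(r"\s*<[^>]+>\s*", r"[\g<0>]", text)


stream = ("^lese/lese<vblex><inf>$ "
          "^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$")
for unit in parse_stream(stream):
    print(unit.surface, [a.lemma for a in unit.analyses])
print(deformat_html("Kva er det du <em>seier</em>?"))
```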
Visualising the process helps find errors
The platform provides


       a language-independent machine translation engine
       tools to manage the linguistic data necessary to build a machine
       translation system for a given language pair
            little programming knowledge required to get started
            graphical user interfaces that show each step in the translation
            process
            many more advanced tools (e.g. for merging or sorting
            dictionaries)

       linguistic data for a growing number of language pairs
            also usable for other NLP purposes (spelling & grammar checking,
            ...)
CG in Apertium




      Used after morphological analysis for pre-disambiguation in
      Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔
      Scottish Gaelic
      Apertium’s own statistical disambiguator makes a choice if CG
      doesn’t completely disambiguate
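
The division of labour between CG and the statistical tagger can be sketched in Python as follows. The rule is a made-up example, not taken from any of the grammars mentioned, and `pick_first_reading` is only a placeholder for the HMM tagger's choice.

```python
# Each cohort: (surface form, list of (lemma, tags) readings).
COHORTS = [
    ("lese", [("lese", ["vblex", "inf"])]),
    ("en", [
        ("en", ["num", "sg", "mf"]),
        ("ene", ["vblex", "imp"]),
        ("en", ["det", "ind", "mf", "sg"]),
    ]),
]


def remove_imperative_after_infinitive(cohorts):
    """Toy CG-style REMOVE rule (hypothetical): drop imperative readings
    when the previous cohort is unambiguously an infinitive."""
    result = []
    prev_is_inf = False
    for surface, readings in cohorts:
        kept = readings
        if prev_is_inf:
            filtered = [r for r in readings if "imp" not in r[1]]
            if filtered:  # never remove the last remaining reading
                kept = filtered
        result.append((surface, kept))
        prev_is_inf = len(kept) == 1 and "inf" in kept[0][1]
    return result


def pick_first_reading(cohorts):
    """Placeholder for the statistical tagger: one reading per cohort."""
    return [(surface, readings[0]) for surface, readings in cohorts]


after_cg = remove_imperative_after_infinitive(COHORTS)
print(after_cg)                      # "en" is still ambiguous: num or det
print(pick_first_reading(after_cg))  # the fallback makes the final choice
```

The point of the two-step design is that the rules only need to remove readings they are sure about; anything left ambiguous is still resolved downstream.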
CG in Apertium




      Norwegian CG is from the Oslo-Bergen Tagger (GPL)
      Sámi giellatekno provides Free grammars for Sámi languages
      and Faroese
      Irish grammar mostly converted manually from the An Gramadóir
      project (GPL)
      Other grammars made solely by Apertium members
Some statistics




                        Sections    Rules    Sets    Tags

               Welsh    2           98       141     128
               Breton   4           121      125     154
               Irish    1           285      298     292
        Table: Rule counts for some of the CG grammars in Apertium
Same concepts apply between modules




   CG         Apertium/lttoolbox       Apertium stream format
   wordform   surface form             books
   baseform   lemma                    book
   cohort     ambiguous lexical unit   ^books/book<n><pl>
                                       /book<vblex><pres><p3><sg>$
   reading    analysis                 /book<n><pl>/

                     Table: Terminology differences
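
To make the terminology mapping concrete, here is a minimal Python sketch that rewrites a stream-format lexical unit as a CG-style cohort with one reading per line. The function name `stream_to_cg` is hypothetical, and escaping in the stream format is ignored.

```python
import re


def stream_to_cg(text):
    """Render each stream-format lexical unit as a CG cohort."""
    lines = []
    for body in re.findall(r"\^(.*?)\$", text):
        wordform, *readings = body.split("/")
        lines.append('"<%s>"' % wordform)           # wordform = surface form
        for reading in readings:
            baseform = reading.split("<", 1)[0]     # baseform = lemma
            tags = re.findall(r"<([^>]+)>", reading)
            lines.append('\t"%s" %s' % (baseform, " ".join(tags)))
    return "\n".join(lines)


print(stream_to_cg("^books/book<n><pl>/book<vblex><pres><p3><sg>$"))
```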
Same format readable by all modules


        Both SFST/HFST and vislcg3 read and write the Apertium stream
        format.
        Example from the Open Morphology of Finnish, output by the
        Apertium reader in SFST/HFST:

   ^kaikki/kaikki<noun><7><a><sg><nom>$
   ^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
   ^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
   ^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$
   ^tasavertaisina/*tasavertaisina$
   ^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
   ^ja/*ja$
   ^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$
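
In output like the above, unknown words carry an asterisk in place of an analysis (`^ja/*ja$`). A small Python sketch, with a hypothetical function name and a shortened sample, shows how coverage and average ambiguity could be measured from such a stream:

```python
import re


def coverage_and_ambiguity(text):
    """Return (share of known tokens, mean readings per known token)."""
    known, unknown, readings_total = 0, 0, 0
    for body in re.findall(r"\^(.*?)\$", text):
        readings = body.split("/")[1:]
        if readings and readings[0].startswith("*"):
            unknown += 1                  # analyser had no entry
        else:
            known += 1
            readings_total += len(readings)
    total = known + unknown
    coverage = known / total if total else 0.0
    ambiguity = readings_total / known if known else 0.0
    return coverage, ambiguity


sample = ("^kaikki/kaikki<noun><7><a><sg><nom>$ "
          "^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$ "
          "^vapaina/vapaa<noun><17><pl><ess>$ ^ja/*ja$")
print(coverage_and_ambiguity(sample))
```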
Why Apertium


      Rule-based MT
          most languages of the world have little freely available textual
          data, let alone parallel corpora for SMT purposes; Apertium is
          thus suitable for marginalised languages
          Rule-based systems are linguistically interesting, and provide test
          beds for linguistic theory

      Reuse and Interoperability
          Monolingual dictionaries and constraint grammars are directly
          reusable for new language pairs
          apertium-dixtools: generates new language pairs from existing
          ones
          vislcg3 reads and outputs the Apertium stream format, as do
          Stuttgart/Helsinki Finite State Tools
          Free licences allow other systems to use Apertium data and tools
Why Apertium




      Open Source + fairly simple learning curve = great potential for
      contributors
           E.g. Jacob Nordfalk joined Apertium in autumn 2008 and had an
           English → Esperanto pair by March 2009
      Very helpful and accessible community
Future work: dependency-based reordering in Apertium




      Currently, CG is only used for disambiguation
      Many existing constraint grammars also give dependency
      information; this could be integrated into Apertium to provide
      dependency-based reordering, simplifying the transfer step
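
What dependency-based reordering might look like can be sketched in Python. This is an assumption-laden toy, not a proposed design: the token list is written by hand for a five-word Faroese sentence, and the target order (subject, head, object, adverbial) is a hypothetical rule chosen for illustration.

```python
from collections import defaultdict

# (position, form, function, head position); head 0 marks the root.
TOKENS = [
    (1, "Í", "ADVL", 3),
    (2, "upphavi", "P", 1),
    (3, "skapti", "VMAIN", 0),
    (4, "Gud", "SUBJ", 3),
    (5, "himmal", "OBJ", 3),
]

# Hypothetical target order around each head.
BEFORE_HEAD = {"SUBJ": 0}
AFTER_HEAD = {"OBJ": 0, "ADVL": 1}


def linearise(tokens):
    """Re-emit the words by walking the dependency tree in target order."""
    children = defaultdict(list)
    by_pos = {}
    for pos, form, func, head in tokens:
        by_pos[pos] = (form, func)
        children[head].append(pos)

    def visit(pos):
        before, after = [], []
        for child in children[pos]:
            func = by_pos[child][1]
            if func in BEFORE_HEAD:
                before.append((BEFORE_HEAD[func], child))
            elif func in AFTER_HEAD:
                after.append((AFTER_HEAD[func], child))
            elif child < pos:
                before.append((99, child))  # unlisted: keep original side
            else:
                after.append((99, child))
        out = []
        for _, child in sorted(before):
            out.extend(visit(child))
        out.append(by_pos[pos][0])
        for _, child in sorted(after):
            out.extend(visit(child))
        return out

    result = []
    for root in children[0]:
        result.extend(visit(root))
    return result


print(" ".join(linearise(TOKENS)))
```

Because whole subtrees move together, the prepositional phrase stays intact when the adverbial is moved, which is hard to guarantee with purely local pattern rules.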
Future work: integration with Matxin

        Matxin is a Free Software sister project of Apertium which
        currently uses FreeLing for dependency analyses:

   <SENTENCE ord='1'>
   <CHUNK ord='2' type='grup-verb' si='top'>
     <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
     <CHUNK ord='1' type='sn' si='subj'>
       <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
         <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
         <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
       </NODE>
     </CHUNK>
     <CHUNK ord='3' type='sn' si='obj'>
       <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
     </CHUNK>
     <CHUNK ord='4' type='F-term' si='modnomatch'>
       <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
     </CHUNK>
   </CHUNK>
   </SENTENCE>
Future work: integration with Matxin

           We would like to get CG dependency information into a
           Matxin-compatible format.
           Apertium’s CG would handle analysis while Matxin handles the
           transfer step. E.g. given the following analysis (Faroese):


   "<Í>"
           "í" Pr @ADVL> #1->3
   "<upphavi>"
           "upphav" N Neu Sg Dat Indef @P< #2->1
   "<skapti>"
           "skapa" V Ind Prt Sg @VMAIN #3->0
   "<Gud>"
           "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
   "<himmal>"
           "himmal" N Msc Sg Acc Indef @<OBJ #5->3
Future work: integration with Matxin



        ...we would like to get this dependency tree structure:

   <SENTENCE ord='1'>
     <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
       <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
         <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
       </NODE>
       <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
       <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
     </NODE>
   </SENTENCE>


        and let Matxin do reordering and other transfer operations
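
One way such a conversion could work is sketched below in Python. It assumes a fully disambiguated analysis (one reading per cohort) with `#n->m` dependency marks, and it uses only the attribute names shown on the slides; it is not based on the Matxin format specification, and `cg_to_tree` is a hypothetical name. Lemmas and tags are copied from the CG analysis as given, so they can differ from a hand-written target tree.

```python
import re
import xml.etree.ElementTree as ET

CG_ANALYSIS = '''"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''

WORDFORM = re.compile(r'^"<(.*)>"\s*$')
READING = re.compile(r'^\s+"([^"]*)"\s+(.*?)\s+@(\S+)\s+#(\d+)->(\d+)\s*$')


def cg_to_tree(cg_text):
    """Build a nested NODE tree from CG dependency output."""
    nodes, heads, form = {}, {}, None
    for line in cg_text.splitlines():
        m = WORDFORM.match(line)
        if m:
            form = m.group(1)
            continue
        m = READING.match(line)
        if m and form is not None:
            lemma, tags, func, ord_, head = m.groups()
            nodes[int(ord_)] = ET.Element("NODE", {
                "form": form,
                "lem": lemma,
                "ord": ord_,
                "mi": ".".join(tags.split()),
                "si": func.strip("<>"),  # drop the direction arrows
            })
            heads[int(ord_)] = int(head)
    sentence = ET.Element("SENTENCE", {"ord": "1"})
    # Attach in surface order; head 0 (or a missing head) goes to the root.
    for ord_ in sorted(nodes):
        parent = nodes.get(heads[ord_], sentence)
        parent.append(nodes[ord_])
    return sentence


print(ET.tostring(cg_to_tree(CG_ANALYSIS), encoding="unicode"))
```

All nodes are created before any are attached, since a dependent (here "Í") can precede its head in the sentence.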
Thanks for listening!
Licences



   This presentation may be distributed under the terms of the GNU GPL,
   GNU FDL and CC-BY-SA licences.
       GNU GPL v. 3.0
       http://www.gnu.org/licenses/gpl.html
       GNU FDL v. 1.2
       http://www.gnu.org/licenses/gfdl.html
       CC-BY-SA v. 3.0
       http://creativecommons.org/licenses/by-sa/3.0/
