Towards OpenLogos Hybrid Translation
Anabela Barreiro
INESC-ID
anabela.barreiro@inesc-id.pt




Introduction with Contextual Information


 Research goals
  – OpenLogos – 1st hybrid open source machine translation solution

  – Hybridization of the OpenLogos system consists of embedding linguistic
     knowledge into statistical machine translation (SMT)

 The timing is just right…
  – Recognition by SMT researchers and developers of the need to integrate
     linguistic knowledge in machine translation (MT) systems

  – Benefit from cloud computing, big data and advanced alignment techniques,
     which contribute to an easier and faster development of new language pairs

  – Use crowdsourcing support to increase MT quality



Introduction with Contextual Information



 The ideal platform for hybrid translation
  – Logos legacy (one of the first RBMT systems - 1970)

  – Logos Corporation – one of the longest-running commercial MT companies in the
     world (in business for over 30 years)

  – The Logos MT product put its emphasis on semantic understanding

  – The Logos approach was through linguistic analysis of English to render it in a
     form that was “understood” by the computing system

  – To a certain extent, the Logos approach is similar in spirit to the SMT approach,
     and complements SMT by providing answers that help overcome statistical
     weaknesses



Introduction with Contextual Information



 The open source initiative
  – OpenLogos is publicly available as open source software

  – It has some enthusiastic advocates and fervent supporters in different parts of the
     world  who believe that:

     • OpenLogos will be used as the rule-based component of a new linguistically
        enhanced hybrid translation system

     • The open source components of OpenLogos will help the NLP/CL research
        community make scientific advances




Presentation Outline



 Background on OpenLogos MT

 System pipeline architecture

 SAL representation language

 Classic problems with rule-driven systems

 How SAL benefits translation

 Advantages of the OpenLogos architecture

 Uniqueness of the OpenLogos MT system

 Exploiting OpenLogos resources for new applications

 Availability of OpenLogos free resources

Background to OpenLogos



 Open source copy of the Logos system (1970-2001) adapted by DFKI
  – Developed in US, Germany, Italy

  – 25-100 development staff for 30 years

  – + 80 million US Dollar Investment

 8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT,
  GE-EN, GE-FR, GE-IT

 Commercial product was considered high quality

 Industrial strength MT used successfully in 12 countries

 Users included: Ericsson of Sweden, the Canadian Secretary of State, SAP,
  Siemens-Nixdorf, Océ Netherlands, and Union Fenosa

OpenLogos Characteristics


 Multi-target System
  – One source language analysis can generate any number of targets

 Pipeline Architecture

 Language-neutral Software
  – All linguistic knowledge is in data files, stored in a relational database

 Semantico-Syntactic Abstraction Language (SAL Representation)
  – Taxonomy-ontology

  – NL sentences entering the system are immediately converted into SAL sentences

  – SAL is the driving force of the OpenLogos process

 Semantic Processing
  – Semantic Table (= SEMTAB) containing thousands of transformation rules
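
The multi-target characteristic can be pictured with a minimal Python sketch. It assumes a hypothetical `translate_multi` helper and toy analysis and generation functions that are not part of OpenLogos; the only point is that source analysis runs once and every target generator reuses its result.

```python
from typing import Callable, Dict, List

Analysis = List[str]  # stand-in for a SAL sentence; the real representation is richer


def translate_multi(
    sentence: str,
    analyze: Callable[[str], Analysis],
    generators: Dict[str, Callable[[Analysis], str]],
) -> Dict[str, str]:
    """Analyse the source sentence once, then generate every target from it."""
    analysis = analyze(sentence)  # source analysis runs a single time
    return {lang: generate(analysis) for lang, generate in generators.items()}


# Toy stand-ins, purely illustrative.
calls = {"analyze": 0}


def toy_analyze(sentence: str) -> Analysis:
    calls["analyze"] += 1
    return sentence.lower().split()


toy_generators = {
    "FR": lambda analysis: "FR:" + "_".join(analysis),
    "PT": lambda analysis: "PT:" + "_".join(analysis),
}

result = translate_multi("Raise the rent", toy_analyze, toy_generators)
```

Adding a further target in this sketch means adding one more entry to `toy_generators`; the analysis step is untouched.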

OpenLogos Pipeline Architecture


Pipeline stages (diagram):

    Source side:  Input, Format, RES1, RES2, P1, P2, P3, P4, meeting at S
                  (SAL Rules and SEMTAB)
    Target side:  T4, T3, T2, T1, GEN, Format, Output
                  (Target Rules and SEMTAB, repeated for each target language)

•     Highly Modular
•     Incremental Processing
•     Multi-Target System
•     Bottom-up Analysis
•     Deterministic Parse

Incremental Source Analysis - 1


Pipeline entry (diagram): Enter Pipeline, Format, RES1, RES2 (SAL Rules and SEMTAB)

• Clause Segmentation
• Homograph Resolution
      ways of cooking lentils        cooking = V
      types of [cooking utensils]    cooking = ADJ

Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)

Incremental Source Analysis - 2


Parse1 to Parse4 draw on SAL Rules and Semtab and lead to S (diagram):

• Parse1: simple NP; semantic resolution
• Parse2: NP Prep NP; relative clauses; semantic resolution
• Parse3: verb semantics; complex NP; simple clauses; semantic resolution
• Parse4: order in complex sentences; semantic resolution

E.g.: a book on the presidency    (on = about; concerning)
    ≠ a book on the table         (on = over)

SAL Representation Language


SAL - Semantico-syntactic Abstraction Language

 SAL Taxonomy: 3 levels organized hierarchically

  – Supersets / Sets / Subsets

 Semantico-Syntactic continuum from NL word to Word Class
  –   Literal word:       airport
  –   Head morph:         port
  –   SAL Subset:         Agfunc (agentive functional location)
  –   SAL Set:            func    (functional location)
  –   SAL Superset:       PL      (place)
  –   Word Class: N

      Both Pipeline Input Stream and Rulebases are expressed in SAL
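
As a rough illustration of the continuum above, the levels can be modelled as fields of one record, ordered from most specific to most general. This is a minimal Python sketch: the `SalEntry` class and its method names are assumptions made for the example, not the OpenLogos data model. Only the values for "airport" come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalEntry:
    """One lexical entry along the semantico-syntactic continuum (hypothetical model)."""
    literal: str      # literal word, e.g. "airport"
    head_morph: str   # head morph, e.g. "port"
    subset: str       # SAL subset code
    set_: str         # SAL set code
    superset: str     # SAL superset code
    word_class: str   # word class (part of speech)

    def levels(self):
        """Return (level name, value) pairs, most specific first."""
        return [
            ("literal", self.literal),
            ("head_morph", self.head_morph),
            ("subset", self.subset),
            ("set", self.set_),
            ("superset", self.superset),
            ("word_class", self.word_class),
        ]

    def matches(self, level, value):
        """True if this entry carries `value` at the given level."""
        return dict(self.levels()).get(level) == value


airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")
```

A rule written against the superset `PL` would accept this entry via `airport.matches("superset", "PL")`, while a rule written against the literal word would only accept "airport" itself.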

SAL Noun Supersets




Developed:
- inductively
- by trial and error
- over a period of years
- by the development team

E.g.: two pieces of cake
NP parse must have:
- Plural morphology of pieces
- Semantics of cake




Abstract Noun Taxonomy
Abstract Noun Superset
  – Non-verbal Abstract Set
      • Non-verbal Subsets, e.g. Classifications
  – Verbal Abstract Set
      • Verbal Subsets, e.g. Methods / Procedures




Use of SAL Codes to Resolve Homographs



Is the word cooking a verb or an adjective?

              ways of cooking lentils
              types of cooking utensils

ways              N(AB/method)               parser verb bias
types             N(AB/class)                non-verb bias

The SAL code N(AB/method) in the rule matches on a similar code in the SAL
input stream. The effect of such a match is to resolve cooking as a verb.

SAL contributes to the resolution of the homograph.
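
The two examples can be sketched as a lookup on the SAL code of the governing noun. This is a simplified Python illustration, not the RES rulebase: the tables `NOUN_SAL` and `ING_READING` and the function `resolve_ing` are hypothetical names, and only the "N of X-ing N" frame from the slide is handled.

```python
# Hypothetical mini-lexicon: noun -> SAL code, taken from the two slide examples.
NOUN_SAL = {
    "ways": "AB/method",   # verbal abstract noun: biases the parser towards a verb
    "types": "AB/class",   # non-verbal abstract noun: biases towards a non-verb
}

# Hypothetical rule table: SAL code of the governing noun -> reading of the -ing form.
ING_READING = {
    "AB/method": "V",
    "AB/class": "ADJ",
}


def resolve_ing(tokens, index):
    """Resolve the -ing homograph at tokens[index] in an 'N of X-ing N' frame.

    Returns "V", "ADJ", or None when no rule applies.
    """
    if index < 2 or tokens[index - 1] != "of":
        return None
    sal_code = NOUN_SAL.get(tokens[index - 2])
    return ING_READING.get(sal_code)


verb_reading = resolve_ing("ways of cooking lentils".split(), 2)
adj_reading = resolve_ing("types of cooking utensils".split(), 2)
```

Nouns missing from the toy lexicon yield `None`, which stands in for "no rule fired" in this sketch.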

What SAL Rules Look Like


                              Rules Have Five Components
 SAL Pattern
  – PARSE2 example:           N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
 Constraints
  – Match only if conditions are true or false
 Source Actions
  – RES Rulebase:             Resolves syntactic ambiguity
  – PARSE Rulebase:           Creates parse tree
  – SEMTAB Rules:             Effects semantic disambiguation
 Target Action (optional)
  – Effects syntactic and/or semantic transfer
 Comment Line
  – PARSE2 example:           NP(info) Prep(“on”) NP → N1 “about” N2
                              E.g., book on political satire → book about ...
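
A minimal Python sketch of a rule carrying the five components listed above. The `Rule` class, the reading of "u" as "unspecified", the action strings, and the constraint on concrete nouns (code "CO" here) are all assumptions made for illustration; the slide gives only the pattern and the comment line.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A SAL element is modelled here as (word_class, sal_code, literal).
Element = Tuple[str, str, str]
# A pattern slot uses None where the slide writes "u" (assumed to mean unspecified).
Slot = Tuple[str, Optional[str], Optional[str]]


@dataclass
class Rule:
    pattern: List[Slot]                          # 1. SAL pattern
    constraint: Callable[[List[Element]], bool]  # 2. constraints
    source_action: str                           # 3. source action
    target_action: Optional[str]                 # 4. target action (optional)
    comment: str                                 # 5. comment line

    def matches(self, elements: List[Element]) -> bool:
        """True if every slot accepts its element and the constraint holds."""
        if len(elements) != len(self.pattern):
            return False
        for (wc, code, lit), (p_wc, p_code, p_lit) in zip(elements, self.pattern):
            if wc != p_wc:
                return False
            if p_code is not None and code != p_code:
                return False
            if p_lit is not None and lit != p_lit:
                return False
        return self.constraint(elements)


book_on = Rule(
    pattern=[("N", "IN/data", None), ("Prep", None, "on"), ("N", None, None)],
    constraint=lambda els: els[2][1] != "CO",  # hypothetical: skip concrete nouns
    source_action="build an NP node over the three elements",
    target_action='N1 "about" N2',
    comment='NP(info) Prep("on") NP -> N1 "about" N2',
)

presidency = [("N", "IN/data", "book"), ("Prep", "", "on"), ("N", "AB", "presidency")]
table = [("N", "IN/data", "book"), ("Prep", "", "on"), ("N", "CO", "table")]
```

Under these assumptions `book_on.matches(presidency)` holds and `book_on.matches(table)` does not, mirroring the "on = about" versus "on = over" contrast shown earlier.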


Classic Problem of RBMT


 Complexity
  – Logic saturation
  – Rulebase grows too large
  – Performance degradation
  – Difficult maintainability
  – System improvability stasis


 Ambiguity
  – Quality/accuracy of output – depends on effective disambiguation
  – Effective disambiguation causes rulebase growth


 Classic Dilemma of the Developer
  – Reducing rulebase size to relieve complexity weakens disambiguation
  – Increasing rulebase size to address ambiguities increases complexity

How OpenLogos Addresses Complexity and Ambiguity


 Complexity
  – Rules and input stream are expressed as SAL patterns

  – Homogeneous ‘apples-to-apples’ matching

  – Rules are SAL patterns stored/organized in an indexed pattern dictionary

  – SAL input stream serves as search argument to SAL rulebase

  – No limit on rule size and no impact on performance

  – Rules are self organizing

  – Rulebase is easy to maintain
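
The idea of an indexed pattern dictionary, with the SAL input stream used as the search argument, can be sketched in Python as follows. The `PatternIndex` class is a hypothetical simplification that assumes exact matching on whole pattern keys; the actual OpenLogos matching is not described at this level of detail in the slides.

```python
from collections import defaultdict


class PatternIndex:
    """Rules stored under their SAL pattern, like entries in a dictionary."""

    def __init__(self):
        self._by_pattern = defaultdict(list)
        self.size = 0

    def add(self, pattern, rule_name):
        """File a rule under its pattern; no ordering work is left to the developer."""
        self._by_pattern[tuple(pattern)].append(rule_name)
        self.size += 1

    def lookup(self, stream, start, max_len=3):
        """Return rules whose pattern equals a prefix of stream[start:], longest first."""
        hits = []
        for length in range(min(max_len, len(stream) - start), 0, -1):
            key = tuple(stream[start:start + length])
            hits.extend(self._by_pattern.get(key, []))
        return hits


index = PatternIndex()
index.add(["N(IN/data)", "Prep(on)", "N"], "book-on-topic")
index.add(["N(IN/data)"], "info-noun")
for i in range(1000):                      # unrelated rules never get inspected
    index.add([f"X{i}"], f"filler-{i}")

stream = ["N(IN/data)", "Prep(on)", "N"]
hits = index.lookup(stream, 0)
```

In this sketch a lookup performs at most `max_len` dictionary probes regardless of how many rules are stored, which is the sense in which rulebase growth need not slow matching down.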



How Rules Are Applied


       Metaphor: biological neural net

As the analysis progresses:
  1. cells become fewer (abstract nature of the parse)
  2. vectors become lighter (semantic disambiguation)

– Vectors labeled V1-V6 = SAL input stream of the pipeline
– Cells in input vectors = SAL elements/words to which the NL input stream has been
  converted
– In this network, R1 through P4 = hidden layers containing SAL rules
– R1 represents RES1, P1 represents Parse1 and so on.
– Each hidden layer contains between 2,000 and 4,000 rules, organized by their SAL
  pattern, as in a dictionary.

How Rules Are Applied


    Metaphor: biological neural net




 Chief similarity
  – Efficient interaction between the SAL input stream and the rules of the
     hidden layers

  – Only those rules which should be looked at are accessed

  – The developer does not need to develop metarules or discrimination
     networks to achieve efficiency in rule matching

  – Efficiency in rule matching is an automatic by-product of system design




How OpenLogos Addresses Complexity and Ambiguity



 Ambiguity

  – Syntactic Homograph Resolution

  – Scoping of adjectives, prepositions

  – Polysemy




Resolution of Polysemy in OpenLogos


      SAL Representation Language in interaction with SEMTAB

SEMTAB provides a transfer that overrides the default dictionary transfer
for the verb “raise”


NL String          SEMTAB Rule                            Portuguese Transfer
raise a child          V(‘raise’) N(ANdes)                        criar. . .
raise corn             V(‘raise’) N(MAedib)                       cultivar. . .
raise the rent         V(‘raise’) N(MEabs)                        aumentar. . .
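
The override behaviour in the table can be sketched as a two-step lookup in Python. The three SEMTAB entries and their Portuguese transfers come from the slide; the default transfer "levantar", the `OBJECT_SAL` lexicon, and the function name are assumptions added for the example.

```python
# Default dictionary transfer. The Portuguese default for "raise" is an assumption;
# the slide does not state it.
DEFAULT_TRANSFER = {"raise": "levantar"}

# SEMTAB entries from the slide: (verb, SAL code of the object) -> Portuguese transfer.
SEMTAB = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

# Hypothetical noun lexicon giving the SAL code of each object noun.
OBJECT_SAL = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}


def transfer_verb(verb, obj):
    """Return the SEMTAB transfer if a rule matches, else the dictionary default."""
    code = OBJECT_SAL.get(obj)
    return SEMTAB.get((verb, code), DEFAULT_TRANSFER[verb])


transfers = [transfer_verb("raise", noun) for noun in ("child", "corn", "rent", "hand")]
```

Because the rules key on the SAL code and not on the literal noun, any other noun carrying the code `MEabs` would also select "aumentar" in this sketch.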




Deep Structure Rules of SEMTAB



        A single deep-structure rule matches multiple surface-structures
                     and produces correct target transfers



he raised the rent              ele aumentou a renda                V+Object
the raising of the rent         o aumento da renda                  Gerund
the rent, raised by …           a renda, aumentada por…             Part. ADJ
a rent raise                    um aumento de renda                 Noun




How SAL Benefits Translation


            Examples showing
          voice transformations

  EN passive voice >>> FR active voice

The situation was alluded to by my friend in his letter
Mon ami a fait allusion à la situation dans sa lettre

The situation was alluded to in their letter
On a fait allusion à la situation dans leur lettre


                                    Voice transformations are possible due to:
                                    • incremental pipeline approach
                                    • strong semantic sensitivity

Advantages of OpenLogos Machine Translation Architecture



 Creation of systems involving small or neglected/endangered languages
  – not targeted by commercial programs
  – to fulfil the goals of administrations and NGOs dealing with these
     languages, contributing to their promotion and/or revival
 Freely available
  – any user can access the technology
 Customizable - institutions or businesses adopting an open-source MT can
  customize the system to their needs in many ways
  – developing new linguistic data (vocabularies, rules, corpora)
  – integrating system/data with other packages
  – etc.



OpenLogos Uniqueness



 Extensible dictionaries with underlying semantic foundation
 Analyses whole source sentences, considering:
  – Morphology
  – Meaning (semantics)
  – Grammatical structure and function
 Semantico-Syntactic Abstraction Language (SAL)
  – the parser is able to achieve better results than syntactic analysis alone
     would allow.
 Parsing is only source language specific; generation is target language
  specific
 Originally a transfer approach, evolved to the present system (which has
  interlingual features inherent to the system)



OpenLogos Uniqueness


 OpenLogos' comprehensive analysis makes it possible to construct a complete and
  idiomatically correct translation in the target language
 OpenLogos is suitable for research and academic use
  – make OpenLogos the standard MT platform for universities, education and
     other governmental institutions
  – bring new life into a dormant technology (Phoenix rising metaphor)
 OpenLogos linguistic data representation can be established as the
  foundation
  – freely available for private and commercial use
  – there is still need for the provision of linguistic and technical services
     and/or customer support on a fee basis
  – packaging OpenLogos with the top five Linux distributions will generate a
     constant revenue stream
 OpenLogos is an ideal platform for a hybrid MT solution

Contribution of OpenLogos Resources for New NLP Applications

Initially, OpenLogos EN-PT dictionary data were adapted and enhanced with new
properties (derivational, etc.) to create a new resource:
Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/).
ReEscreve uses Port4NooJ.
  SPIDER
   – System for Paraphrasing In Document Editing and Revision.
   – Based on NooJ’s technology (http://www.nooj4nlp.net/)
   – Publicly available at: http://www.linguateca.pt/ReEscreve/
   – Designed to help with writing optimization, but its applicability extends to MT
     pre-editing.

        1st version – ReEscreve (for Portuguese) and ReWriter (for English)
        2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
         Designed for integration in a cyber school project within the scope of an
         educational program to teach students how to improve their writing skills in
         the Portuguese language
        EXPERT (prototype) - to assist writing of domain-specific texts

Contribution of OpenLogos Resources for New NLP Applications


  ParaMT
   – Bilingual/multilingual paraphraser (translator prototype)
   – Uses similar methodology to that employed by SPIDER
   – Uses bilingual data
   – Directly applicable to MT

  Corpógrafo
   – Multilingual corpora management tool
   – Available at: http://www.linguateca.pt/corpografo/




Uses of SPIDER



–   Authoring aid (word processing applications)
–   Language composition tool
–   Text production and style editor
–   Empirical testbed for linguistic quality assurance
–   Text (pre-)editing (machine translation)
–   “Revision memory” tool (≈ “translation memory”)
–   Applicable to general and technical language
    When integrating terminologies, it helps writing in technical domains
    (e.g. student texts - ReWriter or legal texts - EXPERT)




ReEscreve: Suggestions for Text Rewriting

Paraphrases of SVC presented by ReEscreve’s paraphrasing system




ReEscreve: a Rewritten Text


Text rewritten based on the user’s preferences

Users can suggest new expressions!




Suggestions for Text ReWriting

Suggestions for general language linguistic phenomena:
– Compound adverbs > single adverbs
– Relatives > participial adjectives
– Support verb constructions > single verbs




Selection of paraphrasing grammars for specific linguistic phenomena
 Users can select among general and technical dictionaries (more than one selection allowed)
  and grammars for specific linguistic transformations (one, several or all grammars can be selected).
  The interface provides sample texts for testing.

Screenshot labels: informative details about the linguistic resources selected; sample LEGAL text




Selection of a Domain Dictionary




                                                  Identification of legal terms in the text




Suggestions for the term “breach of law”

Users can select one term from the list of suggestions or provide a new suggestion

Suggestions provided and user’s capability to add new rewriting options




                                                                          The user can suggest new words or
                                                                        expressions (synonyms or paraphrases)

                                                                          It is possible to go back and change the
                                                                          user option as many times as necessary

Text rewritten:
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user


ParaMT: a Paraphraser Applicable to MT

PT support verb construction   >   EN verbs




MACHINE TRANSLATION: recognition of Portuguese SVC and translation into English verbs




Selected Publications on Paraphrasing Applications


 Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision -
     Applicability in Machine Translation Pre-Editing". Computational Linguistics and
     Intelligent Text Processing. Proceedings of the 12th International Conference 6609 (2011),
     pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349.
     DOI: 10.1007/978-3-642-19400-9. Part II, Lecture Notes in Computer Science

 Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera
     Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational
     Processing of the Portuguese Language, 8th International Conference, Proceedings
     (PROPOR 2008) Vol. 5190, (Aveiro, Portugal, 8-10 September 2008), Springer Verlag.
     Lecture Notes in Computer Science, pp. 202-211.

 Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose
     paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets &
     Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New
     Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa,
     Ontario, Canada, 29 August 2009), pp. 1-8.




OpenLogos for Indian Languages




 Anusaaraka group at LTRC, IIIT-Hyderabad

  – Integrating OpenLogos in their English to Hindi Language accessor

  – An OpenLogos-based English-Hindi MT prototype is already functional,
    but needs refinement before release

     Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based
     Machine Translation System". In Proceedings of 2010 IEEE International Conference on
     Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing,
     China, Aug 21-23, 2010.

 Kalinga Institute of Industrial Technology, KIIT

  – Setting up a research lab with MT based on OpenLogos technology


Other Efforts with OpenLogos


 Department of Political, Social and Communication Sciences,
  University of Salerno

  – PhD dissertation where the OpenLogos English-Italian SEMTAB rules
    methodology was applied, supported with the NooJ NLP environment to
    represent the theoretical and methodological principles of the Lexicon-
    Grammar Theory

    Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and
    using linguistic resources for multi-word unit processing in Machine Translation

 Southern African main universities

  – Initial efforts to introduce OpenLogos as an MT platform for translation
    between English and the African languages (scarce resources, lack of
    parallel corpora, etc.), in an initiative similar to the one undertaken for
    Indian languages

OpenLogos Resources at DFKI



 The Language Technology Lab of DFKI has adapted OpenLogos from the
  commercial Logos System

 Also at Sourceforge under a GPL license
  http://openlogos-mt.sourceforge.net/

 OpenLogos employs only open source components:

  – Use of open source development tools and compilers, such as GCC
  – Replacement of non-open code and libraries
  – Use of open source databases instead of a commercial database. All
    language specific resources have been converted to PostgreSQL
  – Use of open standards instead of vendor specific protocols
  – As a proof of concept for the software migration, Linux is used as target
    platform for the first open source release of Logos

OpenLogos Components




   Core code libraries of the server side system and basic executables to start
    and run the system (APITest, logos_batch)

   Resources, such as analysis (RES) and transfer (TRAN) grammars for
    source and target languages, and a multi-language dictionary database

   Tools: LogosTermBuilder, User administration (LogosAdmin), Command
    line tools (APITest, openlogos), and multi-user GUI for initiating and
    inspecting translation jobs and results (LogosTransCenter)




DFKI User Assistance with OpenLogos




 DFKI hosts an open OpenLogos mailing list dedicated to discussion
  and exchange of information concerning OpenLogos developments and
  problems at:

  http://www.dfki.de/mailman/listinfo/openlogos-list



 LinkedIn Discussion Group on OpenLogos Machine Translation

 OpenLogos Facebook page




Selected Publications


A few publications and technical papers are available, describing:

 the SAL representation language

 the system architecture and workflow

Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based
   Machine Translation: Philosophy, Model, Resources, and Customization. In Machine
   Translation, volume 25 number 2, Pages 107-126, Springer, Heidelberg, 2011.
   ISSN: 0922-6567. DOI: 10.1007/s10590-011-9091-z

Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language.
    In Proceedings of the First International Workshop on Free/Open-Source Rule-Based
    Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis
    M. Tyers. Alicante, Spain: Universidad de Alicante. Departamento de Lenguajes y Sistemas
    Informáticos. 2–3 November 2009, pp. 19–26

Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18
    (2003), pp. 1–72.

Weitere ähnliche Inhalte

Ähnlich wie Towards OpenLogos Hybrid Machine Translation - Anabela Barreiro

Towards OpenLogos Hybrid Machine Translation - Anabela Barreiro

  • 1. Towards OpenLogos Hybrid Translation Anabela Barreiro INESC-ID anabela.barreiro@inesc-id.pt 1
  • 2. Introduction with Contextual Information. Research goals: – OpenLogos – first hybrid open source machine translation solution – Hybridization of the OpenLogos system consists of embedding linguistic knowledge into statistical machine translation (SMT). The timing is just right: – Recognition by SMT researchers and developers of the need to integrate linguistic knowledge in machine translation (MT) systems – Benefit from cloud computing, big data and advanced alignment techniques, which contribute to an easier and faster development of new language pairs – Use of crowdsourcing support to increase MT quality
  • 3. Introduction with Contextual Information  The ideal platform for hybrid translation – Logos legacy (one of the first RBMT systems - 1970) – Logos Corporation – one of the longest run commercial MT companies in the world (in business for over 30 years) – The Logos MT product put its emphasis on semantic understanding – The Logos approach was through linguistic analysis of English to render it in a form that was “understood” by the computing system – To a certain extent, the Logos approach is similar in spirit to the SMT approach, and complements SMT by providing answers that help overcome statistical weaknesses 3
  • 4. Introduction with Contextual Information  The open source initiative – OpenLogos is publicly available as open source software – It has some enthusiastic advocates and fervent supporters in different parts of the world  who believe that: • OpenLogos will be used as the rule-based component of a new linguistically enhanced hybrid translation system • The open source components of the OpenLogos will help the NLP/CL research community make scientific advances 4
  • 5. Presentation Outline  Background on OpenLogos MT  System pipeline architecture  SAL representation language  Classic problems with rule-driven systems  How SAL benefits translation  Advantages of the OpenLogos architecture  Uniqueness of the OpenLogos MT system  Exploiting OpenLogos resources for new applications  Availability of OpenLogos free resources 5
  • 6. Background to OpenLogos. Open source copy of the Logos system (1970-2001) adapted by DFKI – Developed in US, Germany, Italy – 25-100 development staff for 30 years – 80+ million US dollar investment. 8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT, GR-EN, GE-FR, GE-IT. Commercial product was considered high quality. Industrial-strength MT used successfully in 12 countries. Users included: Ericsson of Sweden, the Canadian Secretary of State, SAP, Siemens-Nixdorf, Oce Netherlands, and Union Fenosa
  • 7. OpenLogos Characteristics  Multi-target System – One source language analysis can generate any number of targets  Pipeline Architecture  Language-neutral Software – All linguistic knowledge is in data files, stored in a relational database  Semantico-Syntactic Abstraction Language (SAL Representation) – Taxonomy-ontology – NL sentences entering the system are immediately converted into SAL sentences – SAL is the driving force of the OpenLogos process  Semantic Processing – Semantic Table (= SEMTAB) containing thousands of transformation rules 7
  • 8. OpenLogos Pipeline Architecture [Pipeline diagram: Input, Format, RES1, RES2, P1-P4 (SAL rules, SEMTAB), S, T1-T4, GEN (target rules and SEMTAB for each target), Format, Output.] • Highly Modular • Incremental Processing • Multi-Target System • Bottom-up Analysis • Deterministic Parse
  • 9. Incremental Source Analysis - 1 [Pipeline diagram: Enter Pipeline, Format, RES1, RES2, with SAL rules and SEMTAB.] Tasks shown: Clause Segmentation; Homograph Resolution, e.g. ways of cooking lentils – V; types of [cooking utensils] – ADJ. Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)
  • 10. Incremental Source Analysis - 2 [Pipeline diagram: Parse1 to Parse4, with SAL rules and SEMTAB.] Parse1: simple NP; semantic resolution. Parse2: NP Prep NP; complex NP; semantic resolution. Parse3: relative clauses; simple clauses; semantic resolution. Parse4: verb semantics; order in complex sentences; semantic resolution. E.g.: a book on the presidency (on = about; concerning) ≠ a book on the table (on = over)
  • 11. SAL Representation Language SAL - Semantico-syntactic Abstraction Language  SAL Taxonomy: 3 levels organized hierarchically – Supersets / Sets / Subsets  Semantico-Syntactic continuum from NL word to Word Class – Literal word: airport – Head morph: port – SAL Subset: Agfunc (agentive functional location) – SAL Set: func (functional location) – SAL Superset: PL (place) – Word Class: N Both Pipeline Input Stream and Rulebases are expressed in SAL 11
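The semantico-syntactic continuum on this slide can be pictured as a small record type. The following is only an illustrative sketch: the `SalEntry` class, its field names and the `matches_level` helper are assumptions made for this example, not OpenLogos data structures. Only the values for airport come from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SalEntry:
    """One word described at every level of the SAL continuum (hypothetical layout)."""
    word: str        # literal word
    head_morph: str  # head morph
    subset: str      # SAL subset
    set_: str        # SAL set
    superset: str    # SAL superset
    word_class: str  # word class

    def abstraction_chain(self) -> List[str]:
        # Ordered from most specific (literal word) to most abstract (word class).
        return [self.word, self.head_morph, self.subset,
                self.set_, self.superset, self.word_class]


def matches_level(entry: SalEntry, code: str) -> bool:
    """True if a code matches the entry at any level of abstraction."""
    return code in entry.abstraction_chain()


# Values taken from the slide.
airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")
```

With this layout, a pattern element could refer to the same word as loosely as `N` or as tightly as `Agfunc`, which is one way to read the statement that both the input stream and the rulebases are expressed in SAL.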
  • 12. SAL Noun Supersets [Figure: SAL noun supersets.] Developed: - inductively - by trial and error - over a period of years - by the development team. E.g.: two pieces of cake. NP parse must have: - plural morphology of pieces - semantics of cake
  • 13. Abstract Noun Taxonomy [Diagram: Abstract Noun Superset → Non-verbal Abstract Set → Non-verbal Subsets (Classifications); Abstract Noun Superset → Verbal Abstract Set → Verbal Subsets (Methods / Procedures)]
  • 14. Use of SAL Codes to Resolve Homographs Is the word cooking a verb or an adjective? ways of cooking lentils; types of cooking utensils. ways → N(AB/method) → parser verb bias; types → N(AB/class) → non-verb bias. SAL contributes to the resolution of the homograph: the SAL code N(AB/method) in the rule matches on a similar code in the SAL input stream. The effect of such a match is to resolve cooking as a verb
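The homograph decision described on this slide can be sketched in a few lines. This is a toy illustration only: the lexicon, the `VERB_BIAS_CODES` set and the function name are assumptions for this example, and the real RES rulebases are certainly richer. The two SAL codes and the two phrases come from the slide.

```python
# Toy SAL lexicon: only these two codes appear on the slide.
NOUN_SAL = {
    "ways": "AB/method",
    "types": "AB/class",
}

# Assumption: SAL codes that bias the parser towards a verb reading.
VERB_BIAS_CODES = {"AB/method"}


def resolve_ing_homograph(phrase: str) -> str:
    """Return 'V' or 'ADJ' for the -ing word in an 'N of X-ing N' phrase."""
    head, of, ing_word, _following = phrase.split()
    if of != "of" or not ing_word.endswith("ing"):
        raise ValueError("expected a phrase of the form 'N of X-ing N'")
    # The SAL code of the head noun decides the reading of the -ing word.
    return "V" if NOUN_SAL.get(head) in VERB_BIAS_CODES else "ADJ"
```

Calling `resolve_ing_homograph("ways of cooking lentils")` gives the verb reading, while `"types of cooking utensils"` falls through to the adjective reading.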
  • 15. What SAL Rules Look Like. Rules Have Five Components. SAL Pattern – PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency). Constraints – Match only if conditions are true or false. Source Actions – RES Rulebase: Resolves syntactic ambiguity – PARSE Rulebase: Creates parse tree – SEMTAB Rules: Effects semantic disambiguation. Target Action (optional) – Effects syntactic and/or semantic transfer. Comment Line – PARSE2 example: NP(info) Prep(“on”) NP → N1 “about” N2. E.g., book on political satire → book about ...
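The five rule components listed on this slide map naturally onto a record with five fields. The sketch below is hypothetical: `SalRule`, the tuple layouts and the matching logic are assumptions for illustration, and the empty SAL codes are placeholders. Only the pattern and comment line of the PARSE2 example come from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A SAL input element: (literal word, word class, SAL code).
Token = Tuple[str, str, str]
# A pattern element: (word class, required SAL code or None, required literal or None).
PatternElement = Tuple[str, Optional[str], Optional[str]]


@dataclass
class SalRule:
    pattern: List[PatternElement]                        # 1. SAL pattern
    constraints: List[Callable[[List[Token]], bool]] = field(default_factory=list)  # 2. constraints
    source_action: str = ""                              # 3. source action (described, not executed)
    target_action: Optional[Callable[[List[Token]], str]] = None  # 4. optional target action
    comment: str = ""                                    # 5. comment line

    def matches(self, tokens: List[Token]) -> bool:
        if len(tokens) != len(self.pattern):
            return False
        for (word, word_class, sal), (p_class, p_sal, p_literal) in zip(tokens, self.pattern):
            if word_class != p_class:
                return False
            if p_sal is not None and sal != p_sal:
                return False
            if p_literal is not None and word != p_literal:
                return False
        # All constraints must hold for the rule to fire.
        return all(check(tokens) for check in self.constraints)


# PARSE2 example from the slide: N(IN/data) Prep("on") N -> N1 "about" N2.
book_on = SalRule(
    pattern=[("N", "IN/data", None), ("Prep", None, "on"), ("N", None, None)],
    source_action="attach the prepositional phrase to the first noun",
    target_action=lambda toks: f"{toks[0][0]} about {toks[2][0]}",
    comment='NP(info) Prep("on") NP -> N1 "about" N2',
)

# Placeholder SAL codes ("") are used where the slide gives none.
tokens = [("book", "N", "IN/data"), ("on", "Prep", ""), ("satire", "N", "")]
```

As written, this pattern would also fire on a book on the table, because the second noun is unconstrained. The slide does not say how that case is excluded; in this sketch it would have to be handled by an entry in `constraints`.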
  • 16. Classic Problem of RBMT. Complexity – Logic saturation – Rulebase grows too large – Performance degradation – Difficult maintainability – System improvability stasis. Ambiguity – Quality/accuracy of output depends on effective disambiguation – Effective disambiguation causes rulebase growth. Classic Dilemma of the Developer – Reducing rulebase size to relieve complexity weakens disambiguation – Increasing rulebase size to address ambiguities increases complexity
  • 17. How OpenLogos Addresses Complexity and Ambiguity  Complexity – Rules and input stream are expressed as SAL patterns – Homogeneous ‘apples-to-apples’ matching – Rules are SAL patterns stored/organized in an indexed pattern dictionary – SAL input stream serves as search argument to SAL rulebase – No limit on rule size and no impact on performance – Rules are self organizing – Rulebase is easy to maintain 17
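The idea of using the SAL input stream as a search argument into an indexed pattern dictionary can be sketched as follows. This is an assumption-laden illustration: the `RuleIndex` class, indexing on the first pattern element, and preferring the longest match are choices made for this example and are not described in the slides.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Pattern = Tuple[str, ...]


class RuleIndex:
    """Rules stored like dictionary entries, keyed by the first SAL element of their pattern."""

    def __init__(self) -> None:
        self._by_first: Dict[str, List[Tuple[Pattern, str]]] = defaultdict(list)

    def add(self, pattern: Pattern, name: str) -> None:
        self._by_first[pattern[0]].append((pattern, name))

    def lookup(self, stream: Sequence[str], pos: int) -> Optional[str]:
        """Return the name of the longest rule matching the stream at pos, or None."""
        best: Optional[Tuple[Pattern, str]] = None
        # Only rules sharing the first element are ever examined.
        for pattern, name in self._by_first.get(stream[pos], []):
            window = tuple(stream[pos:pos + len(pattern)])
            if window == pattern and (best is None or len(pattern) > len(best[0])):
                best = (pattern, name)
        return best[1] if best else None


index = RuleIndex()
index.add(("N(IN/data)",), "noun-only")
index.add(("N(IN/data)", 'Prep("on")', "N"), "book-on")
stream = ["N(IN/data)", 'Prep("on")', "N"]
```

Because lookup cost depends on the number of rules sharing a key rather than on the total size of the rulebase, adding rules under other keys does not slow down this match, which is the property the slide attributes to the design.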
  • 18. How Rules Are Applied. Metaphor: biological neural net. As the analysis progresses: 1- cells become fewer (abstract nature of the parse) 2- vectors become lighter (semantic disambiguation) – Vectors labeled V1-V6 = SAL input stream of the pipeline – Cells in input vectors = SAL elements/words to which the NL input stream has been converted – In this network, R1 through P4 = hidden layers containing SAL rules – R1 represents RES1, P1 represents Parse1 and so on. – Each hidden layer contains between 2 and 4 thousand rules, organized by their SAL pattern, as in a dictionary.
  • 19. How Rules Are Applied Metaphor: biological neural net  Chief similarity – Efficient interaction between the SAL input stream and the rules of the hidden layers – Only those rules which should be looked at are accessed – The developer does not need to develop metarules or discrimination networks to achieve efficiency in rule matching – Efficiency in rule matching is an automatic by-product of system design 19
  • 20. How OpenLogos Addresses Complexity and Ambiguity  Ambiguity – Syntactic Homograph Resolution – Scoping of adjectives, prepositions – Polysemy 20
  • 21. Resolution of Polysemy in OpenLogos. SAL Representation Language in interaction with SEMTAB. SEMTAB provides a transfer that overrides the default dictionary transfer for the verb “raise”. NL String → SEMTAB Rule → Portuguese Transfer: raise a child → V(‘raise’) N(ANdes) → criar...; raise corn → V(‘raise’) N(MAedib) → cultivar...; raise the rent → V(‘raise’) N(MEabs) → aumentar...
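The override behaviour described on this slide amounts to a lookup keyed on the verb and the SAL code of its object, with a fallback to the dictionary default. The sketch below is illustrative only: the table layout and function name are assumptions, and the default transfer `levantar` is a placeholder chosen for this example, since the slide does not give the default. The three SEMTAB entries come from the slide.

```python
from typing import Dict, Tuple

# SEMTAB entries from the slide: (verb, SAL code of the object) -> Portuguese transfer.
SEMTAB: Dict[Tuple[str, str], str] = {
    ("raise", "ANdes"): "criar",      # raise a child
    ("raise", "MAedib"): "cultivar",  # raise corn
    ("raise", "MEabs"): "aumentar",   # raise the rent
}

# Assumption: placeholder default dictionary transfer, not taken from the slides.
DEFAULT_TRANSFER: Dict[str, str] = {"raise": "levantar"}


def transfer_verb(verb: str, object_sal: str) -> str:
    """Use the SEMTAB transfer if a rule matches, otherwise the dictionary default."""
    if (verb, object_sal) in SEMTAB:
        return SEMTAB[(verb, object_sal)]
    return DEFAULT_TRANSFER.get(verb, verb)
```

The key point is that the decision depends on the SAL class of the object, not on the literal noun, so one entry covers every noun carrying that code.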
  • 22. Deep Structure Rules of SEMTAB. A single deep-structure rule matches multiple surface structures and produces correct target transfers: he raised the rent → ele aumentou a renda (V+Object); the raising of the rent → o aumento da renda (Gerund); the rent, raised by … → a renda, aumentada por… (Part. ADJ); a rent raise → um aumento de renda (Noun)
  • 23. How SAL Benefits Translation. Examples showing voice transformations, EN passive voice >>> FR active voice: The situation was alluded to by my friend in his letter → Mon ami a fait allusion à la situation dans sa lettre; The situation was alluded to in their letter → On a fait allusion à la situation dans leur lettre. Voice transformations are possible due to: • incremental pipeline approach • strong semantic sensitivity
  • 24. Advantages of OpenLogos Machine Translation Architecture  Creation of systems involving small or neglected/endangered languages – not targeted by commercial programs – to fulfil the goals of administrations and NGOs dealing with these languages, contributing to their promotion and/or revival  Freely available – any user can access the technology  Customizable - institutions or businesses adopting an open-source MT can customize the system to their needs in many ways – developing new linguistic data (vocabularies, rules, corpora) – integrating system/data with other packages – etc. 24
  • 25. OpenLogos Uniqueness  Extensible dictionaries with underlying semantic foundation  Analyses whole source sentences, considering: – Morphology – Meaning (semantics) – Grammatical structure and function  Semantico-Syntactic Abstraction Language (SAL) – the parser is able to achieve better results than syntactic analysis alone would allow.  Parsing is only source language specific; generation is target language specific  Originally a transfer approach, evolved to the present system (which has interlingual features inherent to the system) 25
  • 26. OpenLogos Uniqueness. OpenLogos's comprehensive analysis makes it possible to construct a complete and idiomatically correct translation in the target language. OpenLogos is suitable for research and academic use – make OpenLogos the standard MT platform for universities, education and other governmental institutions – bring new life into a dormant technology (Phoenix rising metaphor). OpenLogos linguistic data representation can be established as the foundation – freely available for private and commercial use – there is still a need for the provision of linguistic and technical services and/or customer support on a fee basis – packaging OpenLogos with the top five Linux distributions will generate a constant revenue stream. OpenLogos is an ideal platform for a hybrid MT solution
  • 27. Contribution of OpenLogos Resources for New NLP Applications Initially, OpenLogos EN-PT dictionary data were adapted and enhanced with new properties (derivational, etc.) to create a new resource: Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/). ReEscreve uses Port4NooJ.  SPIDER – System for Paraphrasing In Document Editing and Revision. – Based on NooJ’s technology (http://ww.nooj4nlp.net/) – Publicly available at: http://www.linguateca.pt/ReEscreve/ – Designed to help with writing optimization, but its applicability extends to MT pre-editing.  1st version – ReEscreve (for Portuguese) and ReWriter (for English)  2nd version – eSPERTo (Portuguese: the smart/clever one; expert) Designed for integration in a cyber school project within the scope of an educational program to teach students how to improve their writing skills in the Portuguese language  EXPERT (prototype) - to assist writing of domain-specific texts 27
  • 28. Contribution of OpenLogos Resources for New NLP Applications. ParaMT – Bilingual/multilingual paraphraser (translator prototype) – Uses similar methodology to that employed by SPIDER – Uses bilingual data – Directly applicable to MT. Corpógrafo – Multilingual corpora management tool – Available at: http://www.linguateca.pt/corpografo/
  • 29. Uses of SPIDER – Authoring aid (word processing applications) – Language composition tool – Text production and style editor – Empirical testbed for linguistic quality assurance – Text (pre-)editing (machine translation) – “Revision memory” tool (≈ “translation memory”) – Applicable to general and technical language When integrating terminologies, it helps writing in technical domains (e.g. student texts - ReWriter or legal texts - EXPERT) 30
  • 30. ReEscreve: Suggestions for Text Rewriting Paraphrases of SVC presented by ReEscreve’s paraphrasing system
  • 31. ReEscreve: a Rewritten Text Text rewritten based on the user’s preferences Users can suggest new expressions!
  • 32. Suggestions for Text ReWriting Suggestions for general language linguistic phenomena: Compound adverbs > single adverbs; Relatives > participial adjectives; Support verb constructions > single verbs
  • 33. Selection of paraphrasing grammars for specific linguistic phenomena Users can select among general and technical dictionaries (more than one selection allowed) and grammars for specific linguistic transformations (one, several or all grammars can be selected). The interface provides sample texts for testing. Informative details about the selected linguistic resources. Sample LEGAL text
  • 34. Selection of a Domain Dictionary Identification of legal terms in the text. Suggestions for the term “breach of law”. Users can select one term from the list of suggestions or provide a new suggestion
  • 35. Suggestions provided and user’s capability to add new rewriting options The user can suggest new words or expressions (synonyms or paraphrases). It is possible to go back and change the user’s choice as many times as necessary. Text rewritten • In red, the expressions in the source text • In green, suggestions provided by SPIDER and selected by the user
  • 36. ParaMT: a Paraphraser Applicable to MT PT support verb construction > EN verbs MACHINE TRANSLATION Recognition of Portuguese SVC and translation into English verbs
  • 37. Selected Publications on Paraphrasing Applications Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision - Applicability in Machine Translation Pre-Editing". Computational Linguistics and Intelligent Text Processing. Proceedings of the 12th International Conference 6609 (2011), pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9. Part II, Lecture Notes in Computer Science. Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008) Vol. 5190, (Aveiro, Portugal, 8-10 September 2008), Springer Verlag. Lecture Notes in Computer Science, pp. 202-211. Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets & Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa, Ontario, Canada, 29 August 2009), pp. 1-8.
  • 38. OpenLogos for Indian Languages • Anusaaraka group at LTRC, IIIT-Hyderabad – Integrating OpenLogos in their English to Hindi Language accessor – An OpenLogos-based English-Hindi MT prototype is already functional, but needs refinement before release. Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based Machine Translation System". In Proceedings of 2010 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing, China, Aug 21-23, 2010. • Kalinga Institute of Industrial Technology, KIIT – Setting up a research lab with MT based on OpenLogos technology
  • 39. Other Efforts with OpenLogos • Department of Political, Social and Communication Sciences, University of Salerno – PhD dissertation where the OpenLogos English-Italian SEMTAB rules methodology was applied, supported with the NooJ NLP environment to represent the theoretical and methodological principles of the Lexicon-Grammar Theory. Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and using linguistic resources for multi-word unit processing in Machine Translation • Main Southern African universities – Initial efforts to bring OpenLogos as an MT platform for translation between English and the African languages (scarce resources, lack of parallel corpora, etc.) in an initiative similar to the one for Indian languages
  • 40. OpenLogos Resources at DFKI • The Language Technology Lab of DFKI has adapted OpenLogos from the commercial Logos System • Also available on SourceForge under a GPL license: http://openlogos-mt.sourceforge.net/ • OpenLogos employs only open source components: – Use of open source development tools and compilers, such as GCC – Replacement of non-open code and libraries – Use of open source databases instead of a commercial database. All language-specific resources have been converted to PostgreSQL – Use of open standards instead of vendor-specific protocols – As a proof of concept for the software migration, Linux is used as the target platform for the first open source release of Logos
  • 41. OpenLogos Components • Core code libraries of the server side system and basic executables to start and run the system (APITest, logos_batch) • Resources, such as analysis (RES) and transfer (TRAN) grammars for source and target languages, and a multi-language dictionary database • Tools: LogosTermBuilder, User administration (LogosAdmin), Command line tools (APITest, openlogos), and multi-user GUI for initiating and inspecting translation jobs and results (LogosTransCenter)
  • 42. DFKI User Assistance with OpenLogos • DFKI hosts an open OpenLogos mailing list dedicated to discussion and exchange of information concerning OpenLogos developments and problems at: http://www.dfki.de/mailman/listinfo/openlogos-list • LinkedIn Discussion Group on OpenLogos Machine Translation • OpenLogos Facebook page
  • 43. Selected Publications A few publications and technical papers are available that describe: • the SAL representation language • the system architecture and workflow. Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based Machine Translation: Philosophy, Model, Resources, and Customization. In Machine Translation, volume 25 number 2, pages 107-126, Springer, Heidelberg, 2011. ISSN: 0922-6567. DOI: 10.1007/s10590-011-9091-z. Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis M. Tyers. Alicante, Spain: Universidad de Alicante. Departamento de Lenguajes y Sistemas Informáticos. 2–3 November 2009, pp. 19–26. Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18 (2003), pp. 1–72.
  • 44. Towards OpenLogos Hybrid Translation Anabela Barreiro INESC-ID anabela.barreiro@inesc-id.pt