Towards OpenLogos Hybrid Machine Translation - Anabela Barreiro

Towards OpenLogos Hybrid Translation
Anabela Barreiro
INESC-ID
anabela.barreiro@inesc-id.pt

1

Introduction with Contextual Information

 Research goals
– OpenLogos – 1st hybrid open source machine translation solution

– Hybridization of the OpenLogos system consists in embedding linguistic
knowledge into statistical machine translation (SMT)

 The timing is just right…
– Recognition by SMT researchers and developers of the need to integrate
linguistic knowledge in machine translation (MT) systems

– Benefit from cloud computing, big data and advanced alignment techniques,
which contribute to an easier and faster development of new language pairs

– Use crowdsourcing support to increase MT quality

2


 The ideal platform for hybrid translation
– Logos legacy (one of the first RBMT systems - 1970)

– Logos Corporation – one of the longest-running commercial MT companies in the
world (in business for over 30 years)

– The Logos MT product put its emphasis on semantic understanding

– The Logos approach was through linguistic analysis of English to render it in a
form that was “understood” by the computing system

– To a certain extent, the Logos approach is similar in spirit to the SMT approach,
and complements SMT by providing answers that help overcome statistical
weaknesses

3


 The open source initiative
– OpenLogos is publicly available as open source software

– It has some enthusiastic advocates and fervent supporters in different parts of the
world  who believe that:

• OpenLogos will be used as the rule-based component of a new linguistically
enhanced hybrid translation system

• The open source components of OpenLogos will help the NLP/CL research
community make scientific advances

4

Presentation Outline

 Background on OpenLogos MT

 System pipeline architecture

 SAL representation language

 Classic problems with rule-driven systems

 How SAL benefits translation

 Advantages of the OpenLogos architecture

 Uniqueness of the OpenLogos MT system

 Exploiting OpenLogos resources for new applications

 Availability of OpenLogos free resources

5

Background to OpenLogos

 Open source copy of the Logos system (1970-2001) adapted by DFKI
– Developed in US, Germany, Italy

– 25-100 development staff for 30 years

– Over 80 million US dollars of investment

 8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT,
GE-EN, GE-FR, GE-IT

 Commercial product was considered high quality

 Industrial strength MT used successfully in 12 countries

 Users included: Ericsson of Sweden, the Canadian Secretary of State, SAP,
Siemens-Nixdorf, Océ Netherlands, and Union Fenosa

6

OpenLogos Characteristics

 Multi-target System
– One source language analysis can generate any number of targets

 Pipeline Architecture

 Language-neutral Software
– All linguistic knowledge is in data files, stored in a relational database

 Semantico-Syntactic Abstraction Language (SAL Representation)
– Taxonomy-ontology

– NL sentences entering the system are immediately converted into SAL sentences

– SAL is the driving force of the OpenLogos process

 Semantic Processing
– Semantic Table (= SEMTAB) containing thousands of transformation rules
7

OpenLogos Pipeline Architecture

[Diagram: Input → Format → RES1 → RES2 → P1 → P2 → P3 → P4 on the source side, drawing on SAL rules and SEMTAB; S; T1 to T4 and GEN on the target side, drawing on target rules and SEMTAB (shown once per target); Format → Output]

• Highly Modular
• Incremental Processing
• Multi-Target System
• Bottom-up Analysis
• Deterministic Parse

8

Incremental Source Analysis - 1

[Diagram: Enter Pipeline → Format → RES1 → RES2, drawing on SAL rules and SEMTAB]

RES1 and RES2 perform:
• Clause Segmentation
• Homograph Resolution
– ways of cooking lentils - V
– types of [cooking utensils] - ADJ

Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)

9

Incremental Source Analysis - 2

[Diagram: Parse1 → Parse2 → Parse3 → Parse4 → S, drawing on SAL rules and SEMTAB]

Parse1
• Simple NP
• Semantic resolution

Parse2
• NP Prep NP
• Relative clauses
• Semantic resolution

Parse3
• Verb semantics
• Simple clauses
• Semantic resolution

Parse4
• Complex NP
• Order in complex sentences
• Semantic resolution

E.g.: a book on the presidency (on = about; concerning)
≠ a book on the table (on = over)

10

SAL Representation Language

SAL - Semantico-syntactic Abstraction Language

 SAL Taxonomy: 3 levels organized hierarchically

– Supersets / Sets / Subsets

 Semantico-Syntactic continuum from NL word to Word Class
– Literal word: airport
– Head morph: port
– SAL Subset: Agfunc (agentive functional location)
– SAL Set: func (functional location)
– SAL Superset: PL (place)
– Word Class: N

Both Pipeline Input Stream and Rulebases are expressed in SAL
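
The continuum above can be pictured as a record that carries one word at every level of abstraction, so that a rule element may match at whichever level it names. The sketch below is only an illustration: the class name SalEntry and the helper matches_level are hypothetical and not part of OpenLogos; the values for "airport" are the ones listed on this slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SalEntry:
    """One word seen along the continuum from literal word to word class."""
    literal: str
    head_morph: str
    subset: str
    sal_set: str
    superset: str
    word_class: str

    def levels(self):
        # Ordered from most specific (literal) to most abstract (word class).
        return [self.literal, self.head_morph, self.subset,
                self.sal_set, self.superset, self.word_class]


def matches_level(entry, value):
    """True if `value` names the entry at any level of abstraction."""
    return value in entry.levels()


# Values taken from the slide above.
airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")
```

A pattern element written at the superset level (PL) and one written at the literal level (airport) would both accept this entry, which is the point of keeping all levels side by side.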

11

SAL Noun Supersets

Developed:
- inductively
- by trial and error
- over a period of years
- by the development team

E.g.: two pieces of cake
NP parse must have:
- Plural morphology of pieces
- Semantics of cake

12

Abstract Noun Taxonomy
Abstract Noun Superset
→ Non-verbal Abstract Set → Non-verbal Subsets: Classifications
→ Verbal Abstract Set → Verbal Subsets: Methods / Procedures

13

Use of SAL Codes to Resolve Homographs

Is the word cooking a verb or an adjective?

ways of cooking lentils
types of cooking utensils

ways → N(AB/method) → parser verb bias
types → N(AB/class) → non-verb bias

The SAL code N(AB/method) in the rule matches on a similar code in the SAL input
stream.

The effect of such a match is to resolve cooking as a verb.

SAL contributes to the resolution of the homograph.
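
A minimal sketch of the idea on this slide, assuming a toy lexicon and a toy bias table: the SAL code of the governing noun decides how the following -ing form is read. The names LEXICON, BIAS and resolve_ing are hypothetical; only the two SAL codes and the two example phrases come from the slide.

```python
# SAL codes of the governing nouns, as given on the slide.
LEXICON = {
    "ways": ("AB", "method"),
    "types": ("AB", "class"),
}

# Hypothetical bias table: SAL code of the head noun -> reading of "N of X-ing ...".
BIAS = {
    ("AB", "method"): "V",    # parser verb bias
    ("AB", "class"): "ADJ",   # non-verb bias
}


def resolve_ing(phrase):
    """Resolve the -ing word in 'HEAD of X-ing NOUN' as V or ADJ."""
    head = phrase.split()[0]
    return BIAS[LEXICON[head]]
```

Calling resolve_ing("ways of cooking lentils") follows the verb bias of N(AB/method), while resolve_ing("types of cooking utensils") follows the non-verb bias of N(AB/class).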
14

What SAL Rules Look Like

Rules Have Five Components
 SAL Pattern
– PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
 Constraints
– The rule matches only if its stated conditions test true or false as required
 Source Actions
– RES Rulebase: Resolves syntactic ambiguity
– PARSE Rulebase: Creates parse tree
– SEMTAB Rules: Effects semantic disambiguation
 Target Action (optional)
– Effects syntactic and/or semantic transfer
 Comment Line
– PARSE2 example: NP(info) Prep(“on”) NP → N1 “about” N2
E.g., book on political satire → book about ....
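
The five components can be sketched as a single record. This is an illustrative data structure, not the OpenLogos rule format: the class Rule, the token layout, the source-action string and the SAL code "XX" for "presidency" are assumptions, and the "u" slots of the pattern are read here as "unspecified" and represented as None.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str, str]  # (word, word class, SAL code), simplified


@dataclass
class Rule:
    pattern: List[Tuple[str, Optional[str]]]   # (word class, SAL code or literal)
    constraint: Callable[[List[Token]], bool]  # extra conditions on the match
    source_action: str                         # what the rule does to the parse
    target_action: Optional[Callable[[List[Token]], str]]  # optional transfer
    comment: str                               # human-readable summary

    def matches(self, tokens: List[Token]) -> bool:
        if len(tokens) != len(self.pattern):
            return False
        for (word, wc, sal), (p_wc, p_val) in zip(tokens, self.pattern):
            if wc != p_wc:
                return False
            # None means the slot is unspecified and accepts anything.
            if p_val is not None and p_val not in (word, sal):
                return False
        return self.constraint(tokens)


book_on = Rule(
    pattern=[("N", "IN/data"), ("Prep", "on"), ("N", None)],
    constraint=lambda toks: True,               # no extra condition in this toy rule
    source_action="attach PP to N1",            # hypothetical wording
    target_action=lambda toks: f"{toks[0][0]} about {toks[2][0]}",
    comment='NP(info) Prep("on") NP -> N1 "about" N2',
)

# "XX" is a placeholder SAL code; the slide does not give one for "presidency".
tokens = [("book", "N", "IN/data"), ("on", "Prep", ""), ("presidency", "N", "XX")]
```

With these tokens, book_on.matches(tokens) succeeds and book_on.target_action(tokens) renders the phrase with "about", as in the comment line.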

15

Classic Problem of RBMT

 Complexity
– Logic saturation
– Rulebase grows too large
– Performance degradation
– Difficult maintainability
– System improvability stasis

 Ambiguity
– Quality/accuracy of output – depends on effective disambiguation
– Effective disambiguation causes rulebase growth

 Classic Dilemma of the Developer
– Reducing rulebase size to relieve complexity weakens disambiguation
– Increasing rulebase size to address ambiguities increases complexity
16

How OpenLogos Addresses Complexity and
Ambiguity

 Complexity
– Rules and input stream are expressed as SAL patterns

– Homogeneous ‘apples-to-apples’ matching

– Rules are SAL patterns stored/organized in an indexed pattern dictionary

– SAL input stream serves as search argument to SAL rulebase

– No limit on rulebase size and no impact on performance

– Rules are self organizing

– Rulebase is easy to maintain
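
One way to picture "the SAL input stream serves as search argument" is a dictionary keyed on the first element of each pattern, so that only rules that could match at the current position are ever inspected. The sketch below makes that assumption; the slides do not describe the actual indexing scheme, and the class RuleIndex, the rule names and the longest-first ordering are all hypothetical.

```python
from collections import defaultdict


class RuleIndex:
    """Toy pattern dictionary: rules are filed under their first SAL element."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, pattern, name):
        self._by_key[pattern[0]].append((tuple(pattern), name))

    def candidates(self, stream, pos):
        """Rules whose whole pattern matches the stream at pos, longest first."""
        found = []
        # Only the bucket for the current stream element is looked at.
        for pattern, name in self._by_key.get(stream[pos], []):
            if tuple(stream[pos:pos + len(pattern)]) == pattern:
                found.append((pattern, name))
        return sorted(found, key=lambda rule: len(rule[0]), reverse=True)


index = RuleIndex()
index.add(["N", "Prep", "N"], "np_prep_np")
index.add(["N"], "simple_np")
index.add(["V", "N"], "verb_object")

stream = ["N", "Prep", "N", "V", "N"]
```

Adding more rules under other keys does not change the work done for a given stream position, which is the property the slide attributes to the indexed rulebase.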

17

How Rules Are Applied

Metaphor: biological neural net
As the analysis progresses:
1- cells become fewer (abstract nature of the parse)
2- vectors become lighter (semantic disambiguation)

– Vectors labeled V1-V6 = SAL input stream of the pipeline
– Cells in input vectors = SAL elements/words to which the NL input stream has been
converted
– In this network, R1 through P4 = hidden layers containing SAL rules
– R1 represents RES1, P1 represents Parse1 and so on.
– Each hidden layer contains between 2,000 and 4,000 rules, organized by their SAL
pattern, as in a dictionary.
18

How Rules Are Applied

Metaphor: biological neural net

 Chief similarity
– Efficient interaction between the SAL input stream and the rules of the
hidden layers

– Only those rules which should be looked at are accessed

– The developer does not need to develop metarules or discrimination
networks to achieve efficiency in rule matching

– Efficiency in rule matching is an automatic by-product of system design

19

How OpenLogos Addresses Complexity and
Ambiguity

 Ambiguity

– Syntactic Homograph Resolution

– Scoping of adjectives, prepositions

– Polysemy

20

Resolution of Polysemy in OpenLogos

SAL Representation Language in interaction with SEMTAB

SEMTAB provides a transfer that overrides the default dictionary transfer
for the verb “raise”

NL String → SEMTAB Rule → Portuguese Transfer
raise a child → V(‘raise’) N(ANdes) → criar...
raise corn → V(‘raise’) N(MAedib) → cultivar...
raise the rent → V(‘raise’) N(MEabs) → aumentar...
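
The override behaviour can be sketched as a lookup that tries a SEMTAB rule first and falls back to the dictionary default. The three SEMTAB entries and the SAL codes of the object nouns are taken from the table above; the default transfer "levantar" and the function name transfer_verb are assumptions made for the example.

```python
# Hypothetical default dictionary transfer for the verb.
DEFAULT_TRANSFER = {"raise": "levantar"}

# SEMTAB rules from the slide: (verb, SAL code of the object) -> Portuguese verb.
SEMTAB = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

# SAL codes of the example objects, as implied by the table above.
NOUN_SAL = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}


def transfer_verb(verb, obj):
    """Return the SEMTAB transfer if a rule matches, else the default."""
    sal = NOUN_SAL.get(obj)
    return SEMTAB.get((verb, sal), DEFAULT_TRANSFER[verb])
```

An object with no matching rule, such as a noun missing from NOUN_SAL, simply receives the default transfer.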

21

Deep Structure Rules of SEMTAB

A single deep-structure rule matches multiple surface-structures
and produces correct target transfers

he raised the rent → ele aumentou a renda (V+Object)
the raising of the rent → o aumento da renda (Gerund)
the rent, raised by … → a renda, aumentada por… (Part. ADJ)
a rent raise → um aumento de renda (Noun)
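
A toy illustration of one deep-structure rule serving several surface structures: each surface form is first reduced to the same key (verb, SAL code of the object), and that single key then selects the target form appropriate to the surface type. The regular expressions, the names SURFACE, DEEP_RULES, analyse and transfer, and the continuation "the landlord" used in the test phrase are assumptions; the Portuguese forms are those shown above.

```python
import re

# Hypothetical surface recognisers, one per construction on the slide.
SURFACE = [
    ("V+Object", re.compile(r"^he raised the (\w+)$")),
    ("Gerund", re.compile(r"^the raising of the (\w+)$")),
    ("Part. ADJ", re.compile(r"^the (\w+), raised by\b")),
    ("Noun", re.compile(r"^a (\w+) raise$")),
]

NOUN_SAL = {"rent": "MEabs"}

# A single deep rule, keyed on (verb, object SAL code), with target forms
# of "aumentar" for each surface type.
DEEP_RULES = {
    ("raise", "MEabs"): {
        "V+Object": "aumentou",
        "Gerund": "aumento",
        "Part. ADJ": "aumentada",
        "Noun": "aumento",
    }
}


def analyse(text):
    """Reduce a surface string to (deep key, surface type)."""
    for label, pattern in SURFACE:
        match = pattern.match(text)
        if match:
            return ("raise", NOUN_SAL[match.group(1)]), label
    raise ValueError(f"no surface pattern for: {text}")


def transfer(text):
    key, label = analyse(text)
    return DEEP_RULES[key][label]
```

All four example strings reach the same entry in DEEP_RULES, so adding a new sense of "raise" means adding one rule rather than four.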

22

How SAL Benefits Translation

Examples showing
voice transformations

EN passive voice >>> FR active voice

The situation was alluded to by my friend in his letter
Mon ami a fait allusion à la situation dans sa lettre

The situation was alluded to in their letter
On a fait allusion à la situation dans leur lettre

Voice transformations are possible due to:
• incremental pipeline approach
• strong semantic sensitivity

23

Advantages of OpenLogos
Machine Translation Architecture

 Creation of systems involving small or neglected/endangered languages
– not targeted by commercial programs
– to fulfil the goals of administrations and NGOs dealing with these
languages, contributing to their promotion and/or revival
 Freely available
– any user can access the technology
 Customizable - institutions or businesses adopting an open-source MT can
customize the system to their needs in many ways
– developing new linguistic data (vocabularies, rules, corpora)
– integrating system/data with other packages
– etc.

24

OpenLogos Uniqueness

 Extensible dictionaries with underlying semantic foundation
 Analyses whole source sentences, considering:
– Morphology
– Meaning (semantics)
– Grammatical structure and function
 Semantico-Syntactic Abstraction Language (SAL)
– the parser is able to achieve better results than syntactic analysis alone
would allow.
 Parsing is only source language specific; generation is target language
specific
 Originally a transfer approach, evolved to the present system (which has
interlingual features inherent to the system)

25

OpenLogos Uniqueness

 OpenLogos' comprehensive analysis makes it possible to construct a complete and
idiomatically correct translation in the target language
 OpenLogos is suitable for research and academic use
– make OpenLogos the standard MT platform for universities, education and
other governmental institutions
– bring new life into a dormant technology (Phoenix rising metaphor)
 OpenLogos linguistic data representation can be established as the
foundation
– freely available for private and commercial use
– there is still need for the provision of linguistic and technical services
and/or customer support on a fee basis
– packaging OpenLogos with the top five Linux distributions will generate a
constant revenue stream
 OpenLogos is an ideal platform for a hybrid MT solution

26

Contribution of OpenLogos Resources for New NLP
Applications

Initially, OpenLogos EN-PT dictionary data were adapted and enhanced
with new properties (derivational, etc.) to create a new resource:
Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/).
ReEscreve uses Port4NooJ.
 SPIDER
– System for Paraphrasing In Document Editing and Revision.
– Based on NooJ’s technology (http://www.nooj4nlp.net/)
– Publicly available at: http://www.linguateca.pt/ReEscreve/
– Designed to help with writing optimization, but its applicability extends to MT
pre-editing.

 1st version – ReEscreve (for Portuguese) and ReWriter (for English)
 2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
Designed for integration in a cyber school project within the scope of an
educational program to teach students how to improve their writing skills in
the Portuguese language
 EXPERT (prototype) - to assist writing of domain-specific texts
27

Contribution of OpenLogos Resources for New NLP
Applications

 ParaMT
– Bilingual/multilingual paraphraser (translator prototype)
– Uses similar methodology to that employed by SPIDER
– Uses bilingual data
– Directly applicable to MT

 Corpógrafo
– Multilingual corpora management tool
– Available at: http://www.linguateca.pt/corpografo/

28

Uses of SPIDER

– Authoring aid (word processing applications)
– Language composition tool
– Text production and style editor
– Empirical testbed for linguistic quality assurance
– Text (pre-)editing (machine translation)
– “Revision memory” tool (≈ “translation memory”)
– Applicable to general and technical language
When integrating terminologies, it helps writing in technical domains
(e.g. student texts - ReWriter or legal texts - EXPERT)

30

ReEscreve: Suggestions for Text Rewriting

Paraphrases of SVC
presented by
ReEscreve’s
paraphrasing system

31

ReEscreve: a Rewritten Text

Text rewritten based
on the user’s
preferences

Users can suggest
new expressions!

32

Suggestions for Text ReWriting

Suggestions for general language
linguistic phenomena

Compound adverbs
> single adverbs

Relatives > participial
adjectives

Support verb constructions
> single verbs

34

Selection of paraphrasing grammars for specific
linguistic phenomena
Users can select among general and technical dictionaries (more than one selection allowed),
grammars for specific linguistic transformations (one, several or all grammars can be selected).
The interface provides sample texts for testing.

Informative details about the
linguistic resources selected

Sample LEGAL
text

35

Selection of a Domain Dictionary

Identification of legal terms in the text

Suggestions for the term “breach of
law”

Users can select one term from the list of suggestions or provide a new
suggestion
36

Suggestions provided and user’s capability to add
new rewriting options

The user can suggest new words or
expressions (synonyms or paraphrases)

It is possible to go back and change the
user option as many times as necessary

Text rewritten
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user

37

ParaMT: a Paraphraser Applicable to MT

PT support verb construction > EN verbs

MACHINE
TRANSLATION

Recognition of
Portuguese SVC
and translation
into English verbs

38

Selected Publications on Paraphrasing Applications

Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision -
Applicability in Machine Translation Pre-Editing". In Computational Linguistics and
Intelligent Text Processing, Proceedings of the 12th International Conference, Part II.
Lecture Notes in Computer Science 6609 (2011), pp. 365-376. Springer. ISSN: 0302-9743.
e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9.

Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera
Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational
Processing of the Portuguese Language, 8th International Conference, Proceedings
(PROPOR 2008), Aveiro, Portugal, 8-10 September 2008. Lecture Notes in Computer
Science, vol. 5190, pp. 202-211. Springer Verlag.

Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose
paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets &
Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New
Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa,
Ontario, Canada, 29 August 2009), pp. 1-8.

39

OpenLogos for Indian Languages

 Anusaaraka group at LTRC, IIIT-Hyderabad

– Integrating OpenLogos in their English to Hindi Language accessor

– An OpenLogos-based English-Hindi MT prototype is already functional,
but needs refinement before release

Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based
Machine Translation System". In Proceedings of 2010 IEEE International Conference on
Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing,
China, 21-23 August 2010.

 Kalinga Institute of Industrial Technology, KIIT

– Setting up a research lab with MT based on OpenLogos technology

40

Other Efforts with OpenLogos

 Department of Political, Social and Communication Sciences,
University of Salerno

– PhD dissertation where the OpenLogos English-Italian SEMTAB rules
methodology was applied, supported with the NooJ NLP environment to
represent the theoretical and methodological principles of the Lexicon-
Grammar Theory

Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and
using linguistic resources for multi-word unit processing in Machine Translation

 Major Southern African universities

– Initial efforts to introduce OpenLogos as an MT platform for translation
between English and the African languages (scarce resources, lack of
parallel corpora, etc.) in an initiative similar to the one undertaken for Indian
languages
41

OpenLogos Resources at DFKI

 The Language Technology Lab of DFKI has adapted OpenLogos from the
commercial Logos System

 Also at Sourceforge under a GPL license
http://openlogos-mt.sourceforge.net/

 OpenLogos employs only open source components:

– Use of open source development tools and compilers, such as GCC
– Replacement of non-open code and libraries
– Use of open source databases instead of a commercial database. All
language specific resources have been converted to PostgreSQL
– Use of open standards instead of vendor specific protocols
– As a proof of concept for the software migration, Linux is used as target
platform for the first open source release of Logos

42

OpenLogos Components

 Core code libraries of the server side system and basic executables to start
and run the system (APITest, logos_batch)

 Resources, such as analysis (RES) and transfer (TRAN) grammars for
source and target languages, and a multi-language dictionary database

 Tools: LogosTermBuilder, User administration (LogosAdmin), Command
line tools (APITest, openlogos), and multi-user GUI for initiating and
inspecting translation jobs and results (LogosTransCenter)

43

DFKI User Assistance with OpenLogos

 DFKI hosts an open OpenLogos mailing list dedicated to discussion
and exchange of information concerning OpenLogos developments and
problems at:

http://www.dfki.de/mailman/listinfo/openlogos-list

 LinkedIn Discussion Group on OpenLogos Machine Translation

 OpenLogos Facebook page

44

Selected Publications

A few publications and technical papers are available with description of

 the SAL representation language

 the system architecture and workflow

Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based
Machine Translation: Philosophy, Model, Resources, and Customization. In Machine
Translation, vol. 25, no. 2, pp. 107-126, Springer, Heidelberg, 2011. ISSN: 0922-6567.
DOI: 10.1007/s10590-011-9091-z

Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language.
In Proceedings of the First International Workshop on Free/Open-Source Rule-Based
Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis
M. Tyers. Alicante, Spain: Universidad de Alicante. Departamento de Lenguajes y Sistemas
Informáticos. 2–3 November 2009, pp. 19–26

Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18
(2003), pp. 1–72.
45

