2. Introduction with Contextual Information
• Research goals
  – OpenLogos: the first hybrid open source machine translation solution
  – Hybridization of the OpenLogos system consists of embedding linguistic
    knowledge into statistical machine translation (SMT)
• The timing is just right…
  – Recognition by SMT researchers and developers of the need to integrate
    linguistic knowledge in machine translation (MT) systems
  – Benefit from cloud computing, big data and advanced alignment techniques,
    which contribute to easier and faster development of new language pairs
  – Use crowdsourcing support to increase MT quality
3. Introduction with Contextual Information
• The ideal platform for hybrid translation
  – Logos legacy (one of the first RBMT systems, 1970)
  – Logos Corporation: one of the longest-running commercial MT companies in
    the world (in business for over 30 years)
  – The Logos MT product put its emphasis on semantic understanding
  – The Logos approach was to analyze English linguistically and render it in a
    form that was “understood” by the computing system
  – To a certain extent, the Logos approach is similar in spirit to the SMT
    approach, and complements SMT by providing answers that help overcome
    statistical weaknesses
4. Introduction with Contextual Information
• The open source initiative
  – OpenLogos is publicly available as open source software
  – It has enthusiastic advocates and fervent supporters in different parts of
    the world, who believe that:
    • OpenLogos will be used as the rule-based component of a new
      linguistically enhanced hybrid translation system
    • The open source components of OpenLogos will help the NLP/CL research
      community make scientific advances
5. Presentation Outline
• Background on OpenLogos MT
• System pipeline architecture
• SAL representation language
• Classic problems with rule-driven systems
• How SAL benefits translation
• Advantages of the OpenLogos architecture
• Uniqueness of the OpenLogos MT system
• Exploiting OpenLogos resources for new applications
• Availability of OpenLogos free resources
6. Background to OpenLogos
• Open source copy of the Logos system (1970-2001) adapted by DFKI
  – Developed in the US, Germany and Italy
  – 25-100 development staff for 30 years
  – Over 80 million US dollars of investment
• 8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT,
  GE-EN, GE-FR, GE-IT
• Commercial product was considered high quality
• Industrial-strength MT used successfully in 12 countries
• Users included: Ericsson of Sweden, the Canadian Secretary of State, SAP,
  Siemens-Nixdorf, Oce Netherlands, and Union Fenosa
7. OpenLogos Characteristics
• Multi-target System
  – One source language analysis can generate any number of targets
• Pipeline Architecture
• Language-neutral Software
  – All linguistic knowledge is in data files, stored in a relational database
• Semantico-Syntactic Abstraction Language (SAL Representation)
  – Taxonomy-ontology
  – NL sentences entering the system are immediately converted into SAL sentences
  – SAL is the driving force of the OpenLogos process
• Semantic Processing
  – Semantic Table (= SEMTAB) containing thousands of transformation rules
8. OpenLogos Pipeline Architecture
[Figure: pipeline diagram with the stages Input, Format, RES1, RES2, P1-P4, S,
T1-T4, GEN, Format, Output; SAL Rules and SEMTAB on the source side, Target
Rules and SEMTAB for each target]
• Highly Modular
• Incremental Processing
• Multi-Target System
• Bottom-up Analysis
• Deterministic Parse
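The multi-target pipeline idea can be sketched in a few lines of Python. This is only an illustration, not the OpenLogos implementation: the stage order is read off the figure labels and is an assumption, and `make_stage`, `run` and `translate` are hypothetical names. Each placeholder stage merely records that it ran, so the trace shows the source analysis happening once and being reused for every target.

```python
from typing import Callable, Dict, List

Stream = List[str]
Stage = Callable[[Stream], Stream]

def make_stage(name: str) -> Stage:
    """Return a placeholder stage that only appends its own name to the trace."""
    def stage(stream: Stream) -> Stream:
        return stream + [name]  # new list, so the input trace is never mutated
    return stage

# Assumed order of the source-side stages (taken from the figure labels).
SOURCE_STAGES = [make_stage(n) for n in ("Format", "RES1", "RES2", "P1", "P2", "P3", "P4")]
# Assumed order of the target-side stages, instantiated once per target language.
TARGET_STAGE_NAMES = ("T1", "T2", "T3", "T4", "GEN", "Format")

def run(stages: List[Stage], stream: Stream) -> Stream:
    """Feed the stream through the stages in order (incremental processing)."""
    for stage in stages:
        stream = stage(stream)
    return stream

def translate(trace: Stream, targets: List[str]) -> Dict[str, Stream]:
    """Analyse the source once, then run the target stages for each language."""
    analysed = run(SOURCE_STAGES, trace)
    results: Dict[str, Stream] = {}
    for lang in targets:
        target_stages = [make_stage(f"{name}[{lang}]") for name in TARGET_STAGE_NAMES]
        results[lang] = run(target_stages, analysed)
    return results

if __name__ == "__main__":
    for lang, trace in translate([], ["FR", "PT"]).items():
        print(lang, " -> ".join(trace))
```

Both target traces share the same seven source-side entries, which is the point of the "one source analysis, any number of targets" design.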
9. Incremental Source Analysis - 1
[Figure: entry to the pipeline, showing the Format, RES1 and RES2 stages with
SAL Rules and SEMTAB]
• Clause Segmentation
• Homograph Resolution
  – ways of cooking lentils: cooking = V
  – types of [cooking utensils]: cooking = ADJ
Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)
10. Incremental Source Analysis - 2
[Figure: the Parse1 to Parse4 stages, with SAL Rules and SEMTAB, leading to S.
Labels in the figure: Simple NP; NP Prep NP; Relative clauses; Complex NP; Verb
semantics; Simple clauses; Order in complex sentences; Semantic resolution]
E.g.: a book on the presidency (on = about; concerning)
  vs. a book on the table (on = over)
11. SAL Representation Language
SAL - Semantico-syntactic Abstraction Language
• SAL Taxonomy: 3 levels organized hierarchically
  – Supersets / Sets / Subsets
• Semantico-Syntactic continuum from NL word to Word Class
  – Literal word: airport
  – Head morph: port
  – SAL Subset: Agfunc (agentive functional location)
  – SAL Set: func (functional location)
  – SAL Superset: PL (place)
  – Word Class: N
Both Pipeline Input Stream and Rulebases are expressed in SAL
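As a rough illustration of the continuum above, the "airport" example can be held in a small record that runs from the literal word up to the word class. The class and method names (`SalEntry`, `levels`, `matches`) are hypothetical and not part of OpenLogos; the sketch only shows how a rule could address a word at any level of abstraction.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class SalEntry:
    """One word described at every level of the SAL continuum."""
    literal: str      # literal word
    head_morph: str   # head morph
    subset: str       # SAL Subset
    sal_set: str      # SAL Set
    superset: str     # SAL Superset
    word_class: str   # Word Class

    def levels(self) -> List[str]:
        """Return the codes from most specific to most general."""
        return [self.literal, self.head_morph, self.subset,
                self.sal_set, self.superset, self.word_class]

    def matches(self, code: str) -> bool:
        """A rule element matches if it names any level of this entry."""
        return code in self.levels()

# The example from the slide.
airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")

if __name__ == "__main__":
    print(airport.levels())
    print(airport.matches("PL"))
```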
12. SAL Noun Supersets
Developed:
- inductively
- by trial and error
- over a period of years
- by the development team
E.g.: two pieces of cake
NP parse must have:
- Plural morphology of pieces
- Semantics of cake
14. Use of SAL Codes to Resolve Homographs
Is the word cooking a verb or an adjective?
  ways of cooking lentils
  types of cooking utensils
  ways → N(AB/method) → parser verb bias
  types → N(AB/class) → non-verb bias
SAL contributes to the resolution of the homograph.
The SAL code N(AB/method) in the rule matches on a similar code in the SAL
input stream. The effect of such a match is to resolve cooking as a verb.
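The bias mechanism described above can be sketched as a lookup on the SAL code of the preceding noun. This is a simplified illustration, not the RES rulebase: the two-word lexicon and the function name `resolve_ing_form` are hypothetical, and only the two SAL codes shown on the slide are used.

```python
# SAL codes of the head nouns, as given on the slide.
HEAD_NOUN_CODES = {
    "ways": "N(AB/method)",
    "types": "N(AB/class)",
}

# Codes that bias the parser toward or away from a verb reading.
VERB_BIAS_CODES = {"N(AB/method)"}
NON_VERB_BIAS_CODES = {"N(AB/class)"}

def resolve_ing_form(head_noun: str) -> str:
    """Resolve an -ing homograph after '<head_noun> of' as V or ADJ."""
    code = HEAD_NOUN_CODES.get(head_noun)
    if code in VERB_BIAS_CODES:
        return "V"
    if code in NON_VERB_BIAS_CODES:
        return "ADJ"
    return "UNRESOLVED"  # no matching rule in this toy lexicon

if __name__ == "__main__":
    print("ways of cooking lentils:", resolve_ing_form("ways"))
    print("types of cooking utensils:", resolve_ing_form("types"))
```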
15. What SAL Rules Look Like
Rules Have Five Components
• SAL Pattern
  – PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
• Constraints
  – Match only if conditions are true or false
• Source Actions
  – RES Rulebase: Resolves syntactic ambiguity
  – PARSE Rulebase: Creates parse tree
  – SEMTAB Rules: Effects semantic disambiguation
• Target Action (optional)
  – Effects syntactic and/or semantic transfer
• Comment Line
  – PARSE2 example: NP(info) Prep(“on”) NP → N1 “about” N2
    E.g., book on political satire → book about ....
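The five components listed above map naturally onto a small record type. The sketch below is an assumption about how such a rule could be modelled, not the OpenLogos rule format: `SalRule` and `applies_to` are hypothetical names, the pattern strings are simplified, and the source-action text is an illustrative placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

@dataclass
class SalRule:
    """A rule with the five components named on the slide."""
    pattern: Sequence[str]                                   # SAL pattern
    constraints: List[Callable[[Sequence[str]], bool]] = field(default_factory=list)
    source_action: str = ""                                  # what the rule does on the source side
    target_action: Optional[str] = None                      # optional transfer
    comment: str = ""                                        # comment line

    def applies_to(self, window: Sequence[str]) -> bool:
        """True if the window equals the pattern and every constraint holds."""
        if list(window) != list(self.pattern):
            return False
        return all(check(window) for check in self.constraints)

# Simplified version of the PARSE2 example (a book on the presidency).
book_on = SalRule(
    pattern=("N(IN/data)", 'Prep("on")', "N"),
    source_action="attach the prepositional phrase to the first noun",  # placeholder
    target_action='N1 "about" N2',
    comment='NP(info) Prep("on") NP -> N1 "about" N2',
)

if __name__ == "__main__":
    print(book_on.applies_to(["N(IN/data)", 'Prep("on")', "N"]))
    print(book_on.applies_to(["N(PL)", 'Prep("on")', "N"]))
```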
16. Classic Problem of RBMT
• Complexity
  – Logic saturation
  – Rulebase grows too large
  – Performance degradation
  – Difficult maintainability
  – System improvability stasis
• Ambiguity
  – Quality/accuracy of output depends on effective disambiguation
  – Effective disambiguation causes rulebase growth
• Classic Dilemma of the Developer
  – Reducing rulebase size to relieve complexity weakens disambiguation
  – Increasing rulebase size to address ambiguities increases complexity
17. How OpenLogos Addresses Complexity and Ambiguity
• Complexity
  – Rules and input stream are expressed as SAL patterns
  – Homogeneous “apples-to-apples” matching
  – Rules are SAL patterns stored/organized in an indexed pattern dictionary
  – SAL input stream serves as search argument to SAL rulebase
  – No limit on rule size and no impact on performance
  – Rules are self-organizing
  – Rulebase is easy to maintain
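The idea of using the SAL input stream as the search argument into an indexed pattern dictionary can be sketched as follows. The class `PatternDictionary`, the indexing by first pattern element, and the longest-match preference are all assumptions made for illustration; the source does not describe the actual index structure.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], str]  # (SAL pattern, action label)

class PatternDictionary:
    """Rules indexed by the first element of their SAL pattern."""

    def __init__(self) -> None:
        self._by_first: Dict[str, List[Rule]] = defaultdict(list)

    def add(self, pattern: Sequence[str], action: str) -> None:
        self._by_first[pattern[0]].append((tuple(pattern), action))

    def candidates(self, stream: Sequence[str], pos: int) -> List[Rule]:
        """Only rules whose pattern starts with the current element are looked at."""
        return self._by_first.get(stream[pos], [])

    def match(self, stream: Sequence[str], pos: int) -> Optional[Rule]:
        """Return the longest candidate that matches the stream at pos, if any."""
        best: Optional[Rule] = None
        for pattern, action in self.candidates(stream, pos):
            if tuple(stream[pos:pos + len(pattern)]) == pattern:
                if best is None or len(pattern) > len(best[0]):
                    best = (pattern, action)
        return best

if __name__ == "__main__":
    rules = PatternDictionary()
    rules.add(("N", "Prep", "N"), "np-prep-np")
    rules.add(("N",), "simple-np")
    rules.add(("V", "N"), "verb-object")
    print(rules.match(["N", "Prep", "N", "V"], 0))
```

Because lookup starts from the current stream element, adding rules for other patterns does not lengthen the list of candidates that must be examined at a given position.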
18. How Rules Are Applied
Metaphor: biological neural net
[Figure: network with input vectors V1-V6 and hidden layers R1 through P4]
As the analysis progresses:
1- cells become fewer (abstract nature of the parse)
2- vectors become lighter (semantic disambiguation)
– Vectors labeled V1-V6 = SAL input stream of the pipeline
– Cells in input vectors = SAL elements/words to which the NL input stream has
  been converted
– In this network, R1 through P4 = hidden layers containing SAL rules
– R1 represents RES1, P1 represents Parse1, and so on
– Each hidden layer contains between 2,000 and 4,000 rules, organized by their
  SAL pattern, as in a dictionary
19. How Rules Are Applied
Metaphor: biological neural net
• Chief similarity
  – Efficient interaction between the SAL input stream and the rules of the
    hidden layers
  – Only those rules which should be looked at are accessed
  – The developer does not need to develop metarules or discrimination
    networks to achieve efficiency in rule matching
  – Efficiency in rule matching is an automatic by-product of system design
20. How OpenLogos Addresses Complexity and Ambiguity
• Ambiguity
  – Syntactic Homograph Resolution
  – Scoping of adjectives, prepositions
  – Polysemy
21. Resolution of Polysemy in OpenLogos
SAL Representation Language in interaction with SEMTAB
SEMTAB provides a transfer that overrides the default dictionary transfer
for the verb “raise”

NL String        →  SEMTAB Rule            →  Portuguese Transfer
raise a child    →  V(“raise”) N(ANdes)    →  criar...
raise corn       →  V(“raise”) N(MAedib)   →  cultivar...
raise the rent   →  V(“raise”) N(MEabs)    →  aumentar...
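The override behaviour in the table can be sketched as a two-step lookup: try a SEMTAB-style rule keyed on the verb and the SAL code of its object, and fall back to the default dictionary transfer otherwise. The names `SEMTAB_RULES`, `NOUN_CODES` and `transfer_verb` are hypothetical, and the default transfer "levantar" is an assumption added for illustration; the source does not state the default.

```python
from typing import Dict, Tuple

# Assumed default dictionary transfer (not given in the source).
DEFAULT_TRANSFER: Dict[str, str] = {"raise": "levantar"}

# SAL codes of the object nouns, as shown in the table.
NOUN_CODES: Dict[str, str] = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}

# SEMTAB-style rules: (verb, SAL code of object) -> Portuguese transfer.
SEMTAB_RULES: Dict[Tuple[str, str], str] = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

def transfer_verb(verb: str, object_noun: str) -> str:
    """Return the context-sensitive transfer if a rule matches, else the default."""
    code = NOUN_CODES.get(object_noun)
    if code is not None and (verb, code) in SEMTAB_RULES:
        return SEMTAB_RULES[(verb, code)]
    return DEFAULT_TRANSFER[verb]

if __name__ == "__main__":
    for noun in ("child", "corn", "rent", "hand"):
        print(f"raise {noun}: {transfer_verb('raise', noun)}")
```

Keying the rules on SAL codes rather than on literal nouns means one rule covers every noun that shares the code.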
22. Deep Structure Rules of SEMTAB
A single deep-structure rule matches multiple surface structures
and produces correct target transfers

he raised the rent        →  ele aumentou a renda       V+Object
the raising of the rent   →  o aumento da renda         Gerund
the rent, raised by …     →  a renda, aumentada por…    Part. ADJ
a rent raise              →  um aumento de renda        Noun
23. How SAL Benefits Translation
Examples showing voice transformations
EN passive voice >>> FR active voice
  The situation was alluded to by my friend in his letter
  Mon ami a fait allusion à la situation dans sa lettre
  The situation was alluded to in their letter
  On a fait allusion à la situation dans leur lettre
Voice transformations are possible due to:
• incremental pipeline approach
• strong semantic sensitivity
24. Advantages of OpenLogos Machine Translation Architecture
• Creation of systems involving small or neglected/endangered languages
  – not targeted by commercial programs
  – to fulfil the goals of administrations and NGOs dealing with these
    languages, contributing to their promotion and/or revival
• Freely available
  – any user can access the technology
• Customizable: institutions or businesses adopting an open-source MT can
  customize the system to their needs in many ways
  – developing new linguistic data (vocabularies, rules, corpora)
  – integrating system/data with other packages
  – etc.
25. OpenLogos Uniqueness
• Extensible dictionaries with underlying semantic foundation
• Analyses whole source sentences, considering:
  – Morphology
  – Meaning (semantics)
  – Grammatical structure and function
• Semantico-Syntactic Abstraction Language (SAL)
  – the parser is able to achieve better results than syntactic analysis alone
    would allow
• Parsing is specific to the source language only; generation is specific to
  the target language
• Originally a transfer approach, evolved into the present system (which has
  inherent interlingual features)
26. OpenLogos Uniqueness
• OpenLogos' comprehensive analysis makes it possible to construct a complete
  and idiomatically correct translation in the target language
• OpenLogos is suitable for research and academic use
  – make OpenLogos the standard MT platform for universities, education and
    other governmental institutions
  – bring new life into a dormant technology (Phoenix rising metaphor)
• OpenLogos linguistic data representation can be established as the
  foundation
  – freely available for private and commercial use
  – there is still need for the provision of linguistic and technical services
    and/or customer support on a fee basis
  – packaging OpenLogos with the top five Linux distributions will generate a
    constant revenue stream
• OpenLogos is an ideal platform for a hybrid MT solution
27. Contribution of OpenLogos Resources for New NLP Applications
Initially, OpenLogos EN-PT dictionary data were adapted and enhanced
with new properties (derivational, etc.) to create a new resource:
Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/).
ReEscreve uses Port4NooJ.
• SPIDER
  – System for Paraphrasing In Document Editing and Revision
  – Based on NooJ’s technology (http://ww.nooj4nlp.net/)
  – Publicly available at: http://www.linguateca.pt/ReEscreve/
  – Designed to help with writing optimization, but its applicability extends
    to MT pre-editing
• 1st version: ReEscreve (for Portuguese) and ReWriter (for English)
• 2nd version: eSPERTo (Portuguese: the smart/clever one; expert)
  Designed for integration in a cyber school project within the scope of an
  educational program to teach students how to improve their writing skills in
  the Portuguese language
• EXPERT (prototype): to assist writing of domain-specific texts
28. Contribution of OpenLogos Resources for New NLP Applications
• ParaMT
  – Bilingual/multilingual paraphraser (translator prototype)
  – Uses similar methodology to that employed by SPIDER
  – Uses bilingual data
  – Directly applicable to MT
• Corpógrafo
  – Multilingual corpora management tool
  – Available at: http://www.linguateca.pt/corpografo/
29. Uses of SPIDER
  – Authoring aid (word processing applications)
  – Language composition tool
  – Text production and style editor
  – Empirical testbed for linguistic quality assurance
  – Text (pre-)editing (machine translation)
  – “Revision memory” tool (cf. “translation memory”)
  – Applicable to general and technical language
When integrating terminologies, it helps writing in technical domains
(e.g. student texts: ReWriter; legal texts: EXPERT)
30. ReEscreve: Suggestions for Text Rewriting
[Screenshot: paraphrases of support verb constructions (SVC) presented by
ReEscreve’s paraphrasing system]
31. ReEscreve: a Rewritten Text
[Screenshot: text rewritten based on the user’s preferences]
Users can suggest new expressions!
32. Suggestions for Text ReWriting
Suggestions for general language linguistic phenomena:
• Compound adverbs > single adverbs
• Relatives > participial adjectives
• Support verb constructions > single verbs
33. Selection of paraphrasing grammars for specific linguistic phenomena
Users can select among general and technical dictionaries (more than one
selection allowed) and grammars for specific linguistic transformations (one,
several or all grammars can be selected).
The interface provides sample texts for testing.
[Screenshot: informative details about the linguistic resources selected;
sample LEGAL text]
34. Selection of a Domain Dictionary
Identification of legal terms in the text
Suggestions for the term “breach of law”
Users can select one term from the list of suggestions or provide a new
suggestion
35. Suggestions provided and user’s capability to add new rewriting options
The user can suggest new words or expressions (synonyms or paraphrases)
It is possible to go back and change the user option as many times as necessary
Text rewritten
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user
36. ParaMT: a Paraphraser Applicable to MT
PT support verb construction > EN verbs
[Screenshot: machine translation. Recognition of Portuguese SVC and translation
into English verbs]
37. Selected Publications on Paraphrasing Applications
Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision -
Applicability in Machine Translation Pre-Editing". Computational Linguistics and
Intelligent Text Processing. Proceedings of the 12th International Conference 6609 (2011),
pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9.
Part II, Lecture Notes in Computer Science
Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera
Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational
Processing of the Portuguese Language, 8th International Conference, Proceedings
(PROPOR 2008) Vol. 5190 (Aveiro, Portugal, 8-10 September 2008), Springer Verlag.
Lecture Notes in Computer Science, pp. 202-211.
Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose
paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets &
Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New
Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa,
Ontario, Canada, 29 August 2009), pp. 1-8.
38. OpenLogos for Indian Languages
• Anusaaraka group at LTRC, IIIT-Hyderabad
  – Integrating OpenLogos in their English to Hindi Language accessor
  – An OpenLogos-based English-Hindi MT prototype is already functional,
    but needs refinement before release
Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based
Machine Translation System". In Proceedings of 2010 IEEE International Conference on
Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing,
China, Aug 21-23, 2010.
• Kalinga Institute of Industrial Technology, KIIT
  – Setting up a research lab with MT based on OpenLogos technology
39. Other Efforts with OpenLogos
• Department of Political, Social and Communication Sciences,
  University of Salerno
  – PhD dissertation in which the OpenLogos English-Italian SEMTAB rules
    methodology was applied, supported by the NooJ NLP environment, to
    represent the theoretical and methodological principles of the
    Lexicon-Grammar Theory
Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and
using linguistic resources for multi-word unit processing in Machine Translation
• Main Southern African universities
  – Initial efforts to introduce OpenLogos as an MT platform for translation
    between English and the African languages (scarce resources, lack of
    parallel corpora, etc.) in an initiative similar to the one for Indian
    languages
40. OpenLogos Resources at DFKI
• The Language Technology Lab of DFKI has adapted OpenLogos from the
  commercial Logos System
• Also available at SourceForge under a GPL license
  http://openlogos-mt.sourceforge.net/
• OpenLogos employs only open source components:
  – Use of open source development tools and compilers, such as GCC
  – Replacement of non-open code and libraries
  – Use of open source databases instead of a commercial database. All
    language-specific resources have been converted to PostgreSQL
  – Use of open standards instead of vendor-specific protocols
  – As a proof of concept for the software migration, Linux is used as the
    target platform for the first open source release of Logos
41. OpenLogos Components
• Core code libraries of the server side system and basic executables to start
  and run the system (APITest, logos_batch)
• Resources, such as analysis (RES) and transfer (TRAN) grammars for
  source and target languages, and a multi-language dictionary database
• Tools: LogosTermBuilder, user administration (LogosAdmin), command
  line tools (APITest, openlogos), and a multi-user GUI for initiating and
  inspecting translation jobs and results (LogosTransCenter)
42. DFKI User Assistance with OpenLogos
• DFKI hosts an open OpenLogos mailing list dedicated to discussion
  and exchange of information concerning OpenLogos developments and
  problems at:
  http://www.dfki.de/mailman/listinfo/openlogos-list
• LinkedIn Discussion Group on OpenLogos Machine Translation
• OpenLogos Facebook page
43. Selected Publications
A few publications and technical papers describe:
• the SAL representation language
• the system architecture and workflow
Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based
Machine Translation: Philosophy, Model, Resources, and Customization. In Machine
Translation, volume 25, number 2, pp. 107-126, Springer, Heidelberg, 2011.
ISSN: 0922-6567. DOI: 10.1007/s10590-011-9091-z
Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language.
In Proceedings of the First International Workshop on Free/Open-Source Rule-Based
Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis
M. Tyers. Alicante, Spain: Universidad de Alicante. Departamento de Lenguajes y Sistemas
Informáticos. 2–3 November 2009, pp. 19–26
Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18
(2003), pp. 1–72.