Multilingual Corpus for Machine-Aided Translation
1. Enrollment No.: 9911103402
Name: Aashna Phanda
Supervisor: Mr Himanshu Agrawal (Assistant Professor)
2. Machine Translation
Machine translation is the process by which computer software translates a text from one natural language (the source language, SL) into another natural language (the target language, TL).
Mechanical dictionaries were suggested as early as the 17th century.
The first concrete attempts were two patents on mechanizing translation in the early 1930s:
o Georges Artsrouni’s “Mechanical Brain”, 22 July 1933
o Petr Smirnov-Troyanskii’s “Translating Machine”, 5 September 1933
Machine translation gained wide notice in 1949 through Warren Weaver’s memorandum “Translation”, later published in Machine Translation of Languages: Fourteen Essays (1955).
Mainframe computers were introduced in public R&D.
Large amounts of money and effort were invested in machine translation.
Many applications and online translation sites have been developed:
o Online translation sites: Google, Bing Translate, Worldlingo, Systran, Babylon
o CAT tools: Atlas, Trados ………..
3. Commonly Used Tools / Applications
Apertium: Free/open-source rule-based machine translation platform
OpenLogos: Free/open-source version of the historical Logos machine translation system
Anusaaraka: English-Hindi machine translation system
Moses: Statistical machine translation system
SDL Trados: Computer-assisted translation (CAT) software; provides translation management software, content management and language services
Google Translate: Free statistical machine translation service by Google Inc.
Bing Translation: Statistical machine translation technology developed by Microsoft Research
Babylon translation: Computer dictionary and translation program for Microsoft Windows
Systran: Hybrid machine translation technology (rule-based combined with statistical, SMT); one of the oldest machine translation companies
Worldlingo Translation: Hybrid MT technology; MT partner in Microsoft Windows and Microsoft Mac Office
4. Problem Statement
Translation strategies have changed radically over the past decades. Present-day translation systems produce output that is acceptable and understandable to a person who knows that specific language, but it cannot be relied upon for important legal clauses or for machine operations. A likely reason for this limitation is the gap between the human translator and the translation system developer.
One person can rarely offer the best of both worlds. Today, building such a system requires a translator and a developer to work together, because translation as currently practised needs extensive knowledge of grammar, which a developer typically lacks, while system development needs good programming knowledge, which a translator typically lacks.
5. Solution Approach
The existing problem can be addressed with a corpus-based translation technique. What is needed is a translation system that generates good, reliable output and produces it for multiple languages simultaneously.
Layer 1: Unit formation involves only the translators, who decide how units are built and derive the units.
Layer 2: The algorithm and logic involve both the translator and the developer. Because the concept of a unit and its usage is very simple, the developer need not be perfect in grammar and the translator need not learn coding; both can discuss the logic of the system together.
Layer 3: Development can be done by the coders alone, since the problem statement is clear and stated in simple terms, and the only goal is to code the algorithms already formulated.
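The layered split can be pictured as a unit table maintained by translators plus a small lookup written by developers. The following is only a minimal sketch under assumptions: the table layout, the language codes "hi" and "ja", the sample entries and the name translate_unit are all illustrative and do not come from the original system.

```python
# Illustrative unit table (assumed layout): translators own this data.
# Each English unit maps to one stored translation per target language.
UNIT_TABLE = {
    "thank you": {"hi": "धन्यवाद", "ja": "ありがとう"},
    "good morning": {"hi": "सुप्रभात", "ja": "おはようございます"},
    "water": {"hi": "पानी", "ja": "水"},
}


def translate_unit(unit, table=UNIT_TABLE):
    """Return the translations of one unit for all target languages at once.

    Returns None when the unit is not stored, so the caller can decide
    how to handle unknown units. (Hypothetical helper, not the author's code.)
    """
    return table.get(unit.strip().lower())


if __name__ == "__main__":
    # One lookup yields the Hindi and Japanese output simultaneously.
    print(translate_unit("Thank you"))
```

Keeping the table as plain data means a translator can add or correct units without touching the lookup code, which is the separation that Layers 1 to 3 describe.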
9. Novelty
The proposed system aims to be more accurate than existing systems. It extracts data from the text and breaks it into units, which are phrases and substrings of the text. A generic translation of each unit is stored in the database. The system translates phrases and individual units by looking them up in the database, which provides an exact translation of a phrase instead of a probable one.
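One possible way to break a sentence into units is greedy longest-match segmentation against the stored units. The slides do not say which algorithm the system uses, so this is a sketch of one candidate technique only; the sample units, the max_words limit and the name segment_into_units are assumptions. The sketch returns unit/translation pairs and does not attempt to reorder words for the target language.

```python
# Illustrative English-to-Hindi units (assumed sample data, not from the source).
UNITS = {
    "good morning": "सुप्रभात",
    "thank you": "धन्यवाद",
    "water": "पानी",
}


def segment_into_units(sentence, units=UNITS, max_words=4):
    """Split a sentence into the longest stored units, left to right.

    Returns a list of (unit, translation) pairs. Words that match no
    stored unit are returned one by one with translation None.
    """
    words = sentence.lower().split()
    result = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in units:
                result.append((candidate, units[candidate]))
                i += n
                break
        else:
            # No stored unit starts at this word: keep it untranslated.
            result.append((words[i], None))
            i += 1
    return result


if __name__ == "__main__":
    print(segment_into_units("Good morning thank you"))
```

Preferring the longest match means a stored phrase is translated as a whole before its individual words are considered, which matches the idea of an exact translation per phrase.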
10. Limitation
Although this system aims to provide more accurate and reliable translation, it has some drawbacks.
• English is the only base language, and text is translated only into Hindi and Japanese.
• More languages need to be taken into account.
• The system is bound to a single domain; with further research and work it could be tuned to handle multiple domains and input variations.
• The algorithm may provide better results, but it is more time-consuming.
• A unit may have different translations in different languages, so the translations stored in the database must be generic rather than tailored to one or two sentences.
11. Future Work
There is considerable scope and need for improving this software in terms of accuracy and quality of results. Work that could improve the system includes:
• English is used as the base language; other base languages could be supported.
• Similarly, Hindi and Japanese are the only target languages; the system can be expanded with more target languages.
• Handling other characters such as @, ., !, ( ) etc.
• Auto-correcting statements and spellings
• Reverse translation
• Deriving translation rules for features such as gender and plural
• Finding the best translations based on automated translations
• Finding an algorithm for creating intersections of units
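Two of the items above, handling special characters and reverse translation, can be sketched briefly. This is an illustration under assumptions, not part of the existing system: the function names are hypothetical, the tokenizer is meant for English input only, and the sample table entries are invented for the example.

```python
import re
from collections import defaultdict


def split_units_and_punctuation(text):
    """Split English text into words and single punctuation characters.

    Characters such as @ . ! ( ) become separate tokens, so they can be
    passed through unchanged instead of blocking a unit lookup.
    Assumption: input is English; other scripts need a different pattern.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def invert_table(table):
    """Build a reverse lookup (target text to source units).

    Several source units may share one translation, so every target
    text maps to a list of source units.
    """
    reverse = defaultdict(list)
    for source, target in table.items():
        reverse[target].append(source)
    return dict(reverse)


if __name__ == "__main__":
    print(split_units_and_punctuation("Hi (there)! a@b."))
    print(invert_table({"water": "पानी", "aqua": "पानी"}))
```

The reverse table makes the ambiguity of reverse translation visible: when one target text maps to several source units, some rule or context is still needed to choose among them.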