Assamese to English Statistical Machine Translation

Assamese-English Statistical Machine Translation
Using Moses
PRESENTED BY
KALYANEE KANCHAN BARUAH
AND
PRANJAL DAS

CONTENTS
• INTRODUCTION
• LITERATURE REVIEW
• IMPLEMENTATION
• TRANSLITERATION IN TRANSLATION
• EVALUATION
• CONCLUSION AND FUTURE WORK
• REFERENCES

INTRODUCTION
What is Natural Language Processing ?
• Natural Language Processing (NLP) is the ability
of a computer program to understand human
language as it is spoken and written.
• NLP mediates between human language and the
representations a computer can process.

WHAT IS MACHINE
TRANSLATION
• Machine translation (MT) is automated
translation. It is the process by which computer
software is used to translate a text from one
natural language (such as Assamese) to
another (such as English).

WHAT IS MACHINE
TRANSLATION
• The ideal aim of machine translation systems
is to produce the best possible translation
without human assistance. Basically every
machine translation system requires programs
for translation and automated dictionaries
and grammars to support translation.

ADVANTAGES OF MACHINE
TRANSLATION
• Quick Translation
• Low price
• Confidentiality
• Online translation and translation of web page
content
• Overcomes technological barriers

PROBLEMS IN MACHINE
TRANSLATION
• Translation is not straightforward
• Word order
• Word sense
• Idioms

TYPES OF MACHINE
TRANSLATION
• BILINGUAL
– MT systems that produce translations between any
two particular languages.
• MULTILINGUAL
– MT systems that produce translations for any
given pair of languages.
– They are preferred to bidirectional and bilingual
systems because they can translate from any given
language to any other given language and vice
versa.

SOME EXISTING MT SYSTEMS
• Google Translate
• Systran
• Bing Translator
• Babel Fish
• Apertium

SOME MAJOR MT PROJECTS IN
INDIA
• Anglabharti (and Anubharti)
• Anusaaraka
• MaTra
• UCSG-based English-Kannada MT
• Tamil-Hindi Anusaaraka and English-Tamil
MT
• Anuvadak English-Hindi software
• Sampark

MACHINE TRANSLATION
APPROACHES

STATISTICAL MACHINE
TRANSLATION
• Enables us to build machine translation
systems automatically, using statistical models
trained on text data.
• Every sentence in one language is treated as a
possible translation of a sentence in the other,
with some probability.

STATISTICAL MACHINE
TRANSLATION

LANGUAGE MODEL
• Gives the probability of a sentence
• Uses n-gram model
• IRSTLM is used to develop the Language Model
The probability of a sentence, P(S), is broken down by the chain rule
into the product of the conditional probabilities of its words:
P(S) = P(w1, w2, w3, ..., wn)
     = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3) ... P(wn|w1 w2 ... wn-1)

LANGUAGE MODEL
Suppose that, from a large corpus, we have estimated the following
bigram probabilities:
Bigram          Probability    Bigram      Probability
Eat British     .001           Eat today   .03
Eat dessert     .007           Eat Indian  .04
Eat tomorrow    .01            Eat a       .04
Eat Mexican     .02            Eat at      .04
Eat Chinese     .02            Eat dinner  .05
Eat in          .02            Eat lunch   .06
Eat breakfast   .03            Eat some    .06
Eat Thai        .03            Eat on      .16

LANGUAGE MODEL
Bigram              Probability    Bigram        Probability
British lunch       .01            Want a        .05
British cuisine     .01            Want to       .65
British restaurant  .15            I have        .04
British food        .60            I don’t       .08
To be               .02            I would       .29
To spend            .09            I want        .32
To have             .14            <start> I’m   .02
To eat              .26            <start> Tell  .04
Want Thai           .01            <start> I’d   .06
Want some           .04            <start> I     .25
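
As a small worked example, the sketch below scores the sentence "I want to eat British food" with a bigram model, i.e. it approximates each chain-rule factor P(wi | w1 ... wi-1) by P(wi | wi-1). The probabilities are the ones listed on the slides (keys lower-cased except for "I" and "British"); the function name `sentence_probability` is a hypothetical helper, and no smoothing is applied, so any unseen bigram gives probability zero.

```python
# Bigram probabilities copied from the slides: (previous word, word) -> P(word | previous word)
BIGRAMS = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("eat", "Chinese"): 0.02,
    ("British", "food"): 0.60,
}


def sentence_probability(words, bigrams):
    """Multiply bigram probabilities along the sentence, starting from <start>."""
    prob = 1.0
    prev = "<start>"
    for word in words:
        # Unseen bigrams get probability 0.0 because this sketch has no smoothing.
        prob *= bigrams.get((prev, word), 0.0)
        prev = word
    return prob


p = sentence_probability("I want to eat British food".split(), BIGRAMS)
print(f"P(I want to eat British food) = {p:.3e}")
```

The product is .25 x .32 x .65 x .26 x .001 x .60, which is small mainly because "eat British" is a rare bigram. A toolkit such as IRSTLM additionally smooths the estimates so that unseen n-grams do not zero out the whole sentence.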

TRANSLATION MODEL
• Computes the probability of a source sentence S given a
target sentence T, i.e. P(S|T).
• May be word-based or phrase-based.
• The output of the translation model (TM) is fed into the Moses decoder.
• GIZA++, along with mkcls, is used to build the translation
model.

TRANSLATION MODEL
Example :
জয়পুৰ ৰাজস্থানৰ এখন বিখ্যাত চহৰ
Jaipur is a famous city of Rajasthan
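
To illustrate what a word-based translation model stores, here is a minimal sketch that estimates P(s|t) by relative frequency from word pairs that are assumed to be already aligned. The alignments below are hand-made for illustration and are not taken from the authors' corpus; GIZA++ learns alignments automatically rather than receiving them as input, and `lexical_table` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical hand-aligned (Assamese word, English word) pairs.
ALIGNED_PAIRS = [
    ("জয়পুৰ", "Jaipur"),
    ("এখন", "a"),
    ("বিখ্যাত", "famous"),
    ("চহৰ", "city"),
    ("বিখ্যাত", "famous"),
    ("চহৰ", "city"),
    ("এখন", "one"),
    ("এক", "one"),
]


def lexical_table(pairs):
    """Return table[target][source] = count(source, target) / count(target)."""
    counts = defaultdict(Counter)
    for source_word, target_word in pairs:
        counts[target_word][source_word] += 1
    table = {}
    for target_word, source_counts in counts.items():
        total = sum(source_counts.values())
        table[target_word] = {s: c / total for s, c in source_counts.items()}
    return table


table = lexical_table(ALIGNED_PAIRS)
for target_word, sources in table.items():
    print(target_word, sources)
```

Each row of the table is a probability distribution over source words for one target word, which matches the P(S|T) direction used on the slide.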

DECODER
• Finds the most probable translation of the input text
• Searches for the target sentence T that maximizes P(T|S), which
by Bayes' rule is equivalent to maximizing P(S|T) P(T), i.e.
T* = argmax_T P(T|S) = argmax_T P(S|T) P(T)
[Figure: decoding algorithm, translation model, language model]
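
The decoding rule can be sketched as a search over candidate translations that picks the one with the highest P(S|T) P(T). The candidates and their scores below are invented for illustration only. The Moses decoder does not enumerate a fixed list like this; it searches a very large hypothesis space with beam search over a log-linear model.

```python
import math

# Hypothetical candidates for one source sentence, with made-up model scores:
# "tm" stands for P(S|T) from the translation model, "lm" for P(T) from the language model.
CANDIDATES = {
    "Jaipur is a famous city of Rajasthan": {"tm": 0.020, "lm": 0.0030},
    "Jaipur Rajasthan a famous city": {"tm": 0.030, "lm": 0.0002},
    "Jaipur is famous city Rajasthan of": {"tm": 0.020, "lm": 0.0001},
}


def score(models):
    """Log of P(S|T) * P(T); logs avoid underflow with long sentences."""
    return math.log(models["tm"]) + math.log(models["lm"])


def decode(candidates):
    """Return the candidate T that maximizes P(S|T) * P(T)."""
    return max(candidates, key=lambda t: score(candidates[t]))


best = decode(CANDIDATES)
print(best)
```

In this toy setting the second candidate has the highest translation-model score, but the language model penalizes its word order, so the fluent first candidate wins.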

IMPLEMENTATION
• Install Moses and all required packages
– Install GIZA++
– Install IRSTLM
• Training
• Tuning
• Generate output (decoding)

TRAINING THE MOSES
DECODER
• Prepare data
• Run GIZA++
• Align words
• Get lexical translation table
• Extract phrases
• Build lexicalized reordering model
• Build generation models
• Create configuration file

PREPARING THE DATA
• Tokenising - inserting spaces between words and
punctuation.
• Truecasing - converting the first word of each
sentence to its most probable case.
• Cleaning - removing empty lines, redundant spaces,
and lines that are too short or too long.
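
The tokenising and cleaning steps can be sketched as follows. This is a simplified stand-in written for illustration, not the scripts shipped with Moses: the punctuation set, the function names `tokenise` and `clean`, and the length limits are all assumptions.

```python
import re

# Punctuation marks to split off, including the Assamese danda; an assumed, incomplete set.
PUNCTUATION = re.compile(r"([.,!?;:()।])")


def tokenise(line):
    """Insert spaces around punctuation and collapse repeated whitespace."""
    spaced = PUNCTUATION.sub(r" \1 ", line)
    return " ".join(spaced.split())


def clean(pairs, min_len=1, max_len=80):
    """Drop sentence pairs where either side is empty, too short or too long."""
    kept = []
    for source, target in pairs:
        source = " ".join(source.split())  # remove redundant spaces
        target = " ".join(target.split())
        n_source, n_target = len(source.split()), len(target.split())
        if min_len <= n_source <= max_len and min_len <= n_target <= max_len:
            kept.append((source, target))
    return kept


print(tokenise("Jaipur, popularly known as the Pink City."))
print(clean([("a  b", "c"), ("", "x"), (" a   b ", " c d ")]))
```

Cleaning works on sentence pairs rather than single lines, because dropping a line on one side only would break the alignment of the parallel corpus.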

EXAMPLE PARALLEL DATA
ass-eng1.as | ass-eng1.en
বিকাণেবি ভূ বিয়া আিু বিঠাই হৈণে বিকাণেিত বকবিি পিা অবত উত্তি সািগ্ৰীসিূৈি বভতিত বকেুিাি। | The famous Bikaneri Bhujias and sweets are some of the best items to purchase in Bikaner.
ভািতিৰ্ষি গ ালপীয়া চৈি িাণি খ্যাত িয়পুি, িািস্থাি িািযি িািধািী। | Jaipur, popularly known as the Pink City, is the capital of Rajasthan state, India.
অম্বি গপণলচটণটা হৈণে গিা ল আিু বৈন্দু স্থাপতয বিদ্যাি আদ্ৰ্ষ উদ্াৈিে। | The Amber Palace is a classic example of Mughal and Hindu architecture.
কিক িৃন্দািি হৈণে িয়পুিি এখ্ি িিবিয় িিণভাি স্থাি। | Kanak Vrindavan is a popular picnic spot in Jaipur.
িয়পুি িািষলি িূবত্তষ, িীলা কলৈ আিু িািস্থািী গিাতাি িাণিও বিখ্যাত। | Jaipur is also famous for marble statues, blue pottery and the Rajasthani shoes.

SAMPLE OUTPUTS
Input Assamese sentence | Output English sentence
জয়পুৰ ৰাজস্থানৰ এখন বিখ্যাত চহৰ । | Jaipur is a famous city of Rajasthan .
তাজমহল আগ্ৰাত অৱস্থিত । | the Taj Mahal , is located in the heart of the Agra .
জামা মছজিদ শ্বাহজাহানে নিৰ্মাণ কৰিছিল। | Jama Masjid built by Shahjahan .
অন্ধ্ৰপ্ৰদেশ ভাৰতৰ এখন অন্যতম ৰাজ্যৰ ভিতৰত এক। | Andhra Pradesh is one of the state of one of India .
গুৱাহাটী অসমৰ ৰাজধানী। | Guwahati is connected by the capital of the State .
আগ্ৰা এখন বিখ্যাত চহৰ | Agra is the one of the famous city
দিল্লী ভাৰতৰ ৰাজধানী। | Delhi is the capital of India .

PROBLEMS WITH PROPER
NOUNS
Input Assamese sentence | Output English sentence
কানাদা এখন বিশাল দেশ । | কানাদা is a vast country .
মুলতান চহৰখন ৰাজস্থানৰ পৰা ৯৯৯ কি.মি. দুৰত্বত অৱস্থিত। | মুলতান from the city is located at a distance of ৯৯৯ of Rajasthan .
পানাজি গোৱাৰ ৰাজধানী । | the capital of Goa , পানাজি।

TRANSLITERATION IN
TRANSLATION
• Transliteration
– Transcription from one alphabet to another
• Proper nouns that do not occur in our corpus
are not translated.
• For example, translating “কানাদা এখন বিশাল দেশ”
gives
“কানাদা is a vast country.”
• This is because ‘কানাদা’ is not in our corpus, so the
decoder copies it to the output unchanged.

TRANSLITERATION IN
TRANSLATION
• Store each Assamese letter and its English transliteration in a
Perl script
For example: ক -> k
খ -> kh
গ -> g
• Run this Perl script on the output of Moses using the following
command
echo 'কানাদা এখন বিশাল দেশ' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./transliterate.pl
• Output: kanada is a vast country .
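
The idea behind the transliteration step can be sketched as a character-by-character lookup applied to the decoder output. The authors' script is written in Perl and is not shown on the slides, so the Python below is only an illustrative stand-in. Of the mapping entries, only ক, খ and গ come from the slides; the rest are assumptions chosen to reproduce the outputs shown ("kanada", "999"), and `transliterate` is a hypothetical function name.

```python
# Partial mapping table. Entries marked "assumed" are not given on the slides.
TABLE = {
    "ক": "k",
    "খ": "kh",
    "গ": "g",
    "ন": "n",   # assumed
    "দ": "d",   # assumed
    "া": "a",   # vowel sign aa, assumed
    "৯": "9",   # Assamese digit nine, assumed
    "।": ".",   # danda mapped to a full stop, assumed
}


def transliterate(text, table=TABLE):
    """Replace every character found in the table; leave all others untouched."""
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("কানাদা is a vast country ."))
```

Because English letters and punctuation are not in the table, already translated words pass through unchanged and only the leftover Assamese tokens are rewritten. A mapping this simple ignores inherent vowels and conjuncts, so it should be read as a sketch of the approach rather than a complete transliterator.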

IMPLEMENTING
TRANSLITERATION
Input Assamese sentence | Output before transliteration | Output after transliteration
কানাদা এখন বিশাল দেশ | কানাদা is a vast country . | kanada is a vast country .
মুলতান চহৰখন ৰাজস্থানৰ পৰা ৯৯৯ কি.মি. দুৰত্বত অৱস্থিত। | মুলতান from the city is located at a distance of ৯৯৯ of Rajasthan . | multan from the city is located at a distance of 999 of Rajasthan .
পানাজি গোৱাৰ ৰাজধানী । | the capital of Goa , পানাজি। | the capital of Goa , panaji .

EVALUATION OF BLEU SCORE
Source/Target | BLEU score | 1/2/3/4-gram precision
Assamese – English | 7.02 | 30.5/8.5/4.1/2.3
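
To show how the score relates to the n-gram precisions, the sketch below applies the standard BLEU formula: the brevity penalty times the geometric mean of the 1- to 4-gram precisions. The brevity penalty is not reported on the slide, so the value 1.0 used here is an assumption, and `bleu_from_precisions` is a hypothetical helper name.

```python
import math


def bleu_from_precisions(precisions_pct, brevity_penalty=1.0):
    """BLEU (in percent) = BP * exp(mean of log n-gram precisions)."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty * math.exp(log_mean)


# Precisions reported on the slide for Assamese to English.
score = bleu_from_precisions([30.5, 8.5, 4.1, 2.3])
print(round(score, 2))
```

With a brevity penalty of 1.0 the result comes out close to the reported 7.02, which suggests that the system's output was not much shorter than the references, although the rounding of the published precisions prevents an exact check. The sharp drop from 1-gram to 4-gram precision shows that the system often picks correct words but rarely produces long correct word sequences.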

CONCLUSION AND FUTURE
WORK
• SMT is a corpus-based approach to MT and requires a
parallel corpus before translation can be undertaken.
• A parallel corpus of about 2500 Assamese and English
sentences was used to train the system.
• The SMT system developed accepts Assamese sentences
as input and generates the corresponding translations in
English.
• The results suggest that significant improvements can be
made by increasing the size of the parallel corpus.

CONCLUSION AND FUTURE
WORK
• In the future, we will try to integrate transliteration
directly into our system.
• We will try to increase the size of our corpus, so
that we get a much better translation system.
• We will also try to implement the translation process
without using the Moses toolkit.

REFERENCES
• “Machine Translation”, [Online]. Available:
http://en.wikipedia.org/wiki/Machine_translation
• “Statistical Machine Translation” , [Online]. Available:
http://en.wikipedia.org/wiki/Statistical_machine_translation
• “Problems in Machine Translation system”, [Online]. Available:
http://languagedirect.org/machine-translation/
• “Machine Translation”, [Online]. Available:
http://faculty.ksu.edu.sa/homiedan/Publications/Machine%20Translation.pdf
• D. D. Rao, “Machine Translation: A Gentle Introduction”, Resonance, July 1998.
• S. K. Dwivedi and P. P. Sukhadeve, “Machine Translation System in Indian Perspectives”,
Journal of Computer Science, vol. 6, no. 10, pp. 1082-1087, May 2010.

REFERENCES
• P. F. Brown, S. A. Della Pietra, V. J. Della Pietra and R. L. Mercer, “The mathematics of statistical
machine translation: parameter estimation”, Computational Linguistics, vol.
19, no. 2, June 1993.
• “ Natural Language Processing” , [Online]. Available:
http://www.techopedia.com/definition/653/natural-language-processing-nlp
