2. What is Translation?
The process of converting text from one language to another so that the original message is retained in the target language.
Source Language: the language whose text is to be translated.
Target Language: the language into which the text is translated.
3. Who can translate text?
Humans
Perfect translations
Very expensive
Hard to find (require knowledge of both languages)
Machines
Near-perfect within limited domains
Less expensive than humans
Available at the click of a button
4. What is machine translation?
• Machine translation is automated translation, or "translation carried out by a computer". The first proposals concerning MT were made by the Russian P. Smirnov-Troyansky and the Frenchman G. Artsrouni in the 1930s.
• The first serious discussions were begun in 1946 by the mathematician Warren Weaver.
• Globalization creates the need for machine translation.
5. MT IN INDIAN LANGUAGES
• Need:
India is a highly multilingual country, with eighteen constitutionally recognized languages and several hundred dialects and other living languages.
Even though English is understood by less than 3% of the Indian population, it continues to be the de facto link language for administration, education and business.
Hindi, the official language of the country, is used by more than 400 million people.
• Machine-aided translation (MAT) systems for Indian languages:
AnglaHindi
Anusaaraka (IIT Kanpur and IIIT Hyderabad)
Mantra
AnglaBharti
6. Core Challenges of MT
Ambiguity:
Human languages are highly ambiguous, and ambiguity manifests differently in different languages.
Ambiguity occurs at all "levels": lexical, syntactic, semantic, and in language-specific constructions and idioms.
Example:
The word 'light' can mean "not very heavy" or "not very dark"; it is lexically ambiguous.
7. Word Level (Lexical Semantics) => Lexical Semantic Ambiguity
कलम (kalam)
मैंने नीली कलम से लिखा -- लेखनी (I wrote with a blue pen: kalam = pen)
वह कलम द्वारा पत्थर पर राम लिख रहा है -- औज़ार (He is carving "Ram" on stone with a chisel: kalam = tool)
Please
Please, can I come in?
I am very pleased with your work.
8. Semantic Level
• This occurs when the meaning of the words themselves can be misinterpreted.
Example: "Iraqi head seeks arms"
The word 'head' can be a body part or the chief of a nation.
Similarly, 'arms' can be body parts or the plural of 'arm' meaning weapon.
9. Approaches to MT: The Vauquois Triangle
The Vauquois triangle relates the three classical approaches: direct translation at the bottom, transfer in the middle, and interlingua at the top, with analysis on the source side and generation on the target side.
Example: मेरा नाम निशीथ जोशी है -> My name is Nisheeth Joshi
Interlingua: Give-information+personal-data (name=निशीथ_जोशी)
Source analysis: [s [vp accusative_pronoun "नाम" proper_name]]
Target generation: [s [np [possessive_pronoun "name"]] [vp "be" proper_name]]
10. Direct Approaches
No intermediate stage in the translation.
The first MT systems, developed in the 1950s-60s, were direct systems (assembly code programs).
Morphology, bilingual dictionary lookup, and local reordering rules.
"Word-for-word, with some local word-order adjustments."
Modern approaches: EBMT and SMT.
11. Example-Based Machine Translation
Example-based machine translation is based on recalling/finding analogous examples (of the language pair).
The basic idea of Example-Based Machine Translation (EBMT) is to reuse examples of already existing translations as the basis for a new translation.
An EBMT system is given a set of sentences in the source language (from which one is translating) and the corresponding translation of each sentence in the target language, with point-to-point mapping.
12. EBMT: Basic Terminology
Database of translation pairs (translation memory).
Match the input against the example database of existing examples (as in a translation memory).
Identify the corresponding translation fragments (alignment), then recombine the fragments into the target text.
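The match/align/recombine idea above can be sketched in a few lines of Python. The translation memory, its fragments, and the greedy longest-match strategy are all invented for illustration; a real EBMT system learns its examples and alignments from a parallel corpus.

```python
# Toy illustration of the EBMT pipeline: match, align, recombine.
# The translation memory below is hand-written for this example.

translation_memory = {
    "yesterday , 200 delegates": "कल , २०० अतिथि",
    "met with the prime minister": "प्रधानमंत्री के साथ मिले",
}

def translate(sentence):
    """Greedily match source fragments against the memory and
    recombine the corresponding target fragments."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Try the longest matching fragment starting at position i.
        for j in range(len(words), i, -1):
            fragment = " ".join(words[i:j])
            if fragment in translation_memory:
                output.append(translation_memory[fragment])
                i = j
                break
        else:
            output.append(words[i])  # pass unknown words through
            i += 1
    return " ".join(output)

print(translate("Yesterday , 200 delegates met with the Prime Minister"))
```

Note that this sketch simply concatenates target fragments in source order; real EBMT systems also reorder the recombined fragments.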
13. EBMT PARADIGM
New sentence (source):
Yesterday, 200 delegates met with the Prime Minister.
Matches found in the example base (with their translations):
Yesterday, 200 delegates met behind closed doors… -- कल २०० अतिथि बंद दरवाजों के पीछे मिले…
Difficulties with the Prime Minister over… -- प्रधानमंत्री के साथ कठिनाइयों पर…
Alignment (sub-sentential) identifies the matching fragments, which are recombined.
Translated sentence (target):
कल, २०० अतिथि प्रधानमंत्री के साथ मिले
14. What is Statistical Machine Translation?
It was introduced in the early 1990s by researchers at IBM's Thomas J. Watson Research Center.
The goal is to produce, for a source sentence, the target sentence that maximizes the translation probability.
In statistical machine translation (SMT), translation systems are trained on large quantities of parallel data (from which the systems learn how to translate small segments) and monolingual data (from which the systems learn what the target language should look like).
Phrase-based SMT (Koehn et al. 2003) has emerged as the dominant paradigm in machine translation research.
Advantages:
Can quickly be trained for new languages
Can adapt to new domains
Problems:
Needs parallel data
All words, even punctuation, are treated equally
Difficult to pinpoint the causes of errors
16. SMT
A document is translated according to the probability distribution p(e|f) that a string e in the target language (for example, English) is the translation of a string f in the source language. SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3].
A statistical MT system is modeled as three separate parts:
language model
translation model
decoder
17. We apply a Bayesian approach to this:
ê = argmax_e { p(e | f) } = argmax_e { p(e) p(f | e) }
Language model (LM): assigns a probability P(e) to any target string of words; it is a probability distribution over strings that attempts to reflect how frequently a string occurs as a sentence.
Translation model (TM): assigns a probability P(f|e) to any pair of target and source strings.
Decoder: determines the translation based on the probabilities of the LM and TM.
18. Translation Models
1. Word-based model
• The fundamental unit of translation is a word.
Aligns one word of the source language with one word of the target language.
Disadvantage: the numbers of words in translated sentences differ, because of compound words, morphology and idioms.
2. Phrase-based translation
Translates whole sequences of words, where the lengths may differ.
The sequences of words are called blocks or phrases, but they are typically not linguistic phrases; they are phrasemes found using statistical methods from corpora.
It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreases the quality of translation.
वह बाजार जा रहा है
He is going to the market
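A minimal sketch of phrase-based translation for the example above. The phrase table entries and the segmentation are hand-written for illustration; a real system extracts them from word-aligned parallel data.

```python
# Sketch of phrase-based translation for वह बाजार जा रहा है:
# source phrases map to target phrases of different lengths,
# and phrases may be reordered.

phrase_table = {
    "वह": "he",
    "जा रहा है": "is going",    # 3 source words -> 2 target words
    "बाजार": "to the market",   # 1 source word  -> 3 target words
}

# chosen segmentation, reordered (verb group before the object)
source_phrases = ["वह", "जा रहा है", "बाजार"]

target = " ".join(phrase_table[p] for p in source_phrases)
print(target)  # he is going to the market
```

The differing phrase lengths are precisely what the word-based model cannot handle, which motivates the phrase-based approach.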
19. 3. Syntax-based translation
Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words.
It uses synchronous context-free grammars, modeling the reordering of clauses that occurs when translating a sentence by correspondences between phrase-structure rules in the source and target languages.
20. 4. Hierarchical phrase-based translation
Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation.
It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation, without reference to linguistically motivated syntactic constituents.
21. What is Moses?
It is an open source toolkit for Statistical Machine Translation (SMT).
Moses is under the LGPL license.
The Moses distribution uses external open source tools:
word alignment: GIZA++, mgiza, BerkeleyAligner, FastAlign
language model: SRILM, IRSTLM, RandLM, KenLM
scoring: BLEU, TER, METEOR
GIZA++ is used for making word alignments. This toolkit is an implementation of the original IBM Models that started machine translation research.
SRILM is used for language modeling.
22. Other Open Source MT Systems
• Joshua — Johns Hopkins University
http://joshua.sourceforge.net/
• CDec — University of Maryland
http://cdec-decoder.org/
• Jane — RWTH Aachen
http://www.hltpr.rwth-aachen.de/jane/
• Phrasal — Stanford University
http://nlp.stanford.edu/phrasal/
• Very similar technology
– Joshua and Phrasal are implemented in Java, the others in C++
– Joshua supports only tree-based models
– Phrasal supports only phrase-based models
• Open-sourcing tools is an increasing trend in NLP research
23. HISTORY OF MOSES
• 2005 Hieu Hoang (then a student of Philipp Koehn) starts Moses as the successor to Pharaoh
• 2006 Moses is the subject of the JHU workshop, first check-in to public repository
• 2006 Start of Euromatrix, EU project which helps fund Moses development
• 2007 First machine translation marathon held in Edinburgh
• 2009 Moses receives support from EuromatrixPlus, also EU-funded
• 2010 Moses now supports hierarchical and syntax-based models, using chart decoding
• 2011 Moses moves from sourceforge to github, after over 4000 sourceforge check-ins
• 2012 EU-funded MosesCore launched to support continued development of Moses
24. Moses Translation Process
It involves Segmenting the source sentence into
source phrases
Translating each source phrase into a target
phrase & optionally reordering the target phrases
into a target sentence.
वह बाजार जा रहा है
He is going to the market
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
25. WHAT DOES MOSES DO?
A Hindi-English parallel corpus feeds Moses training, which produces the SMT model (moses.ini).
Target-language monolingual data (mono.hi) feeds the language model.
The decoder then uses both to turn a source sentence into a target sentence.
26. WORKFLOW FOR BUILDING A PHRASE-BASED SMT SYSTEM
• Corpus preparation: Train, Tune and Test split
• Pre-processing: Normalization, tokenization, etc.
• Training: Learn phrase tables from the Training set
• Tuning: Learn weights of the discriminative model on the Tuning set
• Testing: Decode the Test set using the tuned model
• Post-processing: regenerating case, re-ranking
• Evaluation: Automated metrics or human evaluation
27. Components of Moses
Training pipeline: a collection of tools (mainly written in Perl, with some in C++) which take the raw data (parallel and monolingual) and turn it into a machine translation model.
Decoder: a single C++ application which, given a trained machine translation model and a source sentence, will translate the source sentence into the target language.
A variety of contributed tools and utilities, like GIZA++ and SRILM.
28. Training in Moses
1. Prepare data
Training data has to be provided sentence-aligned in two files: one for the foreign sentences, one for the English sentences.
The parallel corpus has to be converted into a format suitable for the GIZA++ toolkit. Two vocabulary files are generated, and the parallel corpus is converted into a numberized format. The vocabulary files contain words, integer word identifiers and word count information.
2. Run GIZA++
We need it as an initial step to establish word alignments.
29. 3. Align words
To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied.
4. Get lexical translation table
Estimate a maximum likelihood lexical translation table.
5. Extract phrases: all phrases are dumped into one big file.
6. Score phrases: estimate the phrase translation probability φ(e|f).
7. Build lexicalized reordering model
8. Build generation models: the generation model is built from the target side of the parallel corpus.
9. Create configuration file: as a final step, a configuration file for the decoder is generated, with all the correct paths for the generated model and a number of default parameter settings.
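The maximum-likelihood lexical translation table of step 4 can be sketched as follows. The aligned word pairs are invented for illustration; a real table is estimated from the full word-aligned corpus.

```python
from collections import Counter

# Sketch of step 4: a maximum-likelihood lexical translation table
# w(e|f) = count(f, e) / count(f), estimated from word-aligned pairs.
# The aligned (f, e) word pairs below are invented for illustration.

aligned_pairs = [("बाजार", "market"), ("बाजार", "market"),
                 ("बाजार", "bazaar"), ("वह", "he")]

count_fe = Counter(aligned_pairs)               # joint counts count(f, e)
count_f = Counter(f for f, _ in aligned_pairs)  # marginal counts count(f)

def w(e, f):
    """Maximum-likelihood estimate of the lexical probability w(e|f)."""
    return count_fe[(f, e)] / count_f[f]

print(w("market", "बाजार"))  # 2/3
```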
30. The Decoder
The job of the Moses decoder is to find the highest scoring sentence in the target language (according to the translation model) corresponding to a given source sentence.
The decoder is written in a modular fashion and allows the user to vary the decoding process in various ways, such as:
Input: this can be a plain sentence, annotated xml-like elements, or a complex structure like a lattice or confusion network.
Decoding algorithm: decoding is a huge search problem, generally too big for exact search, and Moses implements several different strategies for this search, such as stack-based, cube-pruning, and chart parsing.
Language model: Moses supports several different language model toolkits (SRILM, KenLM, IRSTLM, RandLM), each of which has its own strengths and weaknesses, and adding a new LM toolkit is straightforward.
Translation model: this can use phrase-to-phrase rules or hierarchical (perhaps syntactic) rules, and can be supplemented with features to add extra information to the translation process.
31. Decoder Language Models
Moses works with the following language models:
SRI language model:
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
IRST language model:
The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. The IRSTLM toolkit handles LM formats which permit reducing both storage and decoding memory requirements, and saving time in loading.
32. RandLM
Builds the largest LMs possible (for example, a 5-gram trained on one hundred billion words).
It represents LMs using a randomized data structure.
This can result in LMs that are ten times smaller than those created using SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower.
It is multithreaded.
KenLM
KenLM is a language model toolkit that is simultaneously fast and low-memory. The probabilities returned are the same as SRI, up to floating-point rounding. It is maintained by Ken Heafield, who provides additional information on his website.
KenLM is distributed with Moses and compiled by default, and it is fully thread-safe for use with multi-threaded Moses.
33. Contributed Tools
Moses Server: provides an XML-RPC interface to the decoder.
Web translation: a set of scripts to translate webpages.
Analysis tools: scripts to enable and analyze the visualization of Moses output.
34. Moses Platform
The primary development platform for Moses is Linux, and it is the recommended platform since it is easier to get support for it.
However, Moses works on other platforms also.
36. Work Done
Aim: to collect a parallel corpus of English/Hindi.
Step 1. Data collection: the data is in PDF form.
1. Bilingual data
2. English text available in PDF can easily be transferred from one format to another.
Step 2.
Problem: Hindi text cannot be directly copied because of different font styles.
Solution: we change the format of the data by converting the PDF files to JPEG form and then performing optical character recognition (OCR). Sometimes the translations of some words are incorrect; we then collect the errors, note them down and try to correct them.
No. of PDF files given: 9
No. of PDFs on which OCR was performed: 6
Total phrases updated: 200
No. of files uploaded: 4
Duplicates: 5-10 phrases
OCR of P. Chidambaram's speech.
37. MOSES INSTALLATION
1. Download Moses: Moses is downloaded from GitHub.
2. Basic setup: to compile Moses, you need the following installed on your machine: g++ and Boost.
3. After installing these, download and compile Moses:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder/
./bjam -j4
38. MANUALLY INSTALLING BOOST
• wget
http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.tar.gz
• tar zxvf boost_1_55_0.tar.gz
• cd boost_1_55_0/
• ./bootstrap.sh --prefix=/home/angla/SMT/BOOST_HOME
• ./b2 -j4 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install
|| echo FAILURE
• Once boost is installed, you can then compile Moses. However, you must tell Moses
where boost is with the --with-boost flag.
• ./bjam --with-boost=/home/angla/SMT/boost_1_55_0 -j4
39. OTHER SOFTWARE TO INSTALL
• Word Alignment- we have installed MGIZA because it is multi-threaded and give general good
result.
• Language Model Creation-
• Moses includes the KenLM language model creation program.
• We have used IRSTLM Language model.
• Other software's- su yum install [package name]
• Packages:
• git
• subversion
• automake
• libtool
• gcc-c++
• zlib-devel
• python-devel
• bzip2-devel
40. INSTALLING IRSTLM
• IRSTLM is a language modelling toolkit from FBK.
• Install IRSTLM:
• tar zxvf irstlm-5.80.03.tgz
• cd irstlm-5.80.03
• sh regenerate-makefiles.sh
• ./configure --prefix=/home/angla/SMT/IRSTLM_HOME
• make install
41. INSTALLING MGIZA
• Download mgiza from https://github.com/moses-smt/mgiza
• Build mgiza:
• cd mgiza/mgizapp
• cmake .
• make
• make install
Note: we need to build Boost first.
Compiling Moses:
./bjam --with-irstlm=/home/angla/SMT/IRSTLM_HOME \
--with-boost=/home/angla/SMT/boost_1_55_0
42. CORPUS PREPARATION
• To train a translation system we need parallel data (text translated in Hindi
and English) which is aligned at the sentence level.
• To prepare the data for training the translation system, we have to
perform the following steps:
• tokenisation: This means that spaces have to be inserted between
(e.g.) words and punctuation.
• truecasing: The initial words in each sentence are converted to their
most probable casing. This helps reduce data sparsity.
• cleaning: Long sentences and empty sentences are removed as they
can cause problems with the training pipeline, and obviously mis-aligned
sentences are removed.
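The three pre-processing steps can be illustrated with toy Python functions. Moses itself performs these with Perl scripts (tokenizer.perl, truecase.perl, clean-corpus-n.perl); the functions below only sketch the idea, and the frequent-word set is invented for illustration.

```python
import re

def tokenize(line):
    """Insert spaces between words and punctuation, then split."""
    return re.sub(r"([.,!?;:])", r" \1 ", line).split()

def truecase(tokens, lowercase_freq_words={"the", "a", "he"}):
    """Lowercase a sentence-initial word if it is usually seen lowercased
    (a stand-in for the statistics a real truecaser learns)."""
    if tokens and tokens[0].lower() in lowercase_freq_words:
        tokens = [tokens[0].lower()] + tokens[1:]
    return tokens

def clean(pairs, max_len=80):
    """Drop empty or overly long sentence pairs (given as token lists)."""
    return [(s, t) for s, t in pairs
            if 0 < len(s) <= max_len and 0 < len(t) <= max_len]

print(truecase(tokenize("The market, he said.")))
```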
44. LANGUAGE MODEL TRAINING
• The language model (LM) is used to ensure fluent output, so it is
built with the target language (i.e English in this case)
• The language model should be trained on a corpus that is suitable to
the domain. If the translation model is trained on a parallel corpus,
then the language model should be trained on the output side of that
corpus.We do 3-gram language model, removing singletons,
smoothing with improved Kneser-Ney, and adding sentence
boundary symbols
• mkdir ~/SMT/lm
• cd ~/lm
• ~/SMT/IRSTLM_HOME/bin/add-start-end.sh <
~/SMT/corpus/train.true.en > train.sb.en
• export IRSTLM=$HOME/angla/SMT/irstlm;
45. ~/irstlm/bin/compile-lm --text=yes train.lm.en.gz train.arpa.en
Binary Language Models
This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM.
Building Huge Language Models
LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, an LM file is created containing n-grams with probabilities and back-off weights.
We use the script build-lm.sh:
~/irstlm/bin/build-lm.sh -i train.sb.en -t ./tmp -p -s improved-kneser-ney -o train.lm.en
The script builds a 3-gram LM (option -n) from the specified input file (-i), splitting the training procedure into steps (-k). The LM will be saved in the output (-o) file train.lm.en.gz with an intermediate ARPA format.
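The estimation process described above (boundary symbols, n-gram counting, probabilities) can be sketched as a toy maximum-likelihood 3-gram model. The two-sentence corpus is invented, and real toolkits like IRSTLM add smoothing (e.g. improved Kneser-Ney), pruning and back-off on top of these raw counts.

```python
from collections import Counter

def add_start_end(sentence, n=3):
    """Add sentence boundary symbols, like IRSTLM's add-start-end.sh."""
    return ["<s>"] * (n - 1) + sentence.split() + ["</s>"]

corpus = ["he is going to the market", "he is going home"]

trigrams = Counter()   # counts of (w1, w2, w3)
contexts = Counter()   # counts of trigram contexts (w1, w2)
for s in corpus:
    toks = add_start_end(s)
    for i in range(len(toks) - 2):
        trigrams[tuple(toks[i:i + 3])] += 1
        contexts[tuple(toks[i:i + 2])] += 1

def p(w3, w1, w2):
    """Unsmoothed maximum-likelihood trigram probability p(w3 | w1 w2)."""
    return trigrams[(w1, w2, w3)] / contexts[(w1, w2)]

print(p("going", "he", "is"))  # 1.0: "going" always follows "he is"
print(p("home", "is", "going"))  # 0.5
```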
46. TRAINING THE TRANSLATION SYSTEM
• To do this, we run word alignment (using GIZA++), phrase extraction and scoring, create the lexicalised reordering tables, and create the Moses configuration file, all with a single command:
• mkdir ~/working
• cd ~/working
• nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train -corpus ~/SMT/corpus/train.clean -f hi -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:~/SMT/lm/train.blm.en:8 -mgiza -external-bin-dir ~/mosesdecoder/tools >& training.out &
• Once it has finished, there should be a moses.ini file in the directory ~/working/train/model.
47. STEP 1
• Create a parallel corpus: one sentence per line format.
Step 2
• Run plain2snt.out, located within the GIZA++ package:
• ./plain2snt.out hindi english
• Files created by plain2snt:
• train-en.vcb
• train-hi.vcb
• train-en-train-hi.snt
48. FILES CREATED BY plain2snt
• english.vcb consists of:
• each word from the English corpus
• the corresponding frequency count for each word
• a unique id for each word
• hindi.vcb consists of:
• each word from the Hindi corpus
• the corresponding frequency count for each word
• a unique id for each word
• hindienglish.snt consists of:
• each sentence from the parallel English and Hindi corpora, translated into the unique number for each word
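The vocabulary (.vcb) and numberized sentence (.snt) formats can be sketched as follows. The id-assignment details are simplified and may differ from what plain2snt actually emits; the tiny corpus is invented for illustration.

```python
# Sketch of the plain2snt idea: give every word an integer id and a
# frequency count (.vcb), and rewrite each sentence as a sequence of
# those ids (.snt).  Ids start at 2 here on the assumption that low
# ids are reserved; real GIZA++ conventions may differ.

corpus = ["he is going", "he is here"]

vocab, freq = {}, {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab) + 2)
        freq[word] = freq.get(word, 0) + 1

# one "<id> <word> <frequency>" line per vocabulary entry
vcb_lines = [f"{vocab[w]} {w} {freq[w]}" for w in vocab]
# each sentence as a space-separated sequence of word ids
snt_lines = [" ".join(str(vocab[w]) for w in s.split()) for s in corpus]

print(vcb_lines[0])  # 2 he 2
print(snt_lines)     # ['2 3 4', '2 3 5']
```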
49. CREATE MKCLS FILES NEEDED FOR GIZA++
Run mkcls (distributed separately from the GIZA++ package):
• mkcls -penglish -Venglish.vcb.classes
• mkcls -phindi -Vhindi.vcb.classes
Files created by mkcls:
• english.vcb.classes
• english.vcb.classes.cats
• hindi.vcb.classes
• hindi.vcb.classes.cats
The .vcb.classes files contain:
• an alphabetical list of all words (including punctuation)
• each word's corresponding frequency count
The .vcb.classes.cats files contain:
• a list of frequencies
• a set of words for each corresponding frequency
50. TUNING
• During decoding, Moses scores translation hypotheses using a linear model.
• In the traditional approach, the features of the model are the probabilities
from the language models, phrase/rule tables, and reordering models, plus
word, phrase and rule counts. Recent versions of Moses support the
augmentation of these core features with sparse features, which may be
much more numerous.
• Tuning refers to the process of finding the optimal weights for this linear
model, where optimal weights are those which maximise translation
performance on a small set of parallel sentences (the tuning set).
• Translation performance is usually measured with BLEU, but the tuning algorithms all support (at least in principle) the use of other performance measures.
• The most common tuning algorithm is Minimum Error Rate Training (MERT). This line-search based method is probably still the most widely used tuning algorithm, and the default option in Moses.
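The linear model that tuning optimizes can be sketched as a weighted sum of (log) feature values. The features, weights and hypotheses below are all invented for illustration; in Moses the features come from the LM, phrase/rule tables, reordering models and various counts.

```python
import math

# Sketch of the decoder's linear model: score(hypothesis) = sum over
# features of weight * feature value.  Tuning searches for the weights
# that maximize translation quality (e.g. BLEU) on the tuning set.

weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.2}

def score(features):
    return sum(weights[name] * value for name, value in features.items())

# two hypothetical translation hypotheses with made-up feature values
hyp_a = {"lm": math.log(0.01), "tm": math.log(0.2), "word_penalty": 6}
hyp_b = {"lm": math.log(0.0001), "tm": math.log(0.3), "word_penalty": 5}

best = max([hyp_a, hyp_b], key=score)
print(best is hyp_a)  # True: the fluent hypothesis scores higher
```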
51. TUNING COMMANDS
• Tuning requires a small amount of parallel data, separate from the training data.
• cd ~/working
• nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl ~/corpus/news-test2008.true.hi ~/corpus/news-test2008.true.en ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ &> mert.out &
• The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini
52. TESTING
• ~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini
• In order to make it start quickly, we can binarise the phrase-table and lexicalised reordering models. To do this, create a suitable directory and binarise the models as follows:
• mkdir ~/working/binarised-model
• cd ~/working
• ~/mosesdecoder/bin/processPhraseTableMin
-in train/model/phrase-table.gz -nscores 4
-out binarised-model/phrase-table
To build processPhraseTableMin, we need to compile Moses with CMPH.
53. COMPILING MOSES WITH CMPH
• Download cmph from http://sourceforge.net/projects/cmph/
• Install cmph:
• cd cmph-2.0
• ./configure (to configure the package for the system)
• make check
• make install
• Compile Moses with cmph:
• cd mosesdecoder.v211
• ./bjam --with-cmph=~/SMT/cmph-2.0
55. PROBLEMS FACED DURING INSTALLATION
• ./bjam requires Boost 104400, but we have 104100.
• Reason: Boost was not installed to a specific folder.
• Solution: install the required Boost version.
56. Linux commands
1. Command: tar
It deals with tape-drive backups.
The tar command is used to pack a collection of files and directories into an archive file:
tar -cvf tecmint-14-09-12.tar /home/tecmint/
The options used in the above command:
c: creates a new .tar archive file.
v: verbosely shows the .tar file progress.
f: file name of the archive file.
57. 2. Command: ls
The command "ls" stands for List Directory Contents: it lists the contents of the folder, be it file or folder, from which it is run.
3. Command: uname
The "uname" command stands for Unix Name: it prints detailed information about the machine name, operating system and kernel.
4. Command: history
Shows the history of commands executed in the shell.
58. 5. Command: apt-get
Performs installation of new software packages, removal of existing software packages, upgrading of existing software packages, and can even be used to upgrade the entire operating system.
apt-cache pkgnames
6. grep command examples
Search for a given string in a file (case-insensitive search):
$ grep -i "the" demo_file
7. vim command
Go to the 143rd line of a file:
$ vim +143 filename.txt
8. cd command
Use "cd -" to toggle between the last two directories.
59. 9. free command examples
This command is used to display the free, used and swap memory available in the system.
10. cat command examples
You can view multiple files at the same time. The following example prints the content of file1 followed by file2 to stdout:
$ cat file1 file2
11. chmod command
The chmod command is used to change the permissions of a file or directory.
Give full access (i.e. read, write and execute) to user and group on a specific file:
$ chmod ug+rwx file.txt
60. Importance of Moses
Moses is installable software, unlike online-only translation systems.
Online systems cannot be trained on our own data.
There is also a privacy problem if you have to translate sensitive information.
61. Conclusion
Moses is an open source toolkit, so users can modify and customize it based on their needs and requirements.
62. References
MT paper of Harold Somers
www.statmt.org/moses
Moses: Open Source Toolkit for Statistical Machine Translation, Philipp Koehn et al.
Statistical Machine Translation, Philipp Koehn
AnglaHindi: An English to Hindi Machine-Aided Translation System, R.M.K. Sinha