Machine Translation
Version 3
Submitted By-
Khyati Gupta, Rakhi Sharma
Submitted On- 22 Jul 2015
What is Translation
The process of converting text from one language to another so that the original message is retained in the target language.
Source Language = the language whose text is to be translated.
Target Language = the language into which the text is translated.
Who can translate text
Humans
Perfect Translations
Very Expensive
Hard to Find (Require Knowledge of both languages)
Machines
Near Perfect in Domains
Less Expensive as compared to Humans
Can be found at a click of a button
What is machine translation?
•Machine translation is automated translation, i.e. “translation carried out by a computer”. The first suggestions concerning MT were made in the 1930s by the Russian Petr Smirnov-Troyanskii and the Frenchman Georges Artsrouni.
•The first serious discussions began in 1946 with the mathematician Warren Weaver.
•Globalization creates the need for machine translation.
MT IN INDIAN LANGUAGES
• Need –
 India is a highly multilingual country, with eighteen constitutionally recognized languages and several hundred dialects and other living languages.
 Even though English is understood by less than 3% of the Indian population, it continues to be the de facto link language for administration, education and business.
 Hindi, the official language of the country, is used by more than 400 million people.
MAT systems in Indian languages –
 AnglaHindi
 Anusaaraka (IIT Kanpur and IIIT Hyderabad)
 Mantra
 AnglaBharti
Core Challenges of MT
Ambiguity:
Human languages are highly ambiguous, and differently so in different languages.
Ambiguity exists at all “levels”: lexical, syntactic, semantic, and in language-specific constructions and idioms.
Example:
The word 'light' can mean 'not very heavy' or 'not very dark'; it is lexically ambiguous.
Word Level (Lexical Semantics) => Lexical Semantic Ambiguity
कलम (kalam)
मैंने नीली कलम से लिखा -- लेखनी (I wrote with a blue kalam -- a writing pen)
वह कलम द्वारा पत्थर पर राम लिख रहा है -- औज़ार (He is writing 'Ram' on the stone with a kalam -- a chisel-like tool)
Please
Please can I come in
I am very pleased with your work
Semantic Level –
• This occurs when the meaning of the words themselves can be misinterpreted.
Iraqi head seeks arm
The word 'head' can be a body part or the chief of a nation.
Similarly, 'arm' can be a body part, or 'arms' can mean weapons.
मेरा नाम निशीथ जोशी है  →  My name is Nisheeth Joshi
Interlingua representation: Give-information+personal-data (name=निशीथ_जोशी)
Source structure: [s [vp accusative_pronoun “नाम” proper_name]]
Target structure: [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
Approaches to MT: the Vauquois Triangle
(Figure: a triangle with direct translation at the base, transfer above it and interlingua at the apex; analysis rises on the source side, generation descends on the target side.)
Direct Approaches
No intermediate stage in the translation.
The first MT systems, developed in the 1950s-60s (assembly code programs), used morphology, bilingual dictionary lookup and local reordering rules:
“word-for-word, with some local word-order adjustments”.
Modern Approaches: EBMT and SMT
Example based machine translation
 Example-based machine translation is based on recalling/finding analogous examples (of the language pair).
 The basic idea of Example-Based Machine Translation (EBMT) is to reuse examples of already existing translations as the basis for new translations.
 An EBMT system is given a set of sentences in the source language (from which one is translating) and the corresponding translation of each sentence in the target language, with point-to-point mappings.
EBMT basic terminology
A database of translation pairs (a translation memory).
Match the input against the example database (the existing examples, as in a translation memory).
Identify the corresponding translation fragments (alignment).
Recombine the fragments into the target text.
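A minimal sketch of the matching step against a toy translation memory. The pairs and the similarity measure (difflib's sequence ratio) are illustrative; real EBMT systems use richer matching plus sub-sentential alignment:

```python
import difflib

# Toy translation memory: (source, target) pairs. The examples are hypothetical.
translation_memory = [
    ("he is going to the market", "वह बाजार जा रहा है"),
    ("she is going to school", "वह स्कूल जा रही है"),
]

def best_match(input_sentence, memory):
    """Return (similarity, matched source, stored translation) for the closest example."""
    return max(
        (difflib.SequenceMatcher(None, input_sentence, src).ratio(), src, tgt)
        for src, tgt in memory
    )

print(best_match("he is going to school", translation_memory))
```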
EBMT PARADIGM
New Sentence (Source): Yesterday, 200 delegates met with the Prime Minister.
Matches to Source Found:
“Yesterday, 200 delegates met behind closed doors…” ↔ “कल २०० अतिथि बंद दरवाजों के पीछे मिले…”
“Difficulties with the Prime Minister over…” ↔ “प्रधानमंत्री के साथ कठिनाइयों पर…”
Alignment (Sub-sentential): the matching fragments of the two examples are aligned and recombined.
Translated Sentence (Target): कल, २०० अतिथि प्रधानमंत्री के साथ मिले
What is Statistical Machine Translation?
SMT was introduced in the early 1990s by researchers at IBM's Thomas J. Watson Research Center.
The goal is to produce, for a given source sentence, the target sentence that maximizes the translation probability.
In statistical machine translation (SMT), translation systems are trained on large quantities of parallel data (from which the systems learn how to translate small segments) and monolingual data (from which the systems learn what the target language should look like).
Phrase-based SMT (Koehn et al. 2003) has emerged as the dominant paradigm in machine translation research.
Advantages:
Can quickly be trained for new languages
Can adapt to new domains
Problems:
Needs parallel data
All words, even punctuation, are treated equally
Difficult to pinpoint the causes of errors
SMT
A document is translated according to the probability distribution p(e|f) that a string e in the target language (for example, English) is the translation of a string f in the source language. SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3].
Statistical MT system is modeled as three separate parts:
language model
translation model
decoder
We apply a Bayesian (noisy channel) approach for this:
ê = argmax_e { p(e | f) } = argmax_e { p(e) · p(f | e) }
Language model (LM): assigns a probability to any target string of words {p(e)}.
An LM is a probability distribution over strings that attempts to reflect how frequently a string S occurs as a sentence.
Translation model (TM): assigns a probability to any pair of target and source strings {p(f|e)}.
Decoder: determines the translation based on the probabilities of the LM and TM.
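A toy illustration of the decoder's job, ê = argmax p(e)·p(f|e), over a hypothetical candidate list with made-up probabilities:

```python
import math

# Hypothetical candidate translations e for one source sentence f,
# with made-up language model and translation model probabilities.
candidates = {
    "he goes to market":         {"p_lm": 0.010, "p_tm": 0.20},
    "he is going to the market": {"p_lm": 0.020, "p_tm": 0.15},
    "market he going":           {"p_lm": 0.001, "p_tm": 0.30},
}

def noisy_channel_score(p_lm, p_tm):
    # Work in log space to avoid underflow: log p(e) + log p(f|e)
    return math.log(p_lm) + math.log(p_tm)

best = max(candidates, key=lambda e: noisy_channel_score(**candidates[e]))
print(best)  # the ê that maximises p(e) * p(f|e)
```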
Translation Models
1. Word-based model
•The fundamental unit of translation is a word.
Aligns one word of the source language with one word of the target language (the alignment probabilities are estimated with EM; see the IBM Model 1 sketch after this list).
Disadvantage: the numbers of words in translated sentences differ, because of compound words, morphology and idioms.
2. Phrase-based translation
Translates whole sequences of words, where the lengths may differ.
The sequences of words are called blocks or phrases, but typically they are not linguistic phrases; they are phrasemes found using statistical methods from corpora.
It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words; see syntactic categories) decreases the quality of translation.
 वह बाजार जा रहा है
 He is going to the market
3. Syntax-based translation
Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words.
It uses synchronous context-free grammars, modeling the reordering of clauses that occurs when translating a sentence by correspondences between phrase-structure rules in the source and target languages.
4. Hierarchical phrase-based translation
Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation.
It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation, without reference to linguistically motivated syntactic constituents.
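Returning to the word-based model (item 1 in the list above): word translation probabilities of this kind are classically estimated with EM, as in IBM Model 1. A minimal sketch on a hypothetical toy corpus:

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch on a toy parallel corpus (hypothetical data).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Uniform initialisation of t(target_word | source_word)
src_vocab = {s for src, _ in corpus for s in src}
tgt_vocab = {t for _, tgt in corpus for t in tgt}
t = {(tw, sw): 1.0 / len(tgt_vocab) for sw in src_vocab for tw in tgt_vocab}

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(tw, sw)
    total = defaultdict(float)           # normaliser per source word
    for src, tgt in corpus:
        for tw in tgt:                   # E-step: collect fractional counts
            norm = sum(t[(tw, sw)] for sw in src)
            for sw in src:
                frac = t[(tw, sw)] / norm
                count[(tw, sw)] += frac
                total[sw] += frac
    for (tw, sw), c in count.items():    # M-step: re-estimate t(tw | sw)
        t[(tw, sw)] = c / total[sw]

print(round(t[("house", "haus")], 3))    # converges toward 1.0
```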
What is Moses?
It is an open-source toolkit for Statistical Machine Translation (SMT).
Moses is under the LGPL license.
The Moses distribution uses external open-source tools:
 word alignment: GIZA++, MGIZA, BerkeleyAligner, fast_align
 language model: SRILM, IRSTLM, RandLM, KenLM
 scoring: BLEU, TER, METEOR
GIZA++: used for producing word alignments.
This toolkit is an implementation of the original IBM models that started statistical machine translation research.
 SRILM: used for language modeling.
Other Open Source MT Systems
• Joshua — Johns Hopkins University
http://joshua.sourceforge.net/
• CDec — University of Maryland
http://cdec-decoder.org/
• Jane — RWTH Aachen
http://www.hltpr.rwth-aachen.de/jane/
• Phrasal — Stanford University
http://nlp.stanford.edu/phrasal/
• Very similar technology
– Joshua and Phrasal implemented in Java, others in C++
– Joshua supports only tree-based models
– Phrasal supports only phrase-based models
• Open-sourcing tools is an increasing trend in NLP research
HISTORY OF MOSES
• 2005 Hieu Hoang (then a student of Philipp Koehn) starts Moses as the successor to Pharaoh
• 2006 Moses is the subject of the JHU workshop, first check-in to public repository
• 2006 Start of Euromatrix, EU project which helps fund Moses development
• 2007 First machine translation marathon held in Edinburgh
• 2009 Moses receives support from EuromatrixPlus, also EU-funded
• 2010 Moses now supports hierarchical and syntax-based models, using chart decoding
• 2011 Moses moves from sourceforge to github, after over 4000 sourceforge check-ins
• 2012 EU-funded MosesCore launched to support continued development of Moses
Moses Translation Process
It involves segmenting the source sentence into source phrases, translating each source phrase into a target phrase, and optionally reordering the target phrases into a target sentence.
 वह बाजार जा रहा है
 He is going to the market
• Foreign input is segmented in phrases
• Each phrase is translated into English
• Phrases are reordered
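A toy walk-through of the three steps on the example above; the phrase table, the segmentation and the reordering are hand-picked here, whereas Moses searches over all of them:

```python
# Hypothetical phrase table for the example sentence.
phrase_table = {
    ("वह",): "he",
    ("बाजार",): "the market",
    ("जा", "रहा", "है"): "is going to",
}

segments = [("वह",), ("बाजार",), ("जा", "रहा", "है")]      # 1. segment the source
translated = [phrase_table[seg] for seg in segments]        # 2. translate each phrase
reordered = [translated[0], translated[2], translated[1]]   # 3. reorder the phrases
print(" ".join(reordered))   # he is going to the market
```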
WHAT DOES MOSES DO?
(Diagram) Training: a Hindi-English parallel corpus is fed to Moses training, which produces the SMT model (moses.ini); target-language monolingual data (mono.hi) is used to build the language model.
Decoding: given the SMT model and the language model, the decoder turns a source sentence into a target sentence.
WORKFLOW FOR BUILDING A
PHRASE BASED SMT SYSTEM
• Corpus preparation: Train, Tune and Test split
• Pre-processing: Normalization, tokenization, etc.
• Training: Learn Phrase tables from Training set
• Tuning: Learn weights of the discriminative model on the Tuning set
• Testing: Decode Test set using tuned data
• Post-processing: regenerating case, re-ranking
• Evaluation: Automated Metrics or human evaluation
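The Evaluation step usually reports BLEU. A minimal sentence-level sketch with a single reference and no smoothing (real evaluations use corpus-level BLEU):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU sketch (single reference, no smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i+n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i+n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())        # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(bleu("he is going to the market".split(),
           "he is going to the market".split()))   # 1.0 for a perfect match
```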
Components of Moses
Training pipeline: a collection of tools (mainly written in Perl, with some in C++) which take the raw data (parallel and monolingual) and turn it into a machine translation model.
Decoder: a single C++ application which, given a trained machine translation model and a source sentence, translates the source sentence into the target language.
A variety of contributed tools and utilities, like GIZA++ and SRILM.
Training in Moses
1. Prepare data
Training data has to be provided sentence aligned
in two files, one for the foreign sentences, one for the English sentences
The parallel corpus has to be converted into a format that is suitable to the GIZA++ toolkit. Two
vocabulary files are generated and the parallel corpus is converted into a numberized format.
The vocabulary files contain words, integer word identifiers and word count information
2. Run GIZA++
We need it as an initial step to establish word alignments.
3. Align words
To establish word alignments based on the two directional GIZA++ alignments, a number of symmetrization heuristics may be applied (see the first sketch after this list).
4. Get lexical translation table
 Estimate a maximum-likelihood lexical translation table.
5. Extract phrases: all phrases are dumped into one big file.
6. Score phrases: estimate the phrase translation probability φ(e|f) (see the second sketch after this list).
7. Build lexicalized reordering model.
8. Build generation models: the generation model is built from the target side of the parallel corpus.
9. Create configuration file: as a final step, a configuration file for the decoder is generated, with all the correct paths for the generated model and a number of default parameter settings.
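Two minimal sketches for steps 3 and 6, on hypothetical toy data. First, combining the two directional GIZA++ alignments with simple symmetrization heuristics (Moses' default, grow-diag-final-and, starts from the intersection and grows it with neighbouring links from the union):

```python
# Step 3, symmetrization: two directional word alignments for one sentence
# pair, as sets of (source_index, target_index) links. Links are hypothetical.
src_to_tgt = {(0, 0), (1, 4), (2, 2), (3, 2)}   # e.g. from the hi->en GIZA++ run
tgt_to_src = {(0, 0), (1, 4), (2, 2)}           # e.g. from the en->hi GIZA++ run

intersection = src_to_tgt & tgt_to_src   # high precision, fewer links
union = src_to_tgt | tgt_to_src          # high recall, noisier links

# grow-diag-final style heuristics start from the intersection and add
# neighbouring links drawn from the union.
print(sorted(intersection), sorted(union))
```

Second, relative-frequency estimation of the phrase translation probability φ(e|f) from extracted phrase pairs:

```python
from collections import Counter

# Step 6, phrase scoring: hypothetical extracted phrase pairs, one entry
# per occurrence in the word-aligned training corpus.
extracted = [
    ("बाजार", "the market"),
    ("बाजार", "the market"),
    ("बाजार", "market"),
]

pair_counts = Counter(extracted)
src_counts = Counter(src for src, _ in extracted)

# Relative-frequency estimate of the phrase translation probability phi(e|f).
phi = {(src, tgt): n / src_counts[src] for (src, tgt), n in pair_counts.items()}
print(phi[("बाजार", "the market")])   # 2/3
```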
The Decoder
The job of the Moses decoder is to find the highest scoring sentence in the
target language (according to the translation model) corresponding to a given
source sentence.
The decoder is written in a modular fashion and allows the user to vary the
decoding process in various ways, such as:
Input: this can be a plain sentence, a sentence annotated with XML-like elements, or a complex structure like a lattice or confusion network.
Translation model: this can use phrase-to-phrase rules or hierarchical (perhaps syntactic) rules, and it can be supplemented with features that add extra information to the translation process.
Decoding algorithm: decoding is a huge search problem, generally too big for exact search, and Moses implements several different strategies for this search, such as stack-based decoding, cube pruning and chart parsing.
Language model: Moses supports several different language model toolkits (SRILM, KenLM, IRSTLM, RandLM), each of which has its own strengths and weaknesses, and adding a new LM toolkit is straightforward.
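A toy illustration of the search problem, assuming a tiny hypothetical phrase table: exhaustive monotone decoding (real Moses uses stack-based beam search with reordering, future-cost estimation and pruning):

```python
import math

# Hypothetical phrase options: source span (start, end) -> list of
# (target phrase, log translation probability).
phrase_options = {
    (0, 1): [("he", math.log(0.8))],
    (1, 2): [("the market", math.log(0.7)), ("market", math.log(0.3))],
    (2, 5): [("is going to", math.log(0.9))],
}

def lm_logprob(phrase):
    # Stand-in for a real language model score (here: a word penalty).
    return -0.1 * len(phrase.split())

def decode(start, length):
    """Best-scoring monotone hypothesis covering source words [start, length)."""
    if start == length:
        return 0.0, []
    best_score, best_phrases = -math.inf, None
    for (i, j), options in phrase_options.items():
        if i != start:
            continue
        tail_score, tail = decode(j, length)
        if tail is None:
            continue
        for target, tm_logprob in options:
            score = tm_logprob + lm_logprob(target) + tail_score
            if score > best_score:
                best_score, best_phrases = score, [target] + tail
    return best_score, best_phrases

# Source: वह(0) बाजार(1) जा(2) रहा(3) है(4); only monotone order is searched
# here, so the reordering step would still move "the market" to the end.
print(decode(0, 5))
```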
Decoder Language Models
The decoder works with the following language models:
SRI language model
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
It has been under development in the SRI Speech Technology and Research Laboratory since 1995.
IRST language model
The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs.
The IRSTLM toolkit handles LM formats which permit reducing both storage and decoding memory requirements, and saving time in loading LMs.
RandLM
RandLM is designed to build the largest LMs possible (for example, a 5-gram trained on one hundred billion words).
It represents LMs using a randomized data structure.
This can result in LMs that are ten times smaller than those created using SRILM (and also smaller than IRSTLM), but at the cost of making decoding about four times slower.
It is multithreaded.
KenLM
KenLM is a language model library that is simultaneously fast and low-memory.
The probabilities returned are the same as SRILM's, up to floating-point rounding.
 It is maintained by Ken Heafield, who provides additional information on his website.
KenLM is distributed with Moses and compiled by default; it is fully thread-safe for use with multi-threaded Moses.
Contributed Tools
Moses Server: provides an XML-RPC interface to the decoder.
Web translation: a set of scripts to translate web pages.
Analysis tools: scripts to visualize and analyze Moses output.
Moses Platform
The primary development platform for Moses is Linux, and Linux is the recommended platform, since it is easier to get support for it.
However, Moses works on other platforms as well.
Moses Releases
Moses 1.0 (28th Jan 2013)
Moses 0.91 (12th Oct 2012)
Work Done
Aim: to collect a parallel English/Hindi corpus.
Step 1. Data collection: the data is in PDF form.
1. Bilingual data.
2. English text available in a PDF can easily be transferred from one format to another.
Step 2.
Problem: Hindi text cannot be copied directly, because of differing font styles.
Solution: we change the format of the data by converting the PDF files to JPEG images and performing optical character recognition (OCR).
Sometimes, when we check the output, the recognition of some words is incorrect; we then collect the errors, note them down and try to correct them.
No. of PDF files given: 9
No. of PDFs on which OCR was performed: 6
Total phrases updated: 200
No. of files uploaded: 4
Duplicates: 5-10 phrases
OCR of P. Chidambaram's speech (bhashan).
MOSES INSTALLATION
1. Download Moses: Moses is downloaded from GitHub.
2. Basic setup:
To compile Moses, you need the following installed on your machine: g++ and Boost.
3. After installing these, download and build Moses:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder/
./bjam -j4
MANUALLY INSTALLING BOOST
• wget
http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.tar.gz
• tar zxvf boost_1_55_0.tar.gz
• cd boost_1_55_0/
• ./bootstrap.sh --prefix=/home/angla/SMT/BOOST_HOME
• ./b2 -j4 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install
|| echo FAILURE
• Once boost is installed, you can then compile Moses. However, you must tell Moses
where boost is with the --with-boost flag.
• ./bjam --with-boost=/home/angla/SMT/boost_1_55_0 -j4
OTHER SOFTWARE TO INSTALL
• Word alignment: we have installed MGIZA because it is multi-threaded and gives generally good results.
• Language model creation:
• Moses includes the KenLM language model creation program.
• We have used the IRSTLM language model.
• Other software: install as root with yum install [package name]
• Packages:
• git
• subversion
• automake
• libtool
• gcc-c++
• zlib-devel
• python-devel
• bzip2-devel
INSTALLING IRSTLM
• IRSTLM is a language modelling toolkit from FBK
• install IRSTLM-
• tar zxvf irstlm-5.80.03.tgz
• cd irstlm-5.80.03
• sh regenerate-makefiles.sh
• ./configure --prefix=/home/angla/SMT/IRSTLM_HOME
• make install
INSTALLING MGIZA
• Download mgiza from https://github.com/moses-smt/mgiza
• Build mgiza:
• cd mgiza/mgizapp
• cmake .
• make
• make install
Note: we need to build Boost first.
Compiling Moses:
./bjam --with-irstlm=/home/angla/SMT/IRSTLM_HOME \
--with-boost=/home/angla/SMT/boost_1_55_0
CORPUS PREPARATION
• To train a translation system we need parallel data (text translated in Hindi
and English) which is aligned at the sentence level.
• To prepare the data for training the translation system, we have to
perform the following steps:
• tokenisation: This means that spaces have to be inserted between
(e.g.) words and punctuation.
• truecasing: The initial words in each sentence are converted to their
most probable casing. This helps reduce data sparsity.
• cleaning: Long sentences and empty sentences are removed as they
can cause problems with the training pipeline, and obviously mis-aligned
sentences are removed.
TOKENIZATION
• ~/SMT/mosesdecoderv211/scripts/tokenizer/tokenizer.perl -l en \
< ~/SMT/corpus/training/train.en > ~/SMT/corpus/train.tok.en
• ~/SMT/mosesdecoderv211/scripts/tokenizer/tokenizer.perl -l hi \
< ~/SMT/corpus/training/train.hi > ~/SMT/corpus/training/train.tok.hi
TRUECASING
• ~/SMT/mosesdecoderv211/scripts/recaser/train-truecaser.perl \
--model ~/SMT/corpus/truecase-model.en --corpus ~/SMT/corpus/train.tok.en
• ~/SMT/mosesdecoderv211/scripts/recaser/truecase.perl --model ~/SMT/corpus/truecase-model.en \
< ~/SMT/corpus/train.tok.en > ~/SMT/corpus/train.true.en
(The same pair of commands is run for the Hindi side with truecase-model.hi, though Devanagari has no letter case, so truecasing mainly affects the English side.)
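A rough sketch of what the truecaser learns, assuming simple frequency counting (the real script also treats sentence-initial positions specially when collecting statistics):

```python
from collections import Counter

# Rough sketch of what train-truecaser.perl learns: the most frequent
# surface casing of each word in the training corpus. At test time the
# sentence-initial word is mapped to that casing, reducing data sparsity.
tokens = "The market opened . The market closed . Market traders left .".split()
casings = Counter(tokens)

def truecase_first_word(word):
    forms = [f for f in casings if f.lower() == word.lower()]
    return max(forms, key=casings.get) if forms else word

print(truecase_first_word("Market"))  # -> "market" (the lowercase form is more frequent)
```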
Cleaning
• ~/SMT/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/train.true en hi
~/SMT/corpus/train.clean 1 80
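Roughly what clean-corpus-n.perl does with the arguments 1 and 80 above; the ratio test here is a simplification of the script's actual checks:

```python
def clean_pair(src_line, tgt_line, min_len=1, max_len=80, max_ratio=9.0):
    """Keep a sentence pair only if both sides are within the length bounds
    and the length ratio is not extreme (a sign of mis-alignment)."""
    src, tgt = src_line.split(), tgt_line.split()
    if not (min_len <= len(src) <= max_len and min_len <= len(tgt) <= max_len):
        return False
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return longer / shorter <= max_ratio

print(clean_pair("वह बाजार जा रहा है", "he is going to the market"))  # True
print(clean_pair("ok", " ".join(["word"] * 100)))                     # False
```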
LANGUAGE MODEL TRAINING
• The language model (LM) is used to ensure fluent output, so it is built with the target language (i.e. English in this case).
• The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus. We build a 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols:
• mkdir ~/SMT/lm
• cd ~/SMT/lm
• ~/SMT/IRSTLM_HOME/bin/add-start-end.sh < ~/SMT/corpus/train.true.en > train.sb.en
• export IRSTLM=$HOME/angla/SMT/irstlm;
Binary Language Models
This format can be properly managed through the compile-lm command, in order to produce a compiled version or a standard ARPA version of the LM:
~/irstlm/bin/compile-lm --text=yes train.lm.en.gz train.arpa.en
Building Huge Language Models
LM estimation starts with the collection of n-grams and their frequency
counters. Then, smoothing parameters are estimated for each n-gram level;
infrequent n-grams are possibly pruned and, finally, a LM file is created
containing n-grams with probabilities and back-off weights.
We use the script build-lm.sh:
~/irstlm/bin/build-lm.sh -i train.sb.en -t ./tmp -p -s improved-kneser-ney -o train.lm.en
The script builds a 3-gram LM (option -n) from the specified input file (-i), splitting the training procedure into 10 steps (-k). The LM is saved in the output (-o) file train.lm.en.gz with an intermediate ARPA format.
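Conceptually, what such a toolkit estimates before smoothing is just n-gram relative frequencies over boundary-padded sentences; a toy unsmoothed trigram sketch (real toolkits add Kneser-Ney smoothing and back-off weights):

```python
from collections import Counter

# Maximum likelihood trigram probabilities over boundary-padded sentences.
sentences = [["he", "is", "going", "to", "the", "market"],
             ["she", "is", "going", "to", "school"]]

trigrams, histories = Counter(), Counter()
for s in sentences:
    padded = ["<s>", "<s>"] + s + ["</s>"]
    for i in range(len(padded) - 2):
        trigrams[tuple(padded[i:i + 3])] += 1
        histories[tuple(padded[i:i + 2])] += 1

def p(w3, w1, w2):
    """Unsmoothed P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)."""
    return trigrams[(w1, w2, w3)] / max(histories[(w1, w2)], 1)

print(p("going", "he", "is"))   # 1.0: "he is" is always followed by "going" here
```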
TRAINING THE TRANSLATION
SYSTEM
• To do this, we run word-alignment (using GIZA++),
• phrase extraction
• and scoring,
• create lexicalised reordering tables
• and create your Moses configuration file, all with a single
command
• mkdir ~/working
• cd ~/working
• nohup nice ~/mosesdecoder/scripts/training/train-model.perl -root-dir train \
-corpus ~/SMT/corpus/train.clean -f hi -e en -alignment grow-diag-final-and \
-reordering msd-bidirectional-fe -lm 0:3:~/SMT/lm/train.blm.en:8 -mgiza \
-external-bin-dir ~/mosesdecoder/tools >& training.out &
• Once it's finished there should be a moses.ini file in the directory
~/working/train/model.
STEP 1
• Create a parallel corpus: one sentence per line format
Step 2
• Run plain2snt.out located within the GIZA++
package
•./plain2snt.out hindi english
• Files created by plain2snt
• train-en.vcb
• train-hi.vcb
• train-en-train-hi.snt
FILES CREATED BY plain2snt
• english.vcb consists of:
• each word from the English corpus
• the corresponding frequency count for each word
• a unique id for each word
• hindi.vcb consists of:
• each word from the Hindi corpus
• the corresponding frequency count for each word
• a unique id for each word
• the .snt file consists of:
• each sentence from the parallel English and Hindi corpora, with every word replaced by its unique number
CREATE MKCLS FILES NEEDED
FOR GIZA++:
Run mkcls, which is not located within the GIZA++ package:
•mkcls -penglish -Venglish.vcb.classes
•mkcls -phindi -Vhindi.vcb.classes
Files created by _mkcls
• english.vcb.classes
• english.vcb.classes.cats
• hindi.vcb.classes
• hindi.vcb.classes.cats
• .vcb.classes files contain:
• an alphabetical list of all words (including punctuation)
• each word's corresponding frequency count
• .vcb.classes.cats files contain:
• a list of frequencies
• the set of words for each corresponding frequency
TUNING
• During decoding, Moses scores translation hypotheses using a linear model.
• In the traditional approach, the features of the model are the probabilities
from the language models, phrase/rule tables, and reordering models, plus
word, phrase and rule counts. Recent versions of Moses support the
augmentation of these core features with sparse features, which may be
much more numerous.
• Tuning refers to the process of finding the optimal weights for this linear
model, where optimal weights are those which maximise translation
performance on a small set of parallel sentences (the tuning set).
• Translation performance is usually measured with BLEU, but the tuning algorithms all support (at least in principle) the use of other performance measures.
• The default tuning algorithm in Moses is Minimum Error Rate Training (MERT); this line-search based method is probably still the most widely used tuning algorithm.
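A sketch of the linear model being tuned; the feature values and weights below are hypothetical:

```python
# Each hypothesis gets a score sum_i w_i * h_i; tuning (e.g. MERT) searches
# for the weights w that maximise BLEU on the tuning set.
features = {"lm": -12.4, "phrase_table": -8.1, "reordering": -2.3,
            "word_penalty": -6.0}
weights = {"lm": 0.5, "phrase_table": 0.3, "reordering": 0.1,
           "word_penalty": 0.1}

score = sum(weights[name] * value for name, value in features.items())
print(score)   # hypotheses are ranked by this weighted sum
```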
TUNING COMMANDS
• Tuning requires a small amount of parallel data, separate from
the training data.
• cd ~/working
• nohup nice ~/mosesdecoder/scripts/training/mert-moses.pl \
• ~/corpus/news-test2008.true.fr ~/corpus/news-test2008.true.en \
• ~/mosesdecoder/bin/moses train/model/moses.ini --mertdir ~/mosesdecoder/bin/ \
• &> mert.out &
• The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini
TESTING
• ~/mosesdecoder/bin/moses -f ~/working/mert-work/moses.ini
• In order to make it start quickly, we can binarise the phrase table and lexicalised reordering models. To do this, create a suitable directory and binarise the models as follows:
• mkdir ~/working/binarised-model
• cd ~/working
• ~/mosesdecoder/bin/processPhraseTableMin \
-in train/model/phrase-table.gz -nscores 4 \
-out binarised-model/phrase-table
In order to build processPhraseTableMin, we need to compile Moses with CMPH.
COMPILING MOSES WITH CMPH
• Download cmph from http://sourceforge.net/projects/cmph/
• Install CMPH:
• cd cmph-2.0
• ./configure (to configure the package for the system)
• make check
• make install
• Compile Moses with CMPH:
• cd mosesdecoder.v211
• ./bjam --with-cmph=~/SMT/cmph-2.0
BINARISING REORDERING
TABLE
• ~/mosesdecoder/bin/processLexicalTableMin \
• -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz \
• -out binarised-model/reordering-table
• ERROR: not in the correct format!
PROBLEMS FACED DURING
INSTALLATION
• ./bjam requires Boost 104400, but we have 104100.
• Reason: Boost was not installed to the specific folder.
• Solution: reinstall Boost.
Linux commands
1. Command: tar
Originally created for tape-drive backups, the tar command is used to pack a collection of files and directories into a single archive file.
tar -cvf tecmint-14-09-12.tar /home/tecmint/
The options used in the above command:
c – creates a new .tar archive file.
v – verbosely shows the .tar file progress.
f – specifies the file name of the archive file.
2. Command: ls
The "ls" command stands for List Directory Contents; it lists the contents of the folder, whether files or subfolders, from which it is run.
3. Command: uname
The "uname" command stands for Unix Name; it prints detailed information about the machine name, operating system and kernel.
4. Command: history
Shows the history of commands executed in the shell.
5. Command: apt-get
Performs installation of new software packages, removal of existing packages, upgrading of existing packages, and can even upgrade the entire operating system.
apt-cache pkgnames (lists the available package names)
6. grep command example
Search for a given string in a file (case-insensitive search):
$ grep -i "the" demo_file
7. vim command
Go to the 143rd line of a file:
$ vim +143 filename.txt
8. cd command
Use "cd -" to toggle between the last two directories.
9. free command
Displays the free, used and swap memory available in the system.
10. cat command
You can view multiple files at the same time. The following example prints the contents of file1 followed by file2 to stdout:
$ cat file1 file2
11. chmod command
The chmod command is used to change the permissions of a file or directory.
Give full access (read, write and execute) to user and group on a specific file:
$ chmod ug+rwx file.txt
Importance of Moses
Moses is installable software, unlike online-only translation systems.
Online systems cannot be trained on our own data.
There is also a privacy problem if you have to translate sensitive information.
Conclusion
Moses is an open-source toolkit, so users can modify and customize it according to their needs and requirements.
References
Machine translation paper by Harold Somers
www.statmt.org/moses
Koehn, Philipp, et al. "Moses: Open Source Toolkit for Statistical Machine Translation."
Koehn, Philipp. Statistical Machine Translation.
Sinha, R.M.K. "AnglaHindi: An English to Hindi Machine-Aided Translation System."
Thank You
Weitere ähnliche Inhalte

Was ist angesagt?

Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
Parisa Niksefat
 
6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation
RIILP
 
7. Trevor Cohn (usfd) Statistical Machine Translation
7. Trevor Cohn (usfd) Statistical Machine Translation7. Trevor Cohn (usfd) Statistical Machine Translation
7. Trevor Cohn (usfd) Statistical Machine Translation
RIILP
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
RIILP
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
behzad66
 
8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation
RIILP
 

Was ist angesagt? (19)

NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
Ijetcas14 444
Ijetcas14 444Ijetcas14 444
Ijetcas14 444
 
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid ApproachPunjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach
Punjabi to Hindi Transliteration System for Proper Nouns Using Hybrid Approach
 
Assamese to English Statistical Machine Translation
Assamese to English Statistical Machine TranslationAssamese to English Statistical Machine Translation
Assamese to English Statistical Machine Translation
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
 
6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation6. Khalil Sima'an (UVA) Statistical Machine Translation
6. Khalil Sima'an (UVA) Statistical Machine Translation
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
7. Trevor Cohn (usfd) Statistical Machine Translation
7. Trevor Cohn (usfd) Statistical Machine Translation7. Trevor Cohn (usfd) Statistical Machine Translation
7. Trevor Cohn (usfd) Statistical Machine Translation
 
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Personalising speech to-speech translation
Personalising speech to-speech translationPersonalising speech to-speech translation
Personalising speech to-speech translation
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
Machine Tanslation
Machine TanslationMachine Tanslation
Machine Tanslation
 
8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation8. Qun Liu (DCU) Hybrid Solutions for Translation
8. Qun Liu (DCU) Hybrid Solutions for Translation
 
D3 dhanalakshmi
D3 dhanalakshmiD3 dhanalakshmi
D3 dhanalakshmi
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
Machine Translation
Machine TranslationMachine Translation
Machine Translation
 
Machine Translation System: Chhattisgarhi to Hindi
Machine Translation System: Chhattisgarhi to HindiMachine Translation System: Chhattisgarhi to Hindi
Machine Translation System: Chhattisgarhi to Hindi
 

Ähnlich wie SMT3

Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
Shashank Shisodia
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Waqas Tariq
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
ijnlc
 

Ähnlich wie SMT3 (20)

ReseachPaper
ReseachPaperReseachPaper
ReseachPaper
 
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
Artificially Generatedof Concatenative Syllable based Text to Speech Synthesi...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
 
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
 
E-Translation
E-TranslationE-Translation
E-Translation
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes...
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes...Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes...
Approach To Build A Marathi Text-To-Speech System Using Concatenative Synthes...
 
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
A New Approach: Automatically Identify Naming Word from Bengali Sentence for ...
 
Shallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliteratorShallow parser for hindi language with an input from a transliterator
Shallow parser for hindi language with an input from a transliterator
 
Machine Translation Approaches and Design Aspects
Machine Translation Approaches and Design AspectsMachine Translation Approaches and Design Aspects
Machine Translation Approaches and Design Aspects
 
How to Translate from English to Khmer using Moses
How to Translate from English to Khmer using MosesHow to Translate from English to Khmer using Moses
How to Translate from English to Khmer using Moses
 
Ey4301913917
Ey4301913917Ey4301913917
Ey4301913917
 
NLP_KASHK: Introduction
NLP_KASHK: Introduction NLP_KASHK: Introduction
NLP_KASHK: Introduction
 
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text EditorDynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
CLUE-Aligner: An Alignment Tool to Annotate Pairs of Paraphrastic and Transla...
 
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
CONSTRUCTION OF ENGLISH-BODO PARALLEL TEXT CORPUS FOR STATISTICAL MACHINE TRA...
 

SMT3

  • 1. Machine Translation Version 3 Submitted By- Khyati Gupta, Rakhi Sharma Submitted On- 22 Jul 2015
  • 2. What is Translation Process of converting text from one language to another, so that the original message is retained in target language. Source Language = langauge whose text is to be translated. Target Language = langauge in which the text is translated.
  • 3. Who can translate text Humans Perfect Translations Very Expensive Hard to Find (Require Knowledge of both languages) Machines Near Perfect in Domains Less Expensive as compared to Humans Can be found at a click of a button
  • 4. What is machine translation? •Machine translation is automated translation or “translation carried out by a computer, the first suggestions concerning MT were made by the Russian Smirnov- Troyansky and the Frenchman G.B in 1930's •the first serious discussions were begun in 1946 by the mathematician Warren Weaver. •Globalization Create the need of Machine Translation
  • 5. MT IN INDIAN LANGUAGES • Need –  India is a highly multilingual country with eighteen constitutionally recognized  languages and several hundred dialects & other living languages.  Even though, English is understood by less than 3% of Indian population,  it continues to be the de-facto link language for administration, education  and business.  Hindi, which is official language of the country, is used by more than  400 million people. MAT in Indian language-  AnglaHindi  Anusaaraka in IIT Kanpur and IIIT Hyderabad.  Mantra  AnglaBharti
  • 6. Core Challenges of MT Ambiguity: Human languages are highly ambiguous, and differently in different languages Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms Examples- The word 'light',can mean not very heavy or not very dark lexically ambiguous.
  • 7. Word Level (Lexical Semantics) => Lexical Semantic Ambiguity कलम मैंने नीली कलम से ललखा -- लेखनी वह कलम द्वारा पत्थर पर राम ललख रहा है -- औज़ार Please Please can I come in I am very pleased with your work
  • 8. Semantic Level – • This occurs when the meaning of the words themselves can be misinterpreted. Iraqi head seeks arm The word head can be a body part or can be a chief of some nation. Similarly, arms can a body part or can be a plural of weapons
  • 9. मेरा नाम ननशीथ जोशी है My name is Nisheeth Joshi Give-information+personal-data (name=ननशीथ_जोशी) [s [vp accusative_pronoun “नाम” proper_name]] [s [np [possessive_pronoun “name”]] [vp “be” proper_name]] Direct Transfer Interlingua Analysis Generation Approaches to MT: Vaquois MT Triangle
  • 10. Direct Approaches No intermediate stage in the translation First MT systems developed in the 1950’s-60’s (assembly code programs) Morphology, bi-lingual dictionary lookup, local reordering rules “Word-for-word, with some local word-order adjustments” Modern Approaches: EBMT and SMT
  • 11. Example based machine translation  Example based machine Translation is based on recalling /findindg analogous examples(of the language pair).  The basic idea of Example-Based Machine Translation (EBMT) is to reuse examples of already existing translations as the basis for for new translation.  An Example based Machine Translation system is given a set of sentence in the source language(from which one is translating) and the corresponding translation of each sentence in the target language with point to point mapping
  • 12. EBMT basis termlogy database of translation pairs(Translation memory) match input against example database Or existing examples(like Translation Memory) identify corresponding translation fragments (align)and then recombine fragment into target text
  • 13. EBMT PARADIGM New Sentence (Source) Yesterday, 200 delegates met with Prime Minister. Matches to Source Found Yesterday, 200 delegates met behind closed doors… Difficulties with Prime Minister… कल २०० अनिथथ बंद दरवाजों के पीछें लमले… प्रधानमंत्री के साथ कठनाईयों… Alignment (Sub-sentential) Translated Sentence (Target) कल, २०० अतिथि प्रधानमंत्री के साि ममले Yesterday, 200 delegates met behind closed doors… Difficulties with Prime Minister over… कल २०० अतिथि बंद दरवाजों के पीछें ममले… प्रधानमंत्री के साि कठनाईयों पर…
  • 14. What is Statistical Machine Translation? It was introduced in early 1990s by researchers at IBM's Thomas J. Watson Research Center Goal is to produce a target sentence from a source sentence that maximizes the probability In statistical machine translation (SMT), translation systems are trained on large quantities of parallel data (from which the systems learn how to translate small segments), monolingual data (from which the systems learn what the target language should look like). Phrase-based SMT (Koehn et al. 2003) has emerged as the dominant paradigm in machine translation research. Advantages: Can quickly train for new languages Can adopt to new domains Problems: Need parallel data All words, even punctuation, are equal Difficult to pin-point the causes of errors
  • 15. SMT
  • 16. SMT A document is translated according to the probability distribution p(e/f) that a string e in the target language (for example, English) is the translation of a string f in the source language . SMT translations are generated on the basis of statistical models whose parameters are derived from the analysis of text corpora [3]. Statistical MT system is modeled as three separate parts: language model translation model decoder
  • 17. We apply Bayesian approach for this- ê = argmax{ p(e | f) } = argmax{ p(e) p(f | e) } Language model(LM):assigns a probability to any target string of words {P(e)} an LM probability distribution over strings S that attempts to reflect how frequently a string S occurs as a sentence. Translation model(TM): assigns a probability to any pair of target and source strings {P(f|e)} Decoder: determines translation based on probabilities of LM & TM
  • 18. Translation Models 1.Word Based Model- •the fundamental unit of translation is a word Aligns one word of source language with one word of the target language. Disadvantage-, the number of words in translated sentences are different, because of compound words, morphology and idioms. 2.Phrase-based translation translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases, but typically are not linguistic phrases, but phrasemes found using statistical methods from corpora It has been shown that restricting the phrases to linguistic phrases (syntactically motivated groups of words, see syntactic categories) decreases the quality of translation.  वह बाजार जा रहा है  He is going to the market
  • 19. 3. Syntax-based translation Syntax-based translation is based on the idea of translating syntactic units, rather than single words or strings of words synchronous context-free grammars.-modeling the reordering of clauses that occurs when translating a sentence by correspondences between phrase- structure rules in the source and target languages.
  • 20. 4.Hierarchical phrase-based translation Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation. It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation without reference to linguistically motivated syntactic constituents
  • 21. What is Moses? It is an open source toolkit Toolkit for (SMT)Statistical Machine Translation Moses is under LGPL license Moses distribution uses external open source tools  word alignment: giza++, mgiza, BerkeleyAligner, FastAlign  language model: srilm, irstlm, randlm, kenlm  scoring: bleu, ter, meteor GIZA++ It is used for making word-alignments This toolkit is an implementation of the original IBM Models that started machine translation research.  SRILM- It is used for language modeling
  • 22. Other Open Source MT Systems • Joshua — Johns Hopkins University http://joshua.sourceforge.net/ • CDec — University of Maryland http://cdec-decoder.org/ • Jane — RWTH Aachen http://www.hltpr.rwth-aachen.de/jane/ • Phrasal — Stanford University http://nlp.stanford.edu/phrasal/ • Very similar technology – Joshua and Phrasal implemented in Java, others in C++ – Joshua supports only tree-based models – Phrasal supports only phrase-based models • Open sourcing tools increasing trend in NLP research
  • 23. HISTORY OF MOSES • 2005 Hieu Hoang (then student of Philipp Koehn) starts Moses as successor to Pharoah • 2006 Moses is the subject of the JHU workshop, first check-in to public repository • 2006 Start of Euromatrix, EU project which helps fund Moses development • 2007 First machine translation marathon held in Edinburgh • 2009 Moses receives support from EuromatrixPlus, also EU-funded • 2010 Moses now supports hierarchical and syntax-based models, using chart decoding • 2011 Moses moves from sourceforge to github, after over 4000 sourceforge check-ins • 2012 EU-funded MosesCore launched to support continued development of Moses
  • 24. Moses Translation Process It involves Segmenting the source sentence into source phrases Translating each source phrase into a target phrase & optionally reordering the target phrases into a target sentence.  वह बाजार जा रहा है  He is going to the market • Foreign input is segmented in phrases • Each phrase is translated into English • Phrases are reordered
  • 25. WHAT DOES MOSES DO? Hindi Eng Parallel Corpus Moses Training SMT Model Moses.ini Target lang(mono. hi) Lang. Model Decoder Source Sentence Target Sentence
  • 26. WORKFLOW FOR BUILDING A PHRASE BASED SMT SYSTEM • Corpus preparation: Train, Tune and Test split • Pre-processing: Normalization, tokenization, etc. • Training: Learn Phrase tables from Training set • Tuning: Learn weights of discriminative model on • Tuning set • Testing: Decode Test set using tuned data • Post-processing: regenerating case, re-ranking • Evaluation: Automated Metrics or human evaluation
  • 27. Components of Moses Training pipeline-It consist of collection of tools (mainly written in perl, with some in C++) which take the raw data (parallel and monolingual) and turn it into a machine translation model Decoder-single C++ application which, given a trained machine translation model and a source sentence, will translate the source sentence into the target language. a variety of contributed tools and utilities like GIZA++ and SRILM
  • 28. Training in Moses 1. Prepare data Training data has to be provided sentence aligned in two files, one for the foreign sentences, one for the English sentences The parallel corpus has to be converted into a format that is suitable to the GIZA++ toolkit. Two vocabulary files are generated and the parallel corpus is converted into a numberized format. The vocabulary files contain words, integer word identifiers and word count information 2. Run GIZA++ We need it as a initial step to establish word alignments.
  • 29. 3. Align words To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. 4. Get lexical translation table  estimate a maximum likelihood lexical translation table. 5. Extract phrases -all phrases are dumped into one big file 6. Score phrases -estimate the phrase translation probability (ejf) 7. Build lexicalized reordering model 8. Build generation models-The generation model is build from the target side of the parallel corpus. 9. Create Configuration File-As a final step, a configuration file for the decoder is generated with all the correct paths for the generated model and a number of default parameter settings
  • 30. The Decoder The job of the Moses decoder is to find the highest scoring sentence in the target language (according to the translation model) corresponding to a given source sentence. The decoder is written in a modular fashion and allows the user to vary the decoding process in various ways, such as: Input: This can be a plain sentence, or annotated xml-like elements or complex structure like a lattice or confusion network Translation model: This can use phrase-phrase rules, or hierarchical (perhaps syntactic) rules. Decoding algorithm: Decoding is a huge search problem, generally too big for exact search, and Moses implements several different strategies for this search, such as stack-based, cube-pruning, chart parsing etc. Language model: Moses supports several different language model toolkits (SRILM, KenLM, IRSTLM, RandLM) each of which has there own strengths and weaknesses, and adding a new LM toolkit is straightforward. Translation Model:Uses phrase or hierarchical based models.It can be supplemented with features to add extra information to the translation process.
  • 31. Decoder Language Models works with the following language models: SRI language model- SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation It has been under development in the SRI Speech Technology and Research Laboratory since 1995. IRST language model- The IRST Language Modeling Toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM toolkit handles LM formats which permit to reduce both storage and decoding memory requirements, and to save time in
  • 32. RandLM build the largest LMs possible (for example, a 5-gram trained on one hundred billion words ). It represents LMs using a randomized data structure This can result in LMs that are ten times smaller than those created using the SRILM (and also smaller than IRSTLM),  but at the cost of making decoding about four times slower. It is multithreaded . KenLM is a language model that is simultaneously fast and low memory. The probabilities returned are the same as SRI, up to floating point rounding.  It is maintained by Ken Heafield, who provides additional information on his website. KenLM is distributed with Moses and compiled by default. KenLM is fully thread-safe for use with multi-threaded Moses. KenLM is included by default in moses
  • 33. Contributed Tools Moses Server- provides an xml-rpc interface to the decoder Web translation- set of scripts to translate webpage Analysis tools- scripts to enable and analyze the visualization of Moses output
  • 34. Moses Platform Primary development platform for Moses is Linux. & recommended platform is Linux since it is easier to get support for it. However it works on other platforms also.
  • 35. Moses Releases Moses 1.0 (28th Jan 2013) Moses 0.91 (12th Oct 2012)
  • 36. Work Done Aim-To collect parallel corpus of English/Hindi Step 1. Data collection-Data is in pdf form 1. Bilingual data 2. English text available in pdf can easily transfer from one format to another format. .Step 2. Problem - Hindi text cannot be directly copied because of different font’s style Solution We will change the format of data. We are converting pdf file in jpeg form But sometimes when we check the translation of some words are incorrect in translation .Then we collect errors, noted own and try to correct them. Therefore we perform Optical character recognition. No of pdfs files given-9 No of pdf on which OCR is performed-6 Total phrases updated-200 No of files uploaded-4 Duplicacy-5-10 phrases OCR of P. Chidambaram bhasan.
  • 37. MOSES INSTALLATION 1. Download Moses- Moses is downloaded from github. 2. Basic SetUp- To compile Moses, you need the following installed on your machine: g++,Boost 3.After installing these, we need to download and install moses. git clone https://github.com/moses-smt/mosesdecoder.git cd mosesdecoder/ ./bjam -j4
  • 38. MANUALLY INSTALLING BOOST • wget http://downloads.sourceforge.net/project/boost/boost/1.55.0/boost_1_55_0.tar.gz • tar zxvf boost_1_55_0.tar.gz • cd boost_1_55_0/ • ./bootstrap.sh –prefix=/home/angla/SMT/BOOST_HOME • ./b2 -j4 --prefix=$PWD --libdir=$PWD/lib64 --layout=system link=static install || echo FAILURE • Once boost is installed, you can then compile Moses. However, you must tell Moses where boost is with the --with-boost flag. • ./bjam --with-boost=/home/angla/SMT/boost_1_55_0 -j4
  • 39. OTHER SOFTWARE TO INSTALL • Word Alignment- we have installed MGIZA because it is multi-threaded and give general good result. • Language Model Creation- • Moses includes the KenLM language model creation program. • We have used IRSTLM Language model. • Other software's- su yum install [package name] • Packages: • git • subversion • automake • libtool • gcc-c++ • zlib-devel • python-devel • bzip2-devel
  • 40. INSTALLING IRSTLM • IRSTLM is a language modelling toolkit from FBK • install IRSTLM- • tar zxvf irstlm-5.80.03.tgz • cd irstlm-5.80.03 • sh regenerate-makefiles.sh • ./configure --prefix=/home/angla/SMT/IRSTLM_HOME • make install
  • 41. INSTALLING MGIZA • Download mgiza from https://github.com/moses-smt/mgiza • Build mgiza- • cd mgiza/mgizapp • cmake. • make • make install Note-we need to build boost first. Compiling Moses- ./bjam –with-irstlm=/home/angla/SMT/IRSTLM_HOME / –with-boost=/home/angla/SMT/boost_1_55_0
  • 42. CORPUS PREPARATION • To train a translation system we need parallel data (text translated in Hindi and English) which is aligned at the sentence level. • To prepare the data for training the translation system, we have to perform the following steps: • tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation. • truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity. • cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.
  • 43. TOKENIZATION • ~/SMT/mosesdecoderv211/scripts/tokenizer/tokenizer.perl -l en < ~/SMT/corpus/training/train.en > ~/SMT/corpus/train.tok.en • ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/ SMT/corpus/training/train.hi > ~/ SMT/corpus/training/train.tok.hi TRUECASING • ~/SMT/mosesdecoderv211/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.en --corpus <~/SMT/corpus/train.tok.en>~/SMT/corpus/train.true.en • ~/SMT/mosesdecoderv211/scripts/recaser/train-truecaser.perl --model ~/corpus/truecase-model.fr --corpus <~/corpus/train.tok.hi>~/SMT/corpus/train.true.hi Cleaning • ~/SMT/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/train.true en hi ~/SMT/corpus/train.clean 1 80
  • 44. LANGUAGE MODEL TRAINING • The language model (LM) is used to ensure fluent output, so it is built with the target language (i.e English in this case) • The language model should be trained on a corpus that is suitable to the domain. If the translation model is trained on a parallel corpus, then the language model should be trained on the output side of that corpus.We do 3-gram language model, removing singletons, smoothing with improved Kneser-Ney, and adding sentence boundary symbols • mkdir ~/SMT/lm • cd ~/lm • ~/SMT/IRSTLM_HOME/bin/add-start-end.sh < ~/SMT/corpus/train.true.en > train.sb.en • export IRSTLM=$HOME/angla/SMT/irstlm;
  • 45. ~/irstlm/bin/compile-lm --text=yes train.lm.en.gz train.arpa.en Binary Language Models This format can be properly managed through the compile-lm command in order to produce a compiled version or a standard ARPA version of the LM. Building Huge Language Models LM estimation starts with the collection of n-grams and their frequency counters. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are possibly pruned and, finally, a LM file is created containing n-grams with probabilities and back-off weights. We use the script build-lm.sh ~/irstlm/bin/build-lm.sh -I train.sb.e n -t ./tmp -p -s improved-kneser-ney -o train.lm.en The script builds a 3-gram LM (option -n) from the specified input command (-i), by splitting the training procedure into 10 steps (-k). The LM will be saved in the output (-o) file train.irstlm.gz with an intermediate ARPA format.
• 46. TRAINING THE TRANSLATION SYSTEM
• To do this we run word alignment (using MGIZA), phrase extraction and scoring, create the lexicalised reordering tables and create the Moses configuration file, all with a single command:
mkdir ~/working
cd ~/working
nohup nice ~/SMT/mosesdecoderv211/scripts/training/train-model.perl -root-dir train -corpus ~/SMT/corpus/train.clean -f hi -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:$HOME/SMT/lm/train.blm.en:8 -mgiza -external-bin-dir ~/SMT/mosesdecoderv211/tools >& training.out &
• Once it has finished, there should be a moses.ini file in the directory ~/working/train/model.
• 47. STEP 1
• Create a parallel corpus in one-sentence-per-line format.
Step 2
• Run plain2snt.out, located within the GIZA++ package:
./plain2snt.out train-en train-hi
• Files created by plain2snt:
train-en.vcb
train-hi.vcb
train-en-train-hi.snt
• 48. FILES CREATED BY plain2snt
• train-en.vcb consists of: each word from the English corpus, its corresponding frequency count, and a unique id for each word.
• train-hi.vcb consists of: each word from the Hindi corpus, its corresponding frequency count, and a unique id for each word.
• train-en-train-hi.snt consists of: each sentence pair from the parallel English and Hindi corpora, with every word replaced by its unique id.
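• An illustrative (entirely hypothetical) fragment of the two formats; the words, ids and counts below are made up:
train-en.vcb (id, word, frequency):
2 the 4503
3 house 120
train-en-train-hi.snt (each sentence pair is three lines: pair count, source sentence as ids, target sentence as ids):
1
2 3
7 12 9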
• 49. CREATE MKCLS FILES NEEDED FOR GIZA++
• Run mkcls, which is not located within the GIZA++ package:
mkcls -ptrain-en -Vtrain-en.vcb.classes
mkcls -ptrain-hi -Vtrain-hi.vcb.classes
• Files created by mkcls:
train-en.vcb.classes, train-en.vcb.classes.cats, train-hi.vcb.classes, train-hi.vcb.classes.cats
• The .vcb.classes files contain an alphabetical list of all words (including punctuation) and each word's corresponding class.
• The .vcb.classes.cats files contain a list of classes and, for each class, the set of words belonging to it.
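• With the .snt, .vcb and class files in place, GIZA++ itself can be run. A minimal sketch, with flags as commonly used in GIZA++ tutorials (the output prefix and output directory are assumptions):
mkdir giza-out
GIZA++ -S train-en.vcb -T train-hi.vcb -C train-en-train-hi.snt -o en-hi -outputpath giza-out
(GIZA++ looks for the corresponding .classes files alongside the vocabulary files)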
• 50. TUNING
• During decoding, Moses scores translation hypotheses using a linear model.
• In the traditional approach, the features of the model are the probabilities from the language models, phrase/rule tables and reordering models, plus word, phrase and rule counts. Recent versions of Moses support the augmentation of these core features with sparse features, which may be much more numerous.
• Tuning refers to the process of finding the optimal weights for this linear model, where optimal weights are those which maximise translation performance on a small set of parallel sentences (the tuning set).
• Translation performance is usually measured with BLEU, but the tuning algorithms all support (at least in principle) the use of other performance measures.
• Minimum Error Rate Training (MERT), a line-search based method, is probably still the most widely used tuning algorithm and the default option in Moses.
• 51. TUNING COMMANDS
• Tuning requires a small amount of parallel data, separate from the training data:
cd ~/working
nohup nice ~/SMT/mosesdecoderv211/scripts/training/mert-moses.pl ~/SMT/corpus/news-test2008.true.hi ~/SMT/corpus/news-test2008.true.en ~/SMT/mosesdecoderv211/bin/moses train/model/moses.ini --mertdir ~/SMT/mosesdecoderv211/bin/ &> mert.out &
• The end result of tuning is an ini file with trained weights, which should be in ~/working/mert-work/moses.ini.
• 52. TESTING
• Run the decoder with the tuned configuration:
~/SMT/mosesdecoderv211/bin/moses -f ~/working/mert-work/moses.ini
• In order to make it start quickly, we can binarise the phrase table and lexicalised reordering models. To do this, create a suitable directory and binarise the models as follows:
mkdir ~/working/binarised-model
cd ~/working
~/SMT/mosesdecoderv211/bin/processPhraseTableMin -in train/model/phrase-table.gz -nscores 4 -out binarised-model/phrase-table
• In order to build processPhraseTableMin, Moses must be compiled with CMPH (next slide).
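• Once the phrase table and reordering table are binarised, a copy of the tuned moses.ini can be pointed at them and a tokenised, truecased test file decoded in batch. A sketch (the test file name is an assumption):
cp ~/working/mert-work/moses.ini ~/working/binarised-model/
(edit binarised-model/moses.ini so the phrase-table and reordering-table paths point at the binarised files, then:)
~/SMT/mosesdecoderv211/bin/moses -f ~/working/binarised-model/moses.ini < test.true.hi > test.translated.en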
• 53. COMPILING MOSES WITH CMPH
• Download CMPH from http://sourceforge.net/projects/cmph/
• Install CMPH:
cd cmph-2.0
./configure    (configures the package for the system)
make check
make install
• Compile Moses with CMPH:
cd mosesdecoderv211
./bjam --with-cmph=~/SMT/cmph-2.0
• 54. BINARISING THE REORDERING TABLE
~/SMT/mosesdecoderv211/bin/processLexicalTableMin -in train/model/reordering-table.wbe-msd-bidirectional-fe.gz -out binarised-model/reordering-table
• ERROR: "Not in correct format!"
• 55. PROBLEMS FACED DURING INSTALLATION
• Problem: ./bjam requires Boost 104400 (1.44) but we have 104100 (1.41).
• Reason: Boost was not installed to a specific folder.
• Solution: install Boost (see the build sketch after slide 41).
• 56. LINUX COMMANDS
1. tar command: deals with tape-drive backups. The tar command is used to pack a collection of files and directories into a highly compressed archive file:
tar -cvf tecmint-14-09-12.tar /home/tecmint/
The options used in the above command:
c – creates a new .tar archive file.
v – verbosely shows the .tar file progress.
f – file name of the archive file.
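• The matching extraction commands (the first archive name is carried over from above; the second is the IRSTLM archive from slide 40):
tar -xvf tecmint-14-09-12.tar
tar -zxvf irstlm-5.80.03.tgz
x – extracts files from the archive; z – filters the archive through gzip, for .tgz/.tar.gz files.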
• 57.
2. ls command: "ls" stands for List (directory contents); it lists the contents of the folder it is run from, whether files or folders.
3. uname command: "uname" stands for Unix Name; it prints detailed information about the machine name, operating system and kernel.
4. history command: shows the history of commands executed in the shell.
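• Typical invocations of the three commands:
ls -l        (long listing with permissions, owner and size)
uname -a     (print all available system information)
history 10   (show the last ten commands)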
• 58.
5. apt-get command: performs installation of new software packages, removal of existing packages, upgrading of existing packages, and can even upgrade the entire operating system.
apt-cache pkgnames
6. grep command: search for a given string in a file (case-insensitive search):
$ grep -i "the" demo_file
7. vim command: go to the 143rd line of a file:
$ vim +143 filename.txt
8. cd command: use "cd -" to toggle between the last two directories.
• 59.
9. free command: displays the free, used and swap memory available in the system.
10. cat command: view multiple files at the same time. The following example prints the content of file1 followed by file2 to stdout:
$ cat file1 file2
11. chmod command: change the permissions of a file or directory. Give full access (read, write and execute) to user and group on a specific file:
$ chmod ug+rwx file.txt
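• chmod also accepts octal modes; for example, the following gives the user rwx (7), the group rw- (6) and others r-- (4):
$ chmod 764 file.txt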
• 60. IMPORTANCE OF MOSES
• Moses is installable software, unlike online-only translation systems.
• Online systems cannot be trained on our own data.
• There is also a privacy problem if you have to translate sensitive information.
• 61. CONCLUSION
• Moses is an open-source toolkit, so users can modify and customize it to their own needs and requirements.
• 62. REFERENCES
• Harold Somers, paper on Machine Translation
• www.statmt.org/moses
• Philipp Koehn et al., "Moses: Open Source Toolkit for Statistical Machine Translation"
• Philipp Koehn, "Statistical Machine Translation"
• R.M.K. Sinha, "AnglaHindi: An English to Hindi Machine-Aided Translation System"
