Project Presentation on
Experiment with Different Models of Statistical Machine Translation
Submitted by:
Khyati Gupta (14483)
Rakhi Sharma (14514)
Contents
 Problem Statement
 Objective
 About the project
 Flow chart
 Work done
 Conclusion
 Future work
 References
Problem Statement
• Machine Translation has been a popular research field since the 1990s.
• But little work has been done on Indian languages, as the current state of the art is quite bleak due to sparse data resources.
• The success of an SMT system depends on the availability of a large parallel corpus.
• Such data is necessary to reliably estimate translation probabilities.
• We have worked on Hindi-to-English translation.
Objective
The objectives of our thesis are:
• Work on different models of Statistical Machine Translation.
• Report the results obtained.
• The SMT models studied are:
[Diagram: SMT → phrase-based and tree-based; tree-based → hierarchical and syntax-based; syntax-based → string-to-tree and tree-to-string]
Introduction
What is Translation?
The process of converting text from one language to another so that the original message is retained in the target language.
Source language = the language whose text is to be translated.
Target language = the language into which the text is translated.
What is machine translation?
Machine translation is automated translation, or "translation carried out by a computer." It is a Natural Language Processing task that uses a bilingual data set and other language assets to build the language and phrase models used to translate text from the source language into the target language.
About the Project
• Study the basics of SMT.
• Installation of Moses, IRSTLM and MGIZA.
• Study various models of SMT: the phrase-based, syntax-based and hierarchical models.
• Creation of a parallel corpus.
• Experiment with translation from Hindi to English using different models of SMT.
• Conversion of the parser's output into Moses format.
• Report results on the basis of the scores obtained.
• Evaluate the best model of SMT for a given corpus.
Flowchart of SMT
Bayesian Approach
• We apply the Bayesian (noisy-channel) approach:
• Language model (LM): assigns a probability P(e) to any target string of words.
• The LM is a probability distribution over strings that attempts to reflect how frequently a string S occurs as a sentence.
• Translation model (TM): assigns a probability P(f|e) to any pair of target and source strings.
• Decoder: determines the translation based on the probabilities of the LM and TM:
argmax_e P(e|f) = argmax_e P(f|e) P(e)
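A minimal sketch of this decision rule in Python, assuming toy hand-set log-probability tables; a real decoder searches an enormous hypothesis space rather than scoring a fixed candidate list:

```python
import math

# Hypothetical model scores for one Hindi input and two English candidates.
tm = {("पान दो", "give a betel"): math.log(0.4),     # P(f|e)
      ("पान दो", "give betel two"): math.log(0.5)}
lm = {"give a betel": math.log(0.01),                # P(e)
      "give betel two": math.log(0.001)}

def best_translation(f, candidates):
    # argmax_e [ log P(f|e) + log P(e) ]
    return max(candidates, key=lambda e: tm[(f, e)] + lm[e])

print(best_translation("पान दो", ["give a betel", "give betel two"]))
# -> "give a betel": the stronger LM score outweighs the slightly weaker TM score
```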
Language Model
• A language model computes the probability of a sentence.
• Goal of the language model: detect good English.
• SMT uses an n-gram approach to compute the LM probability.
• The probability of a sentence is the product of the conditional probabilities of its component words.
• The probability of each word is calculated given its preceding words.
• Likelihood of a sentence (bigram approximation):
P(S) = P(w1) × P(w2|w1) × … × P(wn|wn-1)
• Example illustrating bigram model-
P(the barking dog) = P(the|<start>)P(barking|the)P(dog|barking)
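A minimal sketch of this bigram estimation in Python, with a hypothetical two-sentence corpus; a real LM toolkit such as IRSTLM (used in this project) additionally applies smoothing so that unseen n-grams do not receive zero probability:

```python
import math
from collections import Counter

# Hypothetical toy corpus; <s> and </s> mark sentence boundaries.
corpus = ["<s> the barking dog </s>", "<s> the dog barks </s>"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words[:-1])           # history counts (exclude final </s>)
    bigrams.update(zip(words, words[1:]))

def sentence_logprob(sentence):
    words = ["<s>"] + sentence.split() + ["</s>"]
    # log P(S) = sum_i log P(w_i | w_{i-1}), estimated by relative frequency
    return sum(math.log(bigrams[(h, w)] / unigrams[h])
               for h, w in zip(words, words[1:]))

print(sentence_logprob("the barking dog"))  # log(1 * 1/2 * 1 * 1/2)
```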
Translation Models
P(f|e) is called the translation model. It is used to give better scores to accurate and complete translations. It is trained on bilingual Hindi-English parallel data.
Approaches to translation modelling are:
1. Phrase-based translation
• The translated sequences of words are called blocks or phrases; these are typically not linguistic phrases, but phrasemes found using statistical methods from corpora.
2. Hierarchical phrase-based translation
• Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation.
• It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation, without reference to linguistically motivated syntactic constituents.
3. Syntax-based model
• The syntax model works on the syntactic categories of words and uses a CFG grammar.
Decoding
• The task of decoding in machine translation is to find the best-scoring translation according to these formulae.
• Given a Hindi sentence f, the decoder finds the English yield of the single best derivation that has Hindi yield f.
• The phrase-based model uses a beam-search algorithm (a minimal sketch follows).
• Tree-based models use chart decoding.
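As an illustration, here is a minimal monotone (no-reordering) beam-search sketch, assuming a hypothetical toy phrase table and a stand-in LM; Moses additionally handles reordering, future-cost estimation and log-linear feature weights:

```python
from heapq import nlargest

# Hypothetical Hindi -> English phrase options with log TM scores.
phrase_table = {
    ("मेरे", "दोस्त"): [("my friend", -0.5)],
    ("के", "लिए"): [("for", -0.4)],
    ("पान",): [("a betel", -0.9)],
    ("दो",): [("give", -0.8)],
}

def lm_score(phrase):
    return -0.1 * len(phrase.split())  # stand-in for a real n-gram LM

def decode(source, beam_size=5, max_phrase=2):
    # beams[k] holds the best hypotheses covering the first k source words
    beams = {0: [(0.0, "")]}
    for start in range(len(source)):
        for score, eng in beams.get(start, []):
            for end in range(start + 1, min(start + max_phrase, len(source)) + 1):
                for tgt, tm in phrase_table.get(tuple(source[start:end]), []):
                    hyp = (score + tm + lm_score(tgt), (eng + " " + tgt).strip())
                    beams.setdefault(end, []).append(hyp)
        for k in beams:                        # prune every beam
            beams[k] = nlargest(beam_size, beams[k])
    return max(beams.get(len(source), [(float("-inf"), "")]))[1]

print(decode("मेरे दोस्त के लिए पान दो".split()))
# -> "my friend for a betel give": monotone decoding keeps Hindi word order
```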
System Overview
Work Done
Data Pre-Processing Flowchart
[Flowchart: Sources (PDF) → conversion of PDF to JPEG → optical character recognition (using Indisenz) → bilingual text alignment (using Microsoft Aligner)]
Corpus Preparation
To prepare the data for training the translation system, we have to
perform the following steps:
• Tokenisation: This means that spaces have to be inserted between
(e.g.) words and punctuation.
• Truecasing: The initial words in each sentence are converted to
their most probable casing. This helps reduce data sparsity.
• Cleaning: Long sentences and empty sentences are removed as
they can cause problems with the training pipeline, and obviously
misaligned sentences are removed.
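A minimal sketch of the cleaning step in Python, assuming one sentence per line in two aligned lists; Moses also ships a clean-corpus-n.perl script that performs essentially the same filtering:

```python
def clean_parallel(hi_lines, en_lines, min_len=1, max_len=80, max_ratio=9.0):
    """Keep only sentence pairs within length limits and a sane length ratio."""
    kept = []
    for hi, en in zip(hi_lines, en_lines):
        nh, ne = len(hi.split()), len(en.split())
        if not (min_len <= nh <= max_len and min_len <= ne <= max_len):
            continue  # drop empty and overly long sentences
        if max(nh, ne) > max_ratio * max(min(nh, ne), 1):
            continue  # drop obviously misaligned pairs
        kept.append((hi, en))
    return kept
```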
Training in Moses
1. Prepare data
• Training data has to be provided sentence aligned in two files, one
for the foreign sentences, one for the English sentences
• The parallel corpus has to be converted into a format that is suitable
to the GIZA++ toolkit.
• Two vocabulary files are generated and the parallel corpus is
converted into a numberized format.
• The vocabulary files contain words, integer word identifiers and
word count information.
2. Run GIZA++
• GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments.
[Word-alignment example: "मेरे दोस्त के लिए पान दो" aligned with "Give a betel for my friend"]
3. Align words
• To establish word alignments based on the two GIZA++ alignments, a
number of heuristics may be applied.
4. Get lexical translation table
Estimate a maximum-likelihood lexical translation table.
We estimate the w(e|f) as well as the inverse w(f|e) word translation table (sketched below).
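A minimal sketch of this maximum-likelihood estimation, assuming word alignments are given as (Hindi index, English index) pairs as produced by GIZA++; the inverse table w(f|e) is estimated analogously by conditioning on the English word:

```python
from collections import Counter

def lexical_table(aligned_corpus):
    """aligned_corpus: iterable of (hindi_words, english_words, alignment_points)."""
    joint, f_count = Counter(), Counter()
    for f_words, e_words, alignment in aligned_corpus:
        for fi, ei in alignment:
            joint[(f_words[fi], e_words[ei])] += 1
            f_count[f_words[fi]] += 1
    # w(e|f) = count(f, e) / count(f)
    return {(f, e): c / f_count[f] for (f, e), c in joint.items()}
```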
5. Extract phrases: all phrases are dumped into one big file.
6. Score phrases: estimate the phrase translation probability φ(e|f), e.g.
जहानाबाद *दरभंगा ||| darbhanga* navada* ||| 1 1 1 1 ||| 0-0 1-1 ||| 1 1
7. Build lexicalized reordering model
Moses uses lexicalized reordering models for reordering.
8. Build generation models
The generation model is built from the target side of the parallel corpus.
9. Create configuration file
As a final step, a configuration file for the decoder is generated, with all the correct paths for the generated models and a number of default parameter settings.
Tuning
• Once training is over, the parameters of the log-linear model have to be tuned to avoid overfitting on the training data and to produce the most desirable translations on any test set. This process is called tuning. The basic assumption behind tuning is that the model must be tuned according to the evaluation technique.
• That is why the standard tuning technique is known as Minimum Error Rate Training (MERT).
Working of the Models
1. Working of the Phrase-Based Model
• The Hindi sentence is first broken down into phrases based on statistics drawn from parallel corpora.
• These Hindi phrases are then translated into English phrases.
• The translated English phrases are reordered.
2. Working of the Hierarchical Model
• All the phases performed by Moses in the hierarchical model are the same as in the phrase-based model, but the rule extraction of the hierarchical model differs from that of phrase-based SMT.
It includes:
 Data preparation
• Tokenisation
• Truecasing
• Cleaning
 Training
• Word alignment
• Rule extraction
• Glue rules
• Phrase extraction with the phrase extraction table
• Reordering model
• Language modelling
 Decoding
 Tuning
 BLEU score
Advantage of Hierarchical Model
• Hierarchical MT replaces the redundant rules used in phrase-based MT with single rules.
• It also overcomes a limitation of other models: it does not require an annotated corpus at all, or it generates annotations automatically.
• We are working on Hindi-to-English translation: English already has annotated data, and Hindi will be automatically annotated by the hierarchical model.
• The grammar used is known as a synchronous context-free grammar.
Synchronous Context Free Grammar
• An SCFG is a kind of context-free grammar that generates pairs of strings.
• Example: S → ⟨I, मैं⟩
• This rule translates 'I' in English to मैं in Hindi.
• This rule consists of terminals only, but rules may also consist of terminals and non-terminals, as described below:
• VP → ⟨V1 NP2, NP2 V1⟩
Rule Extraction with SCFG
• The hierarchical model not only reduces the size of the grammar,
 it also uses the same rules for parsing as well as for translation.
Steps performed in rule extraction:
• In the hierarchical model, intervening words can be separated out; these are replaced by the non-terminal X.
• Synchronization is required between sub-phrases. This model does not require a parser on the Hindi side, because all phrases are labelled as X.
This allows us to build useful translation rules such as
X → ⟨X1 का X2, X2 of X1⟩
• Some examples:
• भारत का प्रधान मंत्री → Prime Minister of India
• जापान का प्रधान मंत्री → Prime Minister of Japan
• चीन का वित्त मंत्री → Finance Minister of China
• भारत का राष्ट्रीय पक्षी → National bird of India
• The phrase-based model memorises all these phrases, yet essentially all of them share the same structure, X1 का X2 → X2 of X1, where, for instance, X1 is India or "भारत" and X2 is Prime Minister or "प्रधान मंत्री". A sketch of this rule in action follows.
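A minimal sketch of applying X → ⟨X1 का X2, X2 of X1⟩ with a hypothetical lexical table; a real hierarchical decoder chart-parses over thousands of automatically extracted rules:

```python
# Hypothetical sub-phrase translations standing in for learned rules.
lexical = {"भारत": "India", "जापान": "Japan",
           "प्रधान मंत्री": "Prime Minister", "वित्त मंत्री": "Finance Minister"}

def apply_ka_rule(hindi_phrase):
    # Match the pattern X1 का X2 and emit X2 of X1.
    x1, x2 = hindi_phrase.split(" का ")
    return f"{lexical[x2]} of {lexical[x1]}"

print(apply_ka_rule("भारत का प्रधान मंत्री"))  # -> Prime Minister of India
```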
GLUE RULE
• Glue rules facilitate the concatenation of two trees originating from the same non-terminal. The two glue rules are:
• S → ⟨S1 X2, S1 X2⟩
• S → ⟨X1, X1⟩
• These two rules in conjunction can be used to concatenate discontiguous phrases. So, the input to the system is a sentence in Hindi and a set of SCFG rules extracted from the training set.
• To avoid a rule set of unmanageable size and to reduce decoding complexity, we typically set limits on possible rules:
• at most 2 non-terminal symbols;
• at least one but at most 5 words per language;
• a span of at most 15 words.
3. Working of the Syntax Model
• Earlier models did not include any linguistic information about the training data, which produced grammatically incoherent output.
• The persistence of the reordering problem in translated text led to the development of the syntax-based model. In this model, Moses is trained on syntactic phrases on the target side.
• Syntactic information includes the root word, word class and POS category. In our work, syntactic parsing is applied to the English side.
ADVANTAGES
• Since Hindi is a syntactically divergent language, this model overcomes the reordering problem faced in the phrase-based and hierarchical models.
• Syntax-based MT performs well for structurally divergent language pairs: Hindi follows SOV structure while English follows SVO structure.
• This model improves the grammaticality of the resulting sentence.
MODEL
[Figure: the syntax model applied to the parse tree of "He adores listening to music": first reordering (the children of each node are reordered into Hindi SOV order), then insertion (Hindi function words such as करता है and को are inserted), then translation (the English leaves are translated, yielding approximately "वह संगीत सुनने को प्यार करता है"; the Devanagari in the source slides is garbled).]
Working
• The string-to-tree model accepts a Hindi string as input, searches across multiple parsed English trees, and finds the highest-scoring tree.
• Input is a string: व्यक्तिगत जीवन
• Translation rules:
• [SYM][X] personal [NN][X] [FRAG] ||| [SYM][X] व्यक्ततगत [NN][X] [X] ||| 0.0326378 0.6 0.0652757 1 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• [SYM][X] personal life [FRAG] ||| [SYM][X] व्यक्ततगत जीिन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• [SYM][X] personal life [TOP] ||| [SYM][X] व्यक्ततगत जीिन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• Decoding by translation rules:
• [0..3]: [3..3]=</s> [0..2]=S : S ->S -> S </s> :0-0 : c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-4,6,-11.5445,-5.99562,-7.46699,-1.60944,1.99979,-16.0431)
• [0..1]: [1..1]= X [0..0]=S : S ->S -> S X :0-0 1-1 : c=0 core=(0,-0,1,0,0,0,0,0.999896,0) 0core=(0,-2,3,-3.35156,-0.916291,-2.43527,0,0.999896,-7.74303)
• [0..0]: [0..0]=<s> : S ->S -> <s> :: c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-1,1,0,0,0,0,0,0)
• [1..1]: [1..1]=personel : X ->X -> व्यक्ततगत :: c=0 core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,0) 0core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,-9.44562)
• The output string is: personal life
• The target tree it produces is: (TOP <s> (S (NP personal) (NP (NN life))) </s>)
4. Working of Hybrid Translation
• The main disadvantage of Statistical Machine Translation (SMT) is that it only translates phrases which were seen during training.
• Unseen phrases, such as named entities, are not translated.
• This leads to a low BLEU score. We can improve the BLEU score by translating named entities from an external source.
Working
[Flowchart: Preprocessing → Translation by Moses decoder → Postprocessing]
Preprocessing of data: Moses accepts data in the following format for hybrid translation:
आपको नए <n translation="monastery">आश्रम</n> के निर्माण के लिए कितने धन की आवश्यकता है
Translation by Moses decoder: we translate normally using the Moses decoder, which is trained on our data. The translation obtained is:
How much money you need for the construction of the new आश्रम?
Here the word आश्रम is left untranslated.
Postprocessing: the untranslated word can be translated by referring to the XML tags. The output obtained is:
How much money you need for the construction of the new monastery?
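A minimal sketch of the preprocessing step in Python, wrapping dictionary-known named entities in the XML markup that Moses consumes via its -xml-input option; the entity dictionary here is hypothetical:

```python
# Hypothetical external dictionary of named-entity translations.
entity_dict = {"आश्रम": "monastery"}

def mark_entities(sentence):
    """Wrap known entities so the decoder can use the supplied translation."""
    out = []
    for word in sentence.split():
        if word in entity_dict:
            out.append(f'<n translation="{entity_dict[word]}">{word}</n>')
        else:
            out.append(word)
    return " ".join(out)

print(mark_entities("आपको नए आश्रम के निर्माण के लिए कितने धन की आवश्यकता है"))
```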
Result of Hybrid Translation
• Exclusive: only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.
• Inclusive: the XML-specified translation competes with all the phrase-table choices for that span.
• Ignore: the XML-specified translation is ignored completely.
BLEU scores obtained:
xml-exclusive: 7.21
xml-inclusive: 7.36
xml-ignore: 6.18
Syntax Model Parsing Extended
BERKELEY PARSER
We used the Berkeley parser for parsing the English language in our project. Since we had a parser for English, we trained our system on both string-to-tree and tree-to-string models.
Example input: Economic Services
ENJU PARSER
With a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, this parser can effectively analyze the syntactic/semantic structures of English sentences and provide the user with phrase structures and predicate-argument structures.
Motivation
• Moses accepts data for training the syntax model in XML format:
• <tree label="NP"> <tree label="DET"> the </tree> <tree label="NN"> cat </tree> </tree>
• There are a number of parsers available, and each parser has its own idiosyncratic input and output format. Hence, we need to process the output of these parsers into a format compatible with Moses' syntax models. Three wrapper scripts are available in the Moses decoder under scripts/training/wrapper for converting parser output into Moses format. These are:
• parse-en-collins.perl – used with the Collins parser, available from MIT.
• parse-de-bitpar.perl – used with the BitPar parser, available from the University of Munich.
• parse-de-berkeley – used with the Berkeley parser, available from UC Berkeley.
• Since we used the Enju parser for our experiment and no such wrapper existed for it, we were motivated to write one. Hence we wrote a wrapper script to convert Enju parser output into the Moses-compatible syntax-tree format.
Format Conversion-
 We designed a program to convert the XML output of the Enju parser into a Moses-compatible XML format.
 But Enju and the Penn Treebank (PTB) have different syntactic categories: because the output of Enju is based on HPSG, which differs from the annotation policy of the PTB, its tree structures and/or syntactic categories often differ from those given by PTB-style annotation. However, category mappings provide a clear image of what Enju expresses, so we mapped Enju categories to PTB style for our experiment.
Steps:
1. For every <sentence> tag, start an output string by adding <tree label="TOP">.
2. For every <cons> tag:
i. Retrieve its CAT value ($CAT_VALUE).
ii. Retrieve its XCAT value ($XCAT_VALUE).
iii. If the XCAT value of the cons element is non-empty, find the corresponding POS tag by comparing it with the mapping table.
iv. Append a new tree tag to the output string by adding <tree label="CONS_POS">, where CONS_POS is the POS category derived from the mapping table.
3. For every <tok> tag:
i. Retrieve its POS value ($POS_VALUE).
ii. Append a new tree tag to the output string by adding <tree label="POS">, where POS is the POS category derived from the POS attribute of the tok tag.
4. For every closing </sentence> tag, add a new closing </tree> tag.
5. For every closing </cons> tag, add a new closing </tree> tag.
6. For every closing </tok> tag, add a new closing </tree> tag.
7. All unnecessary attributes are omitted. A sketch of this conversion follows.
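A minimal sketch of these steps in Python using xml.etree; the enju_to_ptb dictionary is a hypothetical stand-in for the full Enju-to-PTB category mapping table used in the project:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the (CAT, XCAT) -> PTB mapping table.
enju_to_ptb = {("NP", ""): "NP", ("VP", ""): "VP", ("S", ""): "S"}

def convert(node):
    if node.tag == "tok":                    # step 3: token -> POS-labelled tree
        return f'<tree label="{node.get("pos")}"> {node.text.strip()} </tree>'
    inner = " ".join(convert(child) for child in node)
    if node.tag == "cons":                   # step 2: map CAT/XCAT via the table
        key = (node.get("cat"), node.get("xcat", ""))
        return f'<tree label="{enju_to_ptb.get(key, node.get("cat"))}"> {inner} </tree>'
    if node.tag == "sentence":               # steps 1 and 4: wrap in TOP
        return f'<tree label="TOP"> {inner} </tree>'
    return inner                             # step 7: other attributes are dropped

root = ET.fromstring('<sentence><cons cat="NP" xcat="">'
                     '<tok pos="DT">the</tok> <tok pos="NN">cat</tok>'
                     '</cons></sentence>')
print(convert(root))
# -> <tree label="TOP"> <tree label="NP"> <tree label="DT"> the </tree>
#    <tree label="NN"> cat </tree> </tree> </tree>
```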
Challenges-
• The deep syntactic parser we used was Enju (Miyao and Tsujii, 2005), which is based on HPSG and outputs both (dependency-like) predicate-argument relations (Miyao, 2007) and phrase structure trees (although these do not follow the PTB scheme for phrase structure trees) in an XML format.
• The Berkeley parser is a phrase-structure-grammar parser based on the PTB grammar.
• The outputs of the two parsers differ in tree structure, since Enju's syntactic representation is richer: the Enju parser produces strictly binary trees, while the Berkeley parser does not. The trees also differ in the number of levels and in structure.
• This made the task of converting Enju output to the Moses format difficult.
Conclusion -
• We trained the syntax model on the converted Enju output. There was no major effect on the BLEU score.
Result
INTERFACE - Phrase Based Translation
• Input: य क्षत्र यमना पार कहलात ह िै य नई ददल्ली बहत पलों द्िारा भली भांतत जड हए ह
• Output: it regions are caled yamuna par and they new delhi these are also joined by many bridges from
Hierarchical Based
• Input: य क्षत्र यमना पार कहलात ह िै य नई ददल्ली बहत पलों द्िारा भली भांतत जड हए ह
• Output: so these regions are caled yamuna par and they from new delhi पलों by भली भांतत जड front are
Syntax based
• Input: य क्षत्र यमना पार कहलात ह िै य नई ददल्ली बहत पलों द्िारा भली भांतत जड हए ह
• Output: it caled yamuna par regions are and it from new delhi of the world the very popular from पलों by भली bridges from are
Corpus
Type           Source
Gyan Nidhi     Downloaded from Joshua
Miscellaneous  PM speech (July 2015), Budget Data (2014), Vigyan Prasar magazine
ACL2005        Made available by CDAC, Noida
Agriculture    www.pib.gov.in, Govt. of India
Result of Comparing Models of SMT
Model          Agriculture   ACL 2005   Gyan Nidhi   Misc.
Phrase         3.48          6.18       3.61         3.45
Hierarchical   3.27          13.8       4.3          5.2
Syntax ST      2.93          10.79      3.21         2.9
Syntax TS      1.2           2.3        0.9          1.5
[Chart: "Comparison of SMT Models": BLEU scores of each model on each corpus, plotted from the table above.]
Conclusion
We developed a Hindi-to-English translation system and compared the results obtained by the various models. During the course of this project, the various models of translation were evaluated, and it is concluded that the hierarchical model is the best approach for this task. The result was verified on several English and Hindi sentence corpora. The project was completed and successfully tested, producing the desired results.
Future Work
We need to:
• Perform and compare the results of the factored model in Moses.
• Find and replace OOV (out-of-vocabulary) words.
• Compare the effect of replacing OOV words on the BLEU score.
• Transliterate unknown words.
• We propose a word2vec-based technique for hybrid translation that can automate the process of generating dictionaries and the phrase table.
References
• Koehn, P., Och, F. J. and Marcu, D. Statistical Phrase-Based Translation. Information Sciences Institute, University of Southern California.
• Chiang, D. A Hierarchical Phrase-Based Model for Statistical Machine Translation. Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park, MD, USA.
• Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP).
• Zens, R. and Ney, H. (2004). Improvements in phrase-based statistical machine translation. In Proceedings of HLT-NAACL 2004.
• Behera, B. Hierarchical Phrase-Based Statistical Machine Translation System. M.Tech. project dissertation, under the guidance of Prof. Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay.
• Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
• Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 263–270, Stroudsburg, PA, USA. Association for Computational Linguistics.
• Sinha, R. M. K. and Thakur, A. (2005). Machine translation of bi-lingual Hindi-English (Hinglish) text. In Proceedings of the 10th Machine Translation Summit (MT Summit X), Phuket, Thailand, pages 149–156.
• Sachdeva, K., Srivastava, R., Jain, S. and Sharma, D. M. Hindi to English Machine Translation: Using Effective Selection in Multi-Model SMT. Language Technologies Research Center, International Institute of Information Technology, Hyderabad.
• Ahmed, A. and Hanneman, G. Syntax-Based Statistical Machine Translation: A Review.
• Aswani, N. and Gaizauskas, R. (2005). A hybrid approach to align sentences and words in English–Hindi parallel corpora. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.
• Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing (2nd edition). Prentice Hall.