7. SMT: Pros and Cons
Pros             Cons
Quick to build   Unpredictable
Cheap            Quick improvements not easy
Fluent
8. Features of an SMT system
• Translation Model: table containing
source and target phrases, together
with a probability score (accuracy)
• Language Model: list of sequences of
n-words in target language together
with a probability score (fluency)
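As a rough illustration of how the two components work together, here is a minimal sketch. The phrase pairs, probabilities, and the `UNSEEN` floor are invented for the example and do not come from any real engine; a real decoder also searches over many more candidates.

```python
import math

# Hypothetical phrase table: (source phrase, target phrase) -> probability.
PHRASE_TABLE = {
    ("abrir", "open"): 0.8,
    ("el archivo", "the file"): 0.7,
    ("el archivo", "the archive"): 0.3,
}

# Hypothetical bigram language model for the target language.
BIGRAM_LM = {
    ("open", "the"): 0.5,
    ("the", "file"): 0.4,
    ("the", "archive"): 0.05,
}

UNSEEN = 1e-6  # assumed floor probability for entries missing from a table


def score(pairs):
    """Log-score of a candidate given as a list of (source, target) phrase pairs."""
    # Translation model: how likely each target phrase is for its source phrase.
    tm = sum(math.log(PHRASE_TABLE.get(pair, UNSEEN)) for pair in pairs)
    # Language model: how likely each word sequence is in the target language.
    words = " ".join(target for _, target in pairs).split()
    lm = sum(math.log(BIGRAM_LM.get(bg, UNSEEN)) for bg in zip(words, words[1:]))
    return tm + lm


candidates = [
    [("abrir", "open"), ("el archivo", "the file")],
    [("abrir", "open"), ("el archivo", "the archive")],
]
best = max(candidates, key=score)
print(" ".join(target for _, target in best))
```

Both tables favour "the file" here, so that candidate gets the higher combined score.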
10. Tokenization and recasing
Breaking up text into meaningful units (tokens)
Lowercase all words
File > file
file? > file ?
file. > file .
File! > file !
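The examples above can be reproduced with a few lines of code. This is only a sketch: the regular expression is a simplifying assumption (it splits every punctuation mark off as its own token), and real tokenizers handle abbreviations, numbers, and language-specific rules.

```python
import re


def tokenize(text):
    """Lowercase the text and split punctuation off into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


for raw in ["File", "file?", "file.", "File!"]:
    print(raw, ">", " ".join(tokenize(raw)))
```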
12. Requirements: size
MS Translator Hub recommends at least
10k segments
I have gotten good results with 100-200k
segments
That is roughly a corpus of over 1 million words
13. Publicly Available Corpora
• Opus (ECB, EMA, OpenOffice)
• Acquis Communautaire
• Europarl
• Hansard
• Multilingual websites: Bitextor
24. DoMY (Basics)
Graphs: import-tmx, clean-LM/TM, build
LM/TM, train, translate.
Ini files: configuration (language pairs,
paths for input and output).
Folder structure: always include
superdomain, domain and subdomain
28. Graphs
Graph       Function                                        Input            Output
Import-tmx  Extract data from tmx files                     Raw              Corpora/sa
Clean-tm    Clean data                                      Corpora/sa       Corpora/ready
Build-lm    Prepares training set for LM                    Corpora/ready    builds
Build-tm    Prepares training set for TM                    Corpora/ready    builds
Train       Trains MT engine                                Builds           engines
Translate   Translates input files and produces tmx output  Translations/in  Translations/out
31. Is Your Engine Good?
A set is excluded from training to be used
for evaluation (598 segments)
From a BLEU score of 0.5 upward, the engine is likely to
perform well
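Holding out an evaluation set can be sketched as follows. The function name, the fixed `seed`, and the synthetic corpus are assumptions made for illustration; this is not how DoMY does the split internally, only the general idea of keeping 598 segments out of training.

```python
import random


def split_held_out(segments, held_out_size=598, seed=0):
    """Shuffle aligned segment pairs and set some aside for evaluation."""
    pairs = list(segments)
    if held_out_size >= len(pairs):
        raise ValueError("corpus is too small for the requested held-out set")
    # A fixed seed (an assumption here) keeps the split reproducible.
    random.Random(seed).shuffle(pairs)
    return pairs[held_out_size:], pairs[:held_out_size]


# Synthetic corpus of (source, target) pairs, for demonstration only.
corpus = [(f"source {i}", f"target {i}") for i in range(10_000)]
train_set, eval_set = split_held_out(corpus)
print(len(train_set), len(eval_set))
```

Because the evaluation segments never reach training, the score measured on them reflects how the engine handles unseen text.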
32. Keep Improving
Retrain the engine periodically as more
translated corpora become available
Gather feedback on what needs to be
improved
33. Statistical PE
• Keep a corpus of raw vs. PE
• Treat them as separate language pairs
• Run them through DoMY
• Create raw vs. PE engine
• 2 engines: source > target, raw > PE
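The two-engine setup amounts to chaining: the output of the source > target engine becomes the input of the raw > PE engine. The sketch below uses dictionary lookups as stand-ins for real engines; `make_engine`, `chain`, and the sample segments are hypothetical.

```python
def make_engine(table):
    """Return a stand-in 'engine' that looks segments up in a dict."""
    return lambda segment: table.get(segment, segment)


def chain(*engines):
    """Feed each engine's output into the next one."""
    def pipeline(segment):
        for engine in engines:
            segment = engine(segment)
        return segment
    return pipeline


# Engine 1: source > target (raw MT). Engine 2: raw > PE.
mt_engine = make_engine({"abrir el archivo": "open the archive"})
pe_engine = make_engine({"open the archive": "open the file"})

translate = chain(mt_engine, pe_engine)
print(translate("abrir el archivo"))
```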
34. Questions?
Speak now…
Or reach me at:
www.facebook.com/xlation
www.wordbonds.es
@rubendelafuente
http://www.linkedin.com/in/rubendelafuente
Editor's notes
Why? SMT is based on probability, calculated as the number of occurrences of a given token divided by the total number of tokens. Case and punctuation can distort that calculation.
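A small worked example of that calculation, using an invented sentence. Without normalization, "File", "file?" and "file!" count as three different tokens and "file" itself is never seen.

```python
from collections import Counter


def relative_frequency(tokens, token):
    """Occurrences of a token divided by the total number of tokens."""
    return Counter(tokens)[token] / len(tokens)


# Raw text split on whitespace only: case and punctuation stay attached.
raw = "File saved. Open file? Close file!".split()

# The same text after tokenization and lowercasing.
normalized = ["file", "saved", ".", "open", "file", "?", "close", "file", "!"]

print(relative_frequency(raw, "file"))
print(relative_frequency(normalized, "file"))
```

In the raw version "file" has a frequency of 0 out of 6 tokens; after normalization it is 3 out of 9.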
To get good results with SMT, you need at least around 10,000 segments
Using Olifant from Okapi Framework
Clean data: remove empty sentences and sentences that are too long or too short
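The cleaning step can be sketched as a simple length filter. The `min_words` and `max_words` thresholds are assumptions chosen for illustration, and this is not the Olifant workflow mentioned above, only the same idea in code.

```python
def clean(pairs, min_words=1, max_words=80):
    """Keep only segment pairs whose two sides are within the length limits."""
    kept = []
    for source, target in pairs:
        lengths = (len(source.split()), len(target.split()))
        # An empty side has 0 words, so min_words=1 also drops empty segments.
        if all(min_words <= n <= max_words for n in lengths):
            kept.append((source, target))
    return kept


pairs = [
    ("Open the file", "Abrir el archivo"),
    ("", "Guardar"),                       # empty source side
    ("word " * 100, "palabra " * 100),     # too long
]
print(clean(pairs))
```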