TAUS MT SHOWCASE, A Small LSP’s Guide to Commercialized Open Source SMT, Tom Hoar, Precision Translation Tools, 10 April 2013
1. TAUS
MACHINE
TRANSLATION
SHOWCASE
A Small LSP’s Guide To
Commercialized Open Source SMT
15:30 – 15:50
Wednesday, 10 April 2013
Tom Hoar
Precision Translation Tools
2. A Small LSP's Guide
To Commercialized Open Source SMT
From 28 years
of corpus exploitation
Tom Hoar
Precision Translation Tools
3. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
12 april 2013 2012 © Precision Translation Tools Co., Ltd. 3
4. Origin of MT?
● … the problem of translation could
conceivably be treated as a problem in
cryptography. When I look at an article in
Russian, I say “This is really written in
English, but it has been coded in some
strange symbols. I will now proceed to
decode.”
● March 4, 1947
● From: Warren Weaver, Mathematician, Rockefeller Foundation
● To: Norbert Wiener, Professor of Mathematics, MIT
5. Origin of Pessimism?
● … as to the problem of mechanical
translation, I frankly am afraid the
boundaries of words in different
languages are too vague and the
emotional and international connotations
are too extensive to make any quasi
mechanical translation scheme very
hopeful.
● April 30, 1947 (57 days later)
● Norbert Wiener, Professor of Mathematics, MIT
6. Sharing An Experience
● ESL/EFL student:
– “What does 'wanton' mean?”
● Teacher:
– “Where did you see it?”
– “How was it used?”
● Despite this, students learn that meaning
comes from vocabulary, spelling,
grammar, syntax
7. Working With “Meaning”
● CONTEXT + CONTENT = MEANING
● Context: the container
– i.e. domain, subject, usage, purpose, culture
● Content: anything in the container
– i.e. vocabulary, spelling, grammar, syntax,
punctuation, style
8. The bird swam to its nest.
● ESL/EFL students: “The meaning is
wrong.”
● Teacher: “Vocabulary, spelling, grammar,
syntax, punctuation are all correct. Why is
the meaning wrong?”
– Students are confused
● Homework: Fix the meaning without
changing the contents.
9. Context Is Determinative
● Possible solution:
– The bird is a duck – or swan, goose, penguin,
cormorant, etc.
● Lesson?
– Change the container – change the meaning
– Machines can’t search for a greater context
● Only humans can
● How often do we look beyond the obvious?
10. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
11. Disclaimer
● Speaker does not have a PhD
● Results from the School of Hard Knocks,
Faculty of Scientific Repetition
● Only affiliation with the Moses team is as a user
12. Precision Translation Tools
● Software publisher
– Founded in Feb 2010, Bangkok, Thailand
– Not a translation services provider
– Software, training and support
● “Do” Machine Translation
● “Do” Moses Yourself Community Edition (free)
● Senior managers have over 75 years of combined experience
serving translation professionals and user
documentation
13. Customers
● Current
– ~300 customers/users
– 30 countries
● Target
– Small & medium LSPs (2-20 persons)
– Translators
● Accomplishments
– First Maori – English SMT system
– First English – Khmer
14. Mission
● Make statistical machine translation tools
available to everyone with
– Open source foundation
– Simplified usability
– User education and training
– Autonomous ecosystems
– Intellectual property protection
15. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
16. 7 Fundamental Assumptions
● These are essential if SMT is to work.
● They cannot be proven.
● They can only be observed through the
success or failure of an SMT system.
17. SMT Assumption 1
● Most of the time, most authors create
content with appropriate
– Vocabulary
– Spelling
– Grammar
– Syntax
– Punctuation
– Style
18. SMT Assumption 2
● Most of the time, most translators create
translations with appropriate
– Vocabulary
– Spelling
– Grammar
– Syntax
– Punctuation
– Style
19. SMT Assumption 3
● In large collections of original content,
fragments repeat proportionately to their
occurrences in the real world
green birds fly quickly
red birds fly to the nest
white birds swim across the pond
yellow birds eat sunflower seeds
black birds eat yellow corn
white birds swim gracefully
black birds hover over the nest
pink birds stand on one leg
pink birds eat orange shrimp
grey birds stand in the nest
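The repetition that Assumption 3 relies on can be made concrete by counting fragments. The following is a minimal sketch in Python, assuming the ten sample sentences above are the whole collection and treating every run of one to three words as a "fragment". The `count_fragments` helper and the length cap are illustrative choices, not part of any SMT toolkit.

```python
from collections import Counter

# The ten sample sentences from the slide.
CORPUS = [
    "green birds fly quickly",
    "red birds fly to the nest",
    "white birds swim across the pond",
    "yellow birds eat sunflower seeds",
    "black birds eat yellow corn",
    "white birds swim gracefully",
    "black birds hover over the nest",
    "pink birds stand on one leg",
    "pink birds eat orange shrimp",
    "grey birds stand in the nest",
]


def count_fragments(sentences, max_len=3):
    """Count every contiguous word sequence of 1..max_len words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


counts = count_fragments(CORPUS)
for fragment in ("birds", "the nest", "white birds swim", "birds eat"):
    print(fragment, counts[fragment])
```

Fragments that occur in several sentences accumulate higher counts, which is the raw material the later training steps turn into probabilities.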
20. SMT Assumption 4
● In large collections of translations of
original content, the translations mirror the
repetitions in the original content
los pájaros verdes vuelan rápidamente
los pájaros rojos vuelan al nido
los pájaros blancos nadan en el estanque
los pájaros amarillos comen semillas de girasol
los pájaros negros comen maíz amarillo
los pájaros blancos nadan con gracia
los pájaros negros se ciernen sobre el nido
los pájaros rosados se aguantan sobre una sola pierna
los pájaros rosados comen camarones naranjas
los pájaros grises están en el nido
21. SMT Assumptions 5 & 6
● Repetitions in past “original content” will
repeat in future content in the same
proportions.
● Mirrored repetitions in past translations of
“original content” will repeat in future
content in the same proportions.
22. SMT Assumption 7
● “Exceptions” are exceptions because they
don't follow normative rules.
– If there’s a rule for a so-called exception, it is
a rule not an exception.
– “Exceptions” occur less frequently than
“norms.” Therefore, they do not significantly
impact the proportions or frequency of
repetitions in the large collections.
23. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
24. Machine Learning
● Borrow content from a library
● Study the content
● Retain residual knowledge in memory
● Return the content to the library
● Organize and optimize the knowledge
● Recall and use the residual knowledge to
predict future events
25. Statistical Machine Translation
● Artificial Intelligence
● Study = Train
● Memory = Tables
● Optimize = Tune
● Predict = Translate
● SMT Model
– Configuration
– Translation Model: Phrase Table, Reordering Table
– Language Model
26. What is a model?
27. What is a model?
28. What is a model?
● A representation of an original that
maintains the original’s proportions,
likeness, etc.
● A working model replicates or emulates
the functions of the original
● A statistical model is a working model
– Uses statistical data to “do” something
– Statistical data = numbers about the past
– “Do” something = predict the future
29. Examples of Statistical Models
● Financial models
predict account
balances
30. Examples of Statistical Models
● Financial models
predict account
balances
● Weather models
predict hurricanes
31. Examples of Statistical Models
● Financial models
predict account
balances
● Weather models
predict hurricanes
● Traffic models
predict traffic jams
32. Examples of Statistical Models
● Financial models
predict account
balances
● Weather models
predict hurricanes
● Traffic models
predict traffic jams
● SMT models
predict translations
33. Proportions Matter
● Barbie
● Height 6'0"
● Weight 100 lbs.
● Size 4
● 39" x 21" x 33"
● Distorted likeness
● >15% of segments
in EuroParl are
parliamentary
protocol
34. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
35. SMT Statistical Model
● SMT Model
– Configuration
– Translation Model: Phrase Table, Reordering Table
– Language Model
1. Make SMT model from “original content”
2. Use SMT model to translate new content (predict
translations) without the “original content”
36. Train Translation Model
● domt train-tm (train-model.perl)
● Count frequencies of sentence fragment pairs
● One or more tables
– Can reach 15 GB each
Original Content
los pájaros verdes vuelan rápidamente | green birds fly quickly
los pájaros rojos vuelan al nido | red birds fly to the nest
los pájaros blancos nadan en el estanque | white birds swim across the pond
los pájaros amarillos comen semillas de girasol | yellow birds eat sunflower seeds
los pájaros negros comen maíz amarillo | black birds eat yellow corn
los pájaros blancos nadan con gracia | white birds swim gracefully
los pájaros negros se ciernen sobre el nido | black birds hover over the nest
los pájaros rosados se aguantan sobre una sola pierna | pink birds stand on one leg
los pájaros rosados comen camarones naranjas | pink birds eat orange shrimp
los pájaros grises están en el nido | grey birds stand in the nest
Phrase Table
Source language (stimulus) | Target language (response) | Probability
los pájaros | birds | 50%
los | birds | 50%
negros | black | 50%
pájaros negros | black | 50%
los pájaros negros | black birds | 100%
los pájaros negros comen | black birds eat | 100%
los pájaros negros comen maíz | black birds eat yellow | 100%
los pájaros negros comen maíz amarillo | black birds eat yellow corn | 100%
pájaros verdes | green | 50%
verdes | green | 50%
los pájaros verdes | green birds | 100%
los pájaros verdes vuelan | green birds fly | 100%
los pájaros verdes vuelan rápidamente | green birds fly quickly | 100%
grises | grey | 50%
pájaros grises | grey | 50%
los pájaros grises | grey birds | 100%
los pájaros grises están | grey birds stand | 100%
los pájaros grises están en | grey birds stand in | 100%
los pájaros grises están en el | grey birds stand in the | 100%
los pájaros grises están en el nido | grey birds stand in the nest | 100%
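The probabilities in the phrase table are relative frequencies of fragment pairs. The following is a minimal sketch in Python, assuming the fragment pairs have already been extracted. The `EXTRACTED_PAIRS` list and the `build_phrase_table` helper are hypothetical illustrations; real training with train-model.perl first aligns words and extracts the pairs, which is not shown here. The sketch normalizes each pair count by the total count of its target fragment, which is one reading that reproduces the 50% and 100% figures above.

```python
from collections import Counter, defaultdict

# Hypothetical, already-extracted (source, target) fragment pairs.
EXTRACTED_PAIRS = [
    ("los pájaros", "birds"),
    ("los", "birds"),
    ("negros", "black"),
    ("pájaros negros", "black"),
    ("los pájaros negros", "black birds"),
    ("los pájaros negros comen", "black birds eat"),
]


def build_phrase_table(pairs):
    """Map source -> {target: relative frequency of the pair for that target}."""
    pair_counts = Counter(pairs)
    target_totals = Counter(target for _, target in pairs)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / target_totals[target]
    return dict(table)


table = build_phrase_table(EXTRACTED_PAIRS)
for source, targets in table.items():
    for target, prob in targets.items():
        print(f"{source} | {target} | {prob:.0%}")
```

Two source fragments competing for the same target fragment each get 50%, while a pair with no competitor gets 100%.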
37. Train Language Model
● domt train-lm (build-lm.sh)
● Count frequencies of sentence fragments in target language
● One or more tables
– Can reach 25 GB each
Target Content
green birds fly quickly
red birds fly to the nest
white birds swim across the pond
yellow birds eat sunflower seeds
black birds eat yellow corn
white birds swim gracefully
black birds hover over the nest
pink birds stand on one leg
pink birds eat orange shrimp
grey birds stand in the nest
Language Model
2-grams:
-1.30713 <s> green
-0.265492 green birds
-0.850518 birds fly
-0.677087 birds eat
3-grams:
-0.112767 <s> green birds
-0.421503 birds fly quickly
-0.592076 birds eat yellow
4-grams:
-0.10498 <s> green birds fly
-0.0527335 birds fly quickly </s>
-0.0570311 birds eat orange shrimp
5-grams:
-0.0732878 <s> green birds fly quickly
-0.0274306 birds fly to the nest
-0.0474597 birds swim across the pond
-0.0255669 birds eat yellow corn </s>
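The numbers in the language model listing are base-10 logarithms of fragment probabilities. The following is a minimal sketch in Python of where such numbers come from, using unsmoothed counts over the ten target sentences. The `ngram_counts` and `log10_prob` helpers are illustrative only. A real toolkit applies smoothing, so the values printed here will not match the listing above.

```python
import math
from collections import Counter

TARGET = [
    "green birds fly quickly",
    "red birds fly to the nest",
    "white birds swim across the pond",
    "yellow birds eat sunflower seeds",
    "black birds eat yellow corn",
    "white birds swim gracefully",
    "black birds hover over the nest",
    "pink birds stand on one leg",
    "pink birds eat orange shrimp",
    "grey birds stand in the nest",
]


def ngram_counts(sentences, n):
    """Count n-grams, with <s> and </s> marking sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def log10_prob(sentences, ngram):
    """Unsmoothed log10 P(last word | preceding words).

    Expects an n-gram of at least two words that occurs in the data;
    unseen n-grams are not handled in this sketch.
    """
    words = tuple(ngram.split())
    numerator = ngram_counts(sentences, len(words))[words]
    denominator = ngram_counts(sentences, len(words) - 1)[words[:-1]]
    return math.log10(numerator / denominator)


for ngram in ("<s> green", "birds fly", "birds eat", "birds fly quickly"):
    print(f"{ngram}: {log10_prob(TARGET, ngram):.4f}")
```

Values closer to zero mean the fragment is more familiar: the last word follows its preceding words more often in the target content.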
38. Tune SMT Model
● domt train-mert (mert-moses.pl)
● Creates optimal settings for the components to work together
● Configuration file defines paths to files and stores optimal settings
[ttable-file]
0 0 5 ${path}/phrase-table.gz
[distortion-file]
0-0 msd-bidirectional-fe 6 ${path}/reordering-table.gz
[lmodel-file]
0 0 3 ${path}/irstlm_arpa.en.gz
[weight-t]
0.169891
0.0856206
-0.0664389
0.0489578
0.0018491
[ttable-limit]
20
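The configuration file follows a simple layout: a bracketed section name followed by one value per line. The following is a minimal sketch in Python of reading that layout. The `parse_config` helper is hypothetical and is not part of Moses or DoMT; the sample text copies the values shown on the slide, and skipping lines that start with "#" is an assumption about comments.

```python
SAMPLE = """\
[ttable-file]
0 0 5 ${path}/phrase-table.gz
[distortion-file]
0-0 msd-bidirectional-fe 6 ${path}/reordering-table.gz
[lmodel-file]
0 0 3 ${path}/irstlm_arpa.en.gz
[weight-t]
0.169891
0.0856206
-0.0664389
0.0489578
0.0018491
[ttable-limit]
20
"""


def parse_config(text):
    """Return {section name: list of value lines} for a bracketed-section file."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


config = parse_config(SAMPLE)
weights = [float(value) for value in config["weight-t"]]
print(len(weights), config["ttable-limit"])
```

The file paths and the tuned weights live side by side, which is why a single configuration file is enough to run the finished model.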
39. SMT Statistical Model
● SMT Model
– Configuration
– Translation Model: Phrase Table, Reordering Table
– Language Model
1. Make SMT model from “original content”
2. Use SMT model to translate new content (predict
translations) without the “original content”
40. SMT Model In Use
● Step 1: domt translate (moses -f config)
● Translation model (Phrase Table, Reordering Table) creates
thousands of possible sentences
Input: los pájaros negros nadan con gracia
1 green birds swim gracefully
2 red birds swim gracefully
3 black birds swim gracefully
4 yellow birds swim gracefully
5 birds yellow fly green corn
6 red corn eats white pond
...
10,000 pink birds swim gracefully
41. SMT Model In Use
● Step 2: Language model scores each possible sentence
1 green birds swim gracefully 0.38
2 red birds swim gracefully 0.32
3 black birds swim gracefully 0.84
4 yellow birds swim gracefully 0.74
5 birds yellow fly green corn 0.07
6 red corn eats white pond 0.02
...
10,000 pink birds swim gracefully 0.57
42. SMT Model In Use
● Step 3: The highest score is the most probable and is
selected as the translation
3 black birds swim gracefully 0.84
Translation: black birds swim gracefully
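The three steps above amount to generate, score, select. The following is a minimal sketch in Python of the final selection, with the candidate sentences and scores copied from the slides. In a real system the translation model produces the candidates and the language model contributes the scores; the `select_translation` helper is illustrative only.

```python
# Candidate sentences and scores as shown on the slides.
SCORED_CANDIDATES = [
    ("green birds swim gracefully", 0.38),
    ("red birds swim gracefully", 0.32),
    ("black birds swim gracefully", 0.84),
    ("yellow birds swim gracefully", 0.74),
    ("birds yellow fly green corn", 0.07),
    ("red corn eats white pond", 0.02),
    ("pink birds swim gracefully", 0.57),
]


def select_translation(scored):
    """Return the (sentence, score) pair with the highest score."""
    return max(scored, key=lambda pair: pair[1])


best_sentence, best_score = select_translation(SCORED_CANDIDATES)
print(best_sentence, best_score)
```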
43. Is This Familiar?
● You have a difficult sentence to translate
● Despite your training and skills, you
create 4 or 5 possible translations with
different words and word orders.
● You struggle
– Which one is “right?”
– Which is the “best?”
● You have to pick one or you don't get
paid.
44. What Drives You?
● How do you make your decision when all
these things are equally “right”
– Meaning
– Grammar
– Syntax
– Etc.
● You have to pick one or you don't get
paid.
45. Feeling and Familiarity
● The one that feels familiar
– Familiarity comes from frequency
● SMT emulates this process
– SMT can generate 10,000-20,000
possibilities. Computers are good at that;
people aren’t.
– SMT calculates the probabilities for each
one. Computers aren’t good at feelings.
46. Stimulus
● “los pájaros negros nadan con gracia”
● English possibilities generated
– green birds swim gracefully
– red birds swim gracefully
– black birds swim gracefully
– yellow birds swim gracefully
– pink birds swim gracefully
47. Human Response
● “black birds swim gracefully”
– I’m familiar with swans as black birds that
swim gracefully.
– I’m familiar with yellow and pink birds that
swim, but they don’t swim gracefully.
– I’m not familiar with green or red birds that
swim at all.
48. SMT Response
● “black birds swim gracefully”
– All tokens are familiar because they’re in the
tables.
– The fragment “black birds swim” is the most
familiar because it occurs most frequently;
therefore it scores highest.
– The sentence scored highest because its
fragments are in the language model more
frequently.
49. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
50. Initial Challenges
● Requires millions of pairs
● Requires expensive, powerful hardware
● Lacks trained user base
● Faces hostile target users
● Faces criticism from experts
● Lacks professional features
51. Market Response
● Private SaaS Portals
– Asia Online
– SDL (1)
– Safaba
– Let's MT
– Tauyou
– Firma8
– KantanMT
– SmartMATE
– Straker Translations
– Cloudwords
– AVB Translations
– Lingo24
– MemSource
– Translated.net
– Trusted Translations
– XTM International
● Integrators & Consultants
– CrossLang
– Digital Silk Road
– PangeaMT
– Asia Online
– Safaba
– SDL (1)
– IBM
– Systran (2)
– LexWorks (2)
– Prompsit Language Engineering (3)
● Software Publishers
– Systran (2)
– ProMT (3)
– Precision Translation Tools
Notes:
(1) = LanguageWeaver not Open Source
(2) = SYSTRAN Server, RbMT with Moses
(3) = RbMT & SMT options
52. Learned Challenges
● Customizing models requires possession
and control of TMs
– Users don't entrust TMs to portals
– Perception they're subsidizing competitors
● Portals must continuously create models
– Overhead for each new model
– No portal has talent for every language
– Revert to customer's talents
53. Updated Challenges
● Requires millions of pairs
● Requires expensive, powerful hardware
● Lacks trained user bases
● Faces hostile, untrained target users
● Faces criticism from experts
● Lacks professional features
● “Trusted 3rd parties” don't exist
● Continual need for new models
54. Productivity As Quality
● Customers want quality
– Can't define it for computers to test for it
● All automated quality scoring systems
require human reference translations
● 100% match = raw SMT is identical to
independent human translations, not post-
edited translations
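The 100% match measure can be computed by comparing raw SMT output with independent human translations, segment by segment. The following is a minimal sketch in Python; the `exact_match_rate` helper, the whitespace normalization, and the sample segments are assumptions made up for illustration, not customer data.

```python
def exact_match_rate(mt_segments, reference_segments):
    """Fraction of raw MT segments identical to the human reference."""
    if len(mt_segments) != len(reference_segments):
        raise ValueError("segment lists must be the same length")
    if not mt_segments:
        return 0.0
    matches = sum(
        " ".join(mt.split()) == " ".join(ref.split())  # ignore spacing differences
        for mt, ref in zip(mt_segments, reference_segments)
    )
    return matches / len(mt_segments)


# Invented sample segments for illustration.
mt_output = [
    "black birds swim gracefully",
    "red birds fly at the nest",
    "pink birds eat shrimp orange",
    "grey birds are in the nest",
]
references = [
    "black birds swim gracefully",
    "red birds fly to the nest",
    "pink birds eat orange shrimp",
    "grey birds stand in the nest",
]

rate = exact_match_rate(mt_output, references)
print(f"{rate:.0%} of segments are 100% matches")
```

Segments counted here need no post-editing at all, which is why this share can stand in for productivity.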
55. 2012 Serendipitous Discovery
● Don't need millions of sentence pairs
within a constrained domain
● PTTools customers with 130K to 300K
segments achieve 100% matches on
20-40% of SMT output
● Let's MT reports similar corpus sizes
produce 20% productivity gains
● Tauyou reports that as few as 50K segments
result in customer satisfaction with
productivity gains
56. Productivity As Quality
● Where does productivity begin?
● How many 100% matches make
productivity gain inevitable?
Quality vs. Productivity | <100% Match (Post-editing) | 100% Match (Productivity) | Annual TCO | Preparation Time
RbMT | 90 – 95% | 5% – 10% | $150,000 | 2 – 3 weeks
SMT Pre 2007 | > 99% | < 1% | $10,000 | 1 – 3 weeks
SMT 2007 to 2008 | > 99% | < 1% | $6,000 | 5 – 14 days
SMT 2009 to 2011 | 90% – 95% | 5% – 10% | $1,500 | 2 – 7 days
SMT 2012 | *60% – 80% | *20% – 40% | $1,200 | 6 – 48 hours
* actual customer experience
57. Adjusted Challenges
● Requires millions of pairs → 150,000 to 300,000 can work fine
● Requires expensive, powerful hardware → Less than professional graphic arts
● Lacks trained user bases → Professionals pay for training courses
● Faces hostile, untrained target users → Attitudes are proportionate to benefits
● Faces criticism from experts → Early experts liquidate
● Lacks professional features → New versions add new features
● “Trusted 3rd parties” don't exist
● Continual need for new models
58. Market Response Revisited
● Portals, Full Service, Experts
– Perpetuate perception of complexity
– Control models created with free technology
– Protect investments
● If juke boxes and radio stations preceded
phonographs, what would today’s music
industry sell?
– (a) CDs
– (b) pay-per-play MP3s and digital radio?
59. Agenda
● Introduction
● Who is PTTools?
● Fundamental Assumptions
● Models and Proportions
● SMT Statistical Models
● New Perspective
● Acknowledgements
60. Acknowledgements
● Precision Translation Tools DoMT®
● Prompsit Language Engineering
● Tauyou
● Safaba Translation Solutions
● LetsMT! by Tilde
● Digital Silk Road
● PangeaMT by Pangeanic
● CrossLang
● KantanMT
● Lingo24