“Data & Linguistics”
Delivering Machine Translation with
Subject Matter Expertise
John Tinsley
Director / Co-Founder
Localization World. 31st Oct 2014, Vancouver
Machine Translation
with Subject Matter Expertise
From Data Engineering to
Linguistic Engineering
“Ensemble” MT architecture
The world’s first and only patent-specific
MT system that’s ready to go
Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
Patents: an MT nightmare
L is an organic group selected from -CH2-
(OCH2CH2)n-, -CO-NR'-, with R'=H or
C1-C4 alkyl group; n=0-8; Y=F, CF3 …
maximum stress of 1.2 to 3.5 N/mm<2>
and a maximum elongation of 700 to
1,300% at 0[deg.] C.
Long Sentences
Technical constructions
Largest single document: 249,322 words
Longest Sentence: 1,417 words
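The markup in the excerpts above ("N/mm<2>", "0[deg.] C.") is the kind of input a pre-processing step might clean before translation. A minimal sketch, assuming two hypothetical regular-expression rules; the function name and the rules are illustrative and are not taken from the presentation:

```python
import re

# Hypothetical mapping from ASCII digits to Unicode superscript digits.
SUPERSCRIPTS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def normalise_patent_markup(text: str) -> str:
    """Toy normaliser for two markup patterns seen in patent text."""
    # "<2>" becomes a superscript digit: "N/mm<2>" -> "N/mm²"
    text = re.sub(r"<(\d+)>", lambda m: m.group(1).translate(SUPERSCRIPTS), text)
    # "[deg.] C" becomes a degree sign: "0[deg.] C." -> "0 °C."
    text = re.sub(r"\s*\[deg\.\]\s*C\b", " °C", text)
    return text


sample = "maximum stress of 1.2 to 3.5 N/mm<2> at 0[deg.] C."
cleaned = normalise_patent_markup(sample)
print(cleaned)
```

Real patent data would need many more rules than these two, and each rule would have to be checked against chemical formulae, where angle brackets and digits can carry other meanings.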
“Most of these things are not like the other”
Many languages aren’t a dream either
(And teaches the teacher her students language the Arabic)
Spanish – Italian English – Spanish Arabic – English
Data Engineering
What is Linguistic Engineering?
Pre-processing Post-processing
Input Output
Training Data
Data Engineering + Linguistic Engineering
An “ensemble” architecture
Input
Output
Training Data
Client TM/terminology (optional)
•  Patent input classifier
•  Chinese pre-ordering rules
•  Korean pharma tokenizer
•  Japanese script normalisation
•  German compounding rules
•  Spanish med-device entity recognizer
•  Moses (shown three times)
•  RBMT
•  Multi-output combination
•  Statistical post-editing
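One way to picture an "ensemble" of this kind is as a chain of text-to-text steps wrapped around an engine. A minimal sketch, assuming a hypothetical `EnsemblePipeline` class and stand-in steps; none of these names come from the presentation, and the engine here only tags its input rather than calling Moses or an RBMT system:

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A step is any function from text to text (hypothetical interface).
Step = Callable[[str], str]


@dataclass
class EnsemblePipeline:
    """Toy pipeline: pre-processing steps, one engine, post-processing steps."""
    engine: Step
    pre: List[Step] = field(default_factory=list)
    post: List[Step] = field(default_factory=list)

    def translate(self, text: str) -> str:
        for step in self.pre:
            text = step(text)
        text = self.engine(text)
        for step in self.post:
            text = step(text)
        return text


def normalise_whitespace(text: str) -> str:
    """Stand-in pre-processing step: collapse runs of whitespace."""
    return " ".join(text.split())


def fake_engine(text: str) -> str:
    """Stand-in for a real engine; it only wraps the input in tags."""
    return f"<mt>{text}</mt>"


def strip_tags(text: str) -> str:
    """Stand-in post-processing step: remove the engine's wrapper tags."""
    return text.replace("<mt>", "").replace("</mt>", "")


pipeline = EnsemblePipeline(
    engine=fake_engine,
    pre=[normalise_whitespace],
    post=[strip_tags],
)
result = pipeline.translate("a   patent \n claim")
print(result)
```

Keeping every step behind the same signature is what would let a language-specific component, such as a tokenizer or a set of pre-ordering rules, be added or removed per language pair without touching the engine.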
Easier said than done
“A very particular set of skills”
MT Knowledge (from a scientific perspective)
Domain Knowledge (the nature of the content)
Linguistic Knowledge (the characteristics of the language)
MT Knowledge
Implementation
•  Computer science!
•  Programming
•  Data structures
•  Algorithms
Science
•  Machine learning
•  Probability theory
•  Bayesian statistics
•  Markov Models
Domain Knowledge
What’s important?
•  Chemical names
•  References to figures
•  Claim cross-references
Where do we learn?
•  Commercial partners
•  LSPs & Translators
•  Research
Consistent across langs?
•  Japanese abstract order
•  Numbering / bullets
•  Document layout
Document types?
•  Patents
•  Applications, reports
•  Pharmaceutical
•  IFUs, labels
Iconic
Translation Machines
Linguistic Knowledge
Number agreement: the house / the houses vs. la maison / les maisons
Gender agreement: the house / the cheese vs. la maison / le fromage
English - Spanish
English - French
Linguistic Knowledge
English - German
English - Chinese
种水果的农民
The farmer who grows fruit
[Lit: “grow fruit (particle) farmer”]
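The Chinese example above can be used to illustrate what a pre-ordering rule does: it rearranges source tokens into an order closer to the target language before translation. A minimal sketch, assuming pre-tokenised input and a deliberately narrow, hypothetical rule (one 的 particle, with a single-token head noun directly after it); it is not the rule set used in the system described:

```python
from typing import List


def preorder_de_clause(tokens: List[str]) -> List[str]:
    """Move the head noun that follows 的 in front of its modifier.

    Toy rule: applies only when the sentence has exactly one 的 and
    treats everything before it as the modifier.
    """
    if tokens.count("的") != 1:
        return tokens
    i = tokens.index("的")
    if i == 0 or i + 1 >= len(tokens):
        return tokens
    modifier, head, rest = tokens[:i], tokens[i + 1], tokens[i + 2:]
    return [head, "的"] + modifier + rest


tokens = ["种", "水果", "的", "农民"]
reordered = preorder_de_clause(tokens)
print(" ".join(reordered))
```

After reordering, the tokens read "farmer (particle) grow fruit", which follows the English order of "the farmer who grows fruit". A production rule would work on a parse rather than on raw token positions.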
If you don’t understand it, you can’t translate it
MT with Subject Matter Expertise
“Allopurinol-induced serious cutaneous adverse
reactions (SCAR), including Stevens-Johnson syndrome
(SJS) and toxic epidermal necrolysis (TEN), are
associated with a genetic marker, the HLA-B*5801
allele.”
“IPTranslator is perfect for someone who needs to search [patents]
across multiple languages and is useful in the case of both
patentability and infringement searches.”
– Aalt van de Kuilen, Global Head of Patent Information, Abbott
Machine Translation for Patents
What is the value for users?
Specialist solutions deliver more useable outcomes for the user
Post-editing = Increased productivity
For information purposes = Extract more meaning
Multilingual search = Retrieve more relevant results
De-risking the machine translation proposition
What is the value for users?
Typical Prerequisites
+ Data
+ Time
+ €€€
= ???
New Prerequisites
+ No data needed
+ Systems are ready to go
+ No upfront cost
= Evaluate immediately
Customisation. Refinement.
» Incorporation of user feedback
» Incremental training with post-edits
» Tuning for specific input types
Case Studies
1.  What this approach means straight up in terms of quality…
2.  Productivity gains from using these systems…
3.  As a foundation for client customization…
Case 1: Quality
[Bar chart, Portuguese to English: Iconic, Google and Systran; axis from 0 to 50; bar values not preserved]
Case 1: Quality
German to English (scale from 1 to 5)
•  Evaluator 1: 2.83
•  Evaluator 2: 4
•  Evaluator 3: 3.86
•  Average: 3.56
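The average on this slide can be checked against the three evaluator scores. A small sketch of that arithmetic, assuming the first three values belong to Evaluators 1 to 3 in order and that the average is a plain mean rounded to two decimals:

```python
from statistics import mean

# Scores as read from the slide (assumed order: Evaluator 1, 2, 3).
scores = {"Evaluator 1": 2.83, "Evaluator 2": 4.0, "Evaluator 3": 3.86}

# Plain arithmetic mean, rounded to two decimals as on the slide.
average = round(mean(scores.values()), 2)
print(average)
```

The result matches the fourth value on the slide, which supports reading it as the average of the other three.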
Case 2: Productivity
Iconic had a domain-specific MT solution for that industry
Machine Translation technology for the legal industry
Business Need
Case 2: Productivity
Delivered immediately and initial results were positive
Translation samples required for initial evaluation
Process (1)
Case 2: Productivity
“The complexities and unforeseen but inevitable surprises of MT
integration in large scale production processes were handled both
competently and efficiently.”
Integrate Iconic with GlobalSight for productivity pilot
Process (2)
Case 2: Productivity
>20% productivity increase for translator post-editing Iconic output
“Measurable productivity gains delivered from the outset”
Performance
Case 2: Productivity
•  Ongoing improvement through feedback from translators
•  Ongoing improvement through the incorporation of post-edits
•  More than 5 million words translated to date for Asian languages
•  Periodic roll-out of new languages over time
Looking forward
Case 3: Customization
-  Modify our patent machine translation engines for
“Written Opinions” on patents
-  0.25% new data, 2 new ensemble processes
[Bar chart, Chinese to English: Iconic and Google; series: Baseline, + Modification; values shown: 21, 20, 27; axis from 0 to 60]
Case 3: Customization
Productivity
threshold
Essentially out of domain – not viable for post-editing
Case 3: Customization
Productivity
threshold
After customization – 25% gain in productivity
All content is not created equal
We cannot afford to be dogmatic when it
comes to MT
Know your subject matter!
Domain-specific MT is about more than just
data
Take home messages…
+ Linguistics!
Thank You!
john@iptranslator.com
@IconicTrans

Data and Linguistics: Delivering Machine Translation with Subject Matter Expertise

  • 1. “Data & Linguistics” Delivering Machine Translation with Subject Matter Expertise John Tinsley Director / Co-Founder Localization World. 31st Oct 2014, Vancouver
  • 3. From Data Engineering to Linguistic Engineering
  • 5. The world’s rst and only patent specic MT system that’s ready to go
  • 6. Data Engineering What is Linguistic Engineering? Pre-processing Post-processing Input Output Training Data
  • 7. Patents: an MT nightmare L is an organic group selected from -CH2- (OCH2CH2)n-, -CO-NR'-, with R'=H or C1-C4 alkyl group; n=0-8; Y=F, CF3 … maximum stress of 1.2 to 3.5 N/mm<2> and a maximum elongation of 700 to 1,300% at 0[deg.] C. Long Sentences Technical constructions Largest single document: 249,322 words Longest Sentence: 1,417 words
  • 8. “Most of these things are not like the other” Many languages aren’t a dream either (And teaches the teacher her students language the Arabic) Spanish – Italian English – Spanish Arabic – English
  • 9. Data Engineering What is Linguistic Engineering? Pre-processing Post-processing Input Output Training Data
  • 10. Data Engineering + Linguistic Engineering An “ensemble” architecture Chinese pre-ordering rules Statistical Post-editing Input Output Training Data Spanish med-device entity recognizer Multi-output Combination Korean pharma tokenizer Patent input classifier Client TM/terminology (optional) Japanese script normalisation German Compounding rules Moses RBMT Moses Moses
  • 11. Easier said than done “A very particular set of skills” MT Knowledge (from a scientific perspective) Domain Knowledge (the nature of the content) Linguistic Knowledge (the characteristics of the language)
  • 12. MT Knowledge Implementation •  Computer science! •  Programming •  Data structures •  Algorithms Science •  Machine learning •  Probability theory •  Bayesian statistics •  Markov Models
  • 13. Domain Knowledge What’s important? •  Chemical names •  References to figures •  Claim cross-references Where do we learn? •  Commercial partners •  LSPs & Translators •  Research Consistent across langs? •  Japanese abstract order •  Numbering / bullets •  Document layout Document types? •  Patents •  Applications, reports •  Pharmaceutical •  IFUs, labels Iconic Translation Machines
  • 14. Linguistic Knowledge Number agreement: the house / the houses vs. la maison / les maisons Gender agreement: the house / the cheese vs. la maison / le frommage English - Spanish English - French
  • 15. Linguistic Knowledge English - German English - Chinese 种水果的农民 The farmer who grows fruit [Lit: “grow fruit (particle) farmer”]
  • 16. If you don’t understand it, you can’t translate it MT with Subject Matter Expertise “Allopurinol-induced serious cutaneous adverse reactions (SCAR), including Steven Johnson’s syndrome (SJS) and toxic epidermal necrolysis (TEN), are associated with a genetic marker, the HLA-B*5801 allele.” “IPTranslator is perfect for someone who needs to search [patents] across multiple languages and with is useful in the case of both patentability and infringement searches.” – Aalt van de Kuilen, Global Head of Patent Information, Abbott Machine Translation for Patents
  • 17. What is the value for users? Specialist solutions deliver more useable outcomes for the user Post-editing For information purposes Multilingual search Increased productivity Extract more meaning Retrieve more relevant results = = =
  • 18. De-risking the machine translation proposition What is the value for users? + Data + Time + €€€ = ??? + No data needed + Systems are ready to go + No upfront cost = Evaluate immediately New PrerequisitesTypical Prerequisites Customisation. Refinement. » Incorporation of user feedback » Incremental training with post-edits » Tuning for specific input types
  • 19. Case Studies 1.  What this approach means straight up in terms of quality… 2.  Productivity gains from using these systems… 3.  As a foundation for client customization…
  • 21. Case 1: Quality. [Bar chart: German to English translation, adequacy scores on a 1 to 5 scale. Evaluator 1: 2.83; Evaluator 2: 4; Evaluator 3: 3.86; Average: 3.56]
  • 22. Case 2: Productivity Iconic had a domain-specific MT solution for that industry Machine Translation technology for the legal industry Business Need
  • 23. Case 2: Productivity Delivered immediately and initial results were positive Translation samples required for initial evaluation Process (1)
  • 24. Case 2: Productivity “The complexities and unforeseen but inevitable surprises of MT integration in large scale production processes were handled both competently and efficiently.” Integrate Iconic with GlobalSight for productivity pilot Process (2)
  • 25. Case 2: Productivity >20% productivity increase for translator post-editing Iconic output “Measurable productivity gains delivered from the outset” Performance
  • 26. Case 2: Productivity •  Ongoing improvement through feedback from translators •  Ongoing improvement through the incorporation of post-edits •  More than 5 million words translated to date for Asian languages •  Periodic roll-out of new languages over time Looking forward
  • 27. Case 3: Customization. Modify our patent machine translation engines for “Written Opinions” on patents. 0.25% new data, 2 new ensemble processes. [Bar chart: Chinese to English, axis 0 to 60. Baseline: Iconic 21, Google 20; “+ Modification”: 27]
  • 28. Case 3: Customization Productivity threshold Essentially out of domain – not viable for post-editing
  • 29. Case 3: Customization Productivity threshold After customization – 25% gain in productivity
  • 30. Take home messages… All content is not created equal. We cannot afford to be dogmatic when it comes to MT. Know your subject matter! Domain specific MT is about more than just data + Linguistics!

Editor’s notes

  1. In this presentation, I’m going to talk about our experience of developing machine translation engines for complex content and languages. I’ll look at where we get to when we reach the limitations of existing technology and approaches, particularly focusing on WHY we reached that ceiling: WHAT was it about the content and the language that could not be overcome? From there, I’ll look at what we need to do to advance the technology and, FROM OUR PERSPECTIVE as MT technology developers and providers, tell you about what we discovered we needed to know, and what skillsets and know-how we needed in our team to achieve this. I’ll then WRAP UP with some case studies which will serve to illustrate the benefits that can be seen as a result of taking this approach. For DEVELOPERS, I hope we can share our experiences with you, and for BUYERS OR USERS OF MT, my hope is that this talk will pull back the curtain a little bit on MT development, which has been a bit of a black box.
  2. Just a little bit by way of an overview of Iconic Translation Machines to introduce the concepts I'm going to talk about. We develop what we call “MT with Subject Matter Expertise” The concept is that if you are hiring a professional translator for a job, beyond their language skills they also need to have subject matter expertise, particularly for technical content. *And the same applies to MT technology* ----- Meeting Notes (14/10/2014 12:52) ----- Our philosophy
  3. High quality data is essential for most effective approaches to MT. Cleaning and preparing data to build MT systems is data engineering. But data is just an ingredient. You still need to cook the data for the specific language, the specific content type and the writing style. This varies from language to language and from domain to domain. We need to know how to cook it: we need to understand the language, the content and the style, and not only take these into account but make them integral to the development process. This is linguistic engineering.
  4. How do you go about building such a concept? To answer this, I want to introduce the concept of the ensemble architecture for machine translation. As a developer, you cannot be dogmatic when it comes to approaches to MT. There are many approaches: you can be a statistical MT vendor, you can focus on Moses, you can use rule-based MT. Or you might do some sort of hybrid MT. In the “ensemble” approach, WE DO ALL OF THEM. Sometimes we use them all at the same time. Sometimes we only use one. It’s completely dependent on what works best for a given content type, style, and language together. For example, for Chinese-English patent MT, maybe you need a statistical decoder with some rules for automatic post-editing. Maybe for French-English abstract translation, an SMT system alone suffices. Maybe for Japanese-English titles, we can just use some rules, and maybe some machine learning based pre-processes. You study. You learn what ensemble works for a particular configuration and that’s what you implement.
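As a rough illustration of the ensemble idea in this note, the sketch below selects a chain of components per language pair and content type, mirroring the three examples given. Every component name, the `ENSEMBLES` table, and the toy string transformations are hypothetical placeholders, not Iconic's actual implementation.

```python
from typing import Callable, Dict, List, Tuple

Step = Callable[[str], str]

# Hypothetical stand-ins for real components (decoders, rules, post-editors).
def smt_decode(text: str) -> str:
    return f"SMT({text})"

def rule_based_translate(text: str) -> str:
    return f"RBMT({text})"

def rule_post_edit(text: str) -> str:
    return f"POSTEDIT({text})"

def ml_preprocess(text: str) -> str:
    return f"PRE({text})"

# Assumed mapping from (language pair, content type) to the chain of steps
# that was found to work best for that configuration.
ENSEMBLES: Dict[Tuple[str, str], List[Step]] = {
    ("zh-en", "patent"): [smt_decode, rule_post_edit],
    ("fr-en", "abstract"): [smt_decode],
    ("ja-en", "title"): [ml_preprocess, rule_based_translate],
}

def translate(text: str, pair: str, content_type: str) -> str:
    """Run the text through each step of the selected ensemble, in order."""
    for step in ENSEMBLES[(pair, content_type)]:
        text = step(text)
    return text

print(translate("x", "zh-en", "patent"))
```

The point of the table-driven design is that adding a new configuration means adding one entry after studying what works, rather than changing a central MT system.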
  5. An instance of this approach is our IPTranslator service for patent/IP/legal translation and I’ll mention patents as an example of a highly complex content type as I go through the rest of the presentation.
  6. To understand this Linguistic Engineering approach, let’s first describe DATA ENGINEERING. Existing approaches to MT typically use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent – AND THAT’S WHY IT’S USED, BECAUSE IT CAN WORK – but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you often need A LOT of data (and many clients simply don’t have it). But then you’re being completely reliant on the data to capture all of the nuances of language and content, and this isn’t enough. We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the content being translated, often technical nuances, terminology etc. that need to be specially accounted for. ***ALSO need to develop special processes for languages… LET’S LOOK AT WHY
  7. But of course it’s not just that easy. Patents, for example, have a range of highly complex linguistic characteristics that make this challenging, both for PROFESSIONAL translators and for translation software. Let’s look for example at this patent: what’s highlighted in blue is a SINGLE sentence (which is an individual legal claim). Additionally, we have to deal with complex technical constructions such as chemical formulae, alphanumeric sequences, even genomic and amino acid sequences.
  8. To quote Sesame Street… or to slightly modify a line from a famous Sesame Street song… “Most of these things are not like the other”. AS A RULE OF THUMB, the more similar languages are to one another, the easier they are for machine translation, particularly in terms of the order of the words in a sentence, and then also grammatically. The closer they are, the more you can get away with just using statistical MT and throwing lots of data into the system. But most of them are not like one another. What if the languages are SO grammatically different from all perspectives?! Like English and Arabic, where Arabic has a different word order, frequently doesn’t have a verb, and affixes pronouns, articles, and conjunctions to verbs (when they ARE there) and nouns. Look at this example, which shows many of these phenomena together. Firstly, the words are in a totally different order if we read it out word for word… and it manages to say all that in 5 words due to all the affixes, compounding, and morphology. Data cannot solve these problems either. Each one of these phenomena needs to be addressed. And that’s where the linguistic knowledge and linguistic engineering come in…
  9. Existing vendors or MT providers use the following process: if a client wants a machine translation system for a certain domain, say IT, they provide the vendor with training data and this gets churned through the various generic processes for each language required. The idea is that by pumping in data in the IT domain, an IT machine translation system comes out at the end. It’s true to a certain extent, but the reality is that the quality often doesn’t cut the mustard. The problem with the data engineering approach is that you need A LOT of data and many clients simply don’t have it. We’ve developed methods to manipulate the machine translation system by designing processes that are highly specific to the CONTENT being translated, often technical nuances, terminology etc. that need to be specially accounted for, AS WELL AS to the LANGUAGE being translated, which again cannot just be a generic process.
  10. Let’s get rid of the concept of a central MT system – statistical, hybrid or whatever. Yes, we have training data and input, we’ll have some output, and some processes, but what is the journey?... Combining these factors is a delicate balance. Sometimes the smallest change can affect things. Sometimes big changes have no effect. It really depends on your training data. That presents a challenge when the training data changes for each system that’s built. LATER, I’ll come back to this and look at some examples where we have QUANTIFIED the impact and the value in taking this approach. BUT FIRST, I want to talk about WHY we took this approach and WHAT we learned over the course of the last few years… ----- Meeting Notes (14/10/2014 12:52) ----- **Good if you can develop the systems with the training data that you know you're going to use...
  11. THAT’S WHAT’S REQUIRED, AND DEVELOPMENT OF THE VARIOUS COGS IS AN ONGOING PROCESS. However, as with most areas of natural language processing (like MT itself as the over-arching process), these things aren’t perfect. You know the way MT is improving? Well, so is syntactic parsing of German, named-entity recognition in Japanese, and Arabic morphological analysis, so it’s about constant iterative improvement. THAT’S WHY THERE ARE NO BREAKTHROUGHS, NO SILVER BULLETS IN MT DEVELOPMENT. We work hard, we improve our German parsing, we improve our German systems a bit… But all of that is easier said than done. When building a technical team to do this, we have to look closely at what sort of skillset we need. Let me tell you, what we came across is quite the high bar. It’s a talent pool that’s thin on the ground for a number of reasons, which I’ll get to… To quote another movie, and a compatriot of mine, Liam Neeson in the film Taken: “You need a very particular set of skills”. Now, this is not per person, but these are skills you really need to have within your team to get the most you can out of your MT systems. **NOW START SLIDES** Over the course of our existence, we’ve identified three key areas in which you need to have expertise in order to be able to develop adequate MT engines for different languages and content types… 1…2…3 ----- Meeting Notes (14/10/2014 15:58) ----- 16 minutes to here
  12. Let’s look first at MT knowledge. THIS IS NOT JUST KNOWING HOW TO RUN MOSES. You can’t treat it as a black box. I believe MT knowledge here is two-fold. You have to know the science (THEORY), and you have to know how to implement the science (PRACTICE). They don’t always go hand in hand… we’re talking implementation from a product development perspective, not from a “let’s hack together my idea in some scripts held together by string so that I can write a paper about my results and it doesn’t really matter how efficiently it works!” perspective. So if we know the theory, say we know how to develop a maximum-entropy classifier to identify chemical names in Korean, we then need to understand the mechanics of the MT engine in order to implement this along with all of the other components in an efficient manner. Examples of machine learning methods: support vector machines, decision trees, neural networks. Examples of probability models: Bayesian, HMMs, maximum likelihood. Examples of programming languages/styles: Java, Python, C++, MapReduce. Examples of data structures: hashmaps, databases. Examples of algorithms: sorting/searching, parsing. OUTRO: one of the biggest challenges in this regard is finding talent with this skillset. MT grads and postgraduates are thin on the ground and many of them are on an academic career path. Couple that with the fact that the research groups are dotted around the world and hiring becomes a real challenge. There was actually an interesting panel about this at the AMTA conference…
  13. So that’s what you need to be able to develop with the MT. With that, what is it that you actually need to develop? Well, we can split this into two sets of components that need to work together. First are those for the DOMAIN, and then those for the LANGUAGE itself. Looking at the DOMAIN KNOWLEDGE required first, what do we need to know? 1. WHAT’S IMPORTANT IN THIS DOMAIN? 2. WHAT TYPES OF DOCUMENTS ARE THERE? 3. ARE THESE CHARACTERISTICS CONSISTENT ACROSS LANGUAGES? 4. WHERE DO WE FIND THIS INFORMATION OUT?
  14. The last piece in the puzzle is understanding the languages you’re developing MT systems for. And that’s not understanding them in isolation – that’s understanding THE RELATIONSHIP between the languages you’re translating to and from, what the differences are between them e.g. many of the things we need to look out for when developing English-Spanish translation engines we don’t need to do for French-Spanish translation
  15. With certain language pairs, things get more complex. The processes that we need to develop are harder to develop, less studied, and require smarter people! For Chinese, we need to identify these DE constructions so we know to move the head noun. There is no tense, so going into English, how do we know which tense to use? There’s no article! We have to generate it! The DE particle has many translations, so which one do we pick? FIRST THINGS FIRST, which ones are the words!? We need to segment the Chinese! ONLY WITH THESE SKILLS CAN YOU EXPLOIT THE TECHNOLOGY TO ITS FULLEST – AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE
  16. AND WHAT DO WE GET IN DOING THIS? MT WITH SUBJECT MATTER EXPERTISE. The whole motivation for this is the same as if you’re hiring a linguist for translation: they simply need to have technical subject matter expertise. Otherwise, how can they understand everything? “If you don’t understand it, you can’t translate it”. The same applies to MT. The training and translation process needs to know what it’s dealing with so it can use the right terms, do the right preprocessing, etc. That’s what we’ve done with our flagship offering, IPTranslator. Systems have subject matter expertise because they were developed with, and evaluated and used by, patent information specialists.
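To make the DE pre-ordering step concrete, here is a toy sketch using the slide's example 种水果的农民 ("grow fruit (particle) farmer"). It assumes the text has already been segmented into tokens and that the phrase contains exactly one DE particle; the function name and the rule itself are illustrative assumptions, and real constructions would need parsing to find the modifier and head boundaries.

```python
from typing import List

DE = "的"

def preorder_de(tokens: List[str]) -> List[str]:
    """Move the head noun in front of a DE-marked modifier.

    Toy rule: treats everything before the single DE particle as the
    modifier and everything after it as the head noun phrase. Any input
    that does not fit this pattern is returned unchanged.
    """
    if tokens.count(DE) != 1:
        return list(tokens)
    i = tokens.index(DE)
    modifier, head = tokens[:i], tokens[i + 1:]
    if not modifier or not head:
        return list(tokens)
    return head + [DE] + modifier

# Assumed segmentation of the slide's example: grow / fruit / DE / farmer
tokens = ["种", "水果", "的", "农民"]
print(preorder_de(tokens))  # ['农民', '的', '种', '水果']
```

After reordering, the tokens read "farmer (particle) grow fruit", which is much closer to the English word order of "the farmer who grows fruit" and so is easier for a statistical decoder to handle.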
  17. General advantages of this approach to MT ANALOGY of buying fresh fruit…
  18. Obviously one of the issues in adopting machine translation technology is the risk that’s involved. You invest in a program, it doesn’t deliver straight away, it might start bringing you returns, but when? How long? If ever? If we look specifically at the approach we’ve taken: our proposition helps to de-risk the adoption of MT from A QUALITY PERSPECTIVE and a DELIVERY PERSPECTIVE. A typical setup involves: data, across all languages. How much do you have? Is that enough? Is it clean? Is it yours to give away? Time: how long is development going to take? Will MT be good enough straight away after that? If not, when? What’s the upfront cost for customisation or subscription to the service?
  19. That’s the value for the users for the whole concept, but what if we get down to the nuts and bolts of it and talk about the value in terms of the returns… what does using this type of MT get you? To give an illustration, I’ll run through 3 quick examples and case studies from our own experiences. The first will look at what this does in terms of straight-up quality of the MT output. After that, no pun intended, we’ll see how that translates to productivity when post-editing the output. Finally, we’ll look at what you can do in terms of customisation, with minimal effort, when you have these systems built and ready to go.
  20. All of these examples use our IPTranslator systems, which have been developed for patent machine translation. First, in terms of MT quality and BLEU scores, here are evaluation results for our Portuguese to English engines across 8 different patent technical areas. Now, while the BLEU scores don’t necessarily have too much meaning by themselves, there’s a clear distinction in the quality of the Iconic output compared to Google Translate and an out-of-the-box Systran engine. These engines are comparable here because we take the assumption that the client has no additional data with which to build an engine from scratch, so we need an “existing” option. These results correlated well with human assessment of adequacy, another example of which we can look at here…
  21. For our German to English system, we had 3 evaluators look at around 400 segments each and rank them from 1-5 in terms of how adequately they carried the meaning from the source to the target. Typically, a score of 3 or higher indicates that the segments are “usable” – i.e. readable and understandable. So they’re just a couple of brief examples to show that this approach is developing systems that can produce good quality output, without the need for additional adaptation for each individual user. I want to now look at a case study that illustrates how these systems, as they are, with these levels of quality, can produce output that leads to more productive post-editing…
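As a quick check of the arithmetic behind the slide, the sketch below averages the three per-evaluator adequacy scores shown on the Case 1 chart and compares the result with the "usable" level of 3. It assumes the slide's average is a plain mean of the three evaluator scores; the variable names are illustrative.

```python
# Per-evaluator adequacy scores as shown on the Case 1 slide (1 to 5 scale).
scores = {"Evaluator 1": 2.83, "Evaluator 2": 4.0, "Evaluator 3": 3.86}

# From the talk: a score of 3 or higher typically indicates "usable" output.
USABLE_THRESHOLD = 3.0

average = sum(scores.values()) / len(scores)
is_usable = average >= USABLE_THRESHOLD

print(round(average, 2), is_usable)  # 3.56 True
```

Note that one evaluator's own score (2.83) sits just under the usable level even though the average is above it, which is a reminder to look at the spread as well as the mean.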
  22. This is a case study with WeLocalize who had a particular business need…
  23. For English to Chinese MT…
  24. Used on a daily basis. So this ongoing improvement through incorporation of client-specific data is related to our third case study, about how these engines that we’ve been building with linguistic engineering can serve as a solid backbone for customized engines…
  25. This is a case with another of our clients, who have a substantial patent translation business. They had a slightly different need in that, rather than the translation of patent documents themselves, they wanted to translate what are known as Written Opinions, essentially reports from patent examiners about the validity of a patent application. From an MT perspective, while a lot of the technical terminology is the same, the register is completely different. These written opinions contain first person, questions, opinions: sentence structures and words that just aren’t in patents and consequently not in our original systems. If we look at how our systems performed when trying to handle this, we get a BLEU score of around 21, where Google, a system designed for whatever’s thrown at it, gets a score of 20, so around the same. What we need to do is modify these systems for this particular type of text. What we had at hand to do this was some TMs from our client, not much though: it amounted to around 0.25% of the amount of data we’d trained our original engines with. We also developed a couple of processes to add to our ensemble architecture to handle specifics of these reports, such as consistent references to PCT (Patent Cooperation Treaty) Regulations. This resulted in the performance more than doubling….
  26. In terms of how this correlated into post-editing productivity for the client, well, let’s look at this scatter plot. Each dot is a segment in our test. Along the horizontal axis we have the length of the segment in words. On the vertical axis we have a proprietary score that correlates with post-editing productivity, whereby a score of 0.4 is, roughly, the threshold for seeing some productivity gain from post-editing. Above that means most likely not, and the lower the score, the less editing is required. So here we can see that only a small portion of the segments fall below the threshold so, basically, the document (which is essentially out of domain) is NOT VIABLE for this MT system.
  27. However, AFTER we do the customisation we see that a large number of the segments drop below the line, a bit over 60% of them, with quite a few hitting the 0 score also. When we run the numbers on these, they lead to productivity gains of around 25% ----- Meeting Notes (13/10/2014 17:03) ----- The heavy lifting has been done
  28. Some of these points may be obvious, but allow me to elaborate. All content is not created equal (to modify a well-known phrase); as such, the (machine) translation process has to be different. We cannot afford to be dogmatic when it comes to MT; one size does not fit all. If we are practitioners of SMT, we’re restricting ourselves. Even being “hybrid” is restrictive. It’s SMT + rules, or rule-based + statistical post-editing. Domain specific MT is about more than just data; a sufficient amount of good quality clean training data is obviously a key component in the MT training process (especially for SMT) but it’s not everything. To use a cooking analogy, data is to MT what the ingredients are to a chef. The chef (in this case the training/development process) needs to know what to do with the ingredients. To bring it back to MT, the training and translation processes need to be informed by the data, by the content type and the subject matter. Training is sensitive to data. So you could have the most refined approach but data will be the biggest variable to quality. Our approach allows us to deliver high quality “out of the box” which we then refine, as opposed to the great unknown of training from scratch.
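The before/after comparison in the last two notes boils down to counting how many segments fall below the productivity threshold. The sketch below shows that calculation; the 0.4 threshold comes from the talk, but the function name and the per-segment scores are invented purely for illustration and are not the study's data.

```python
from typing import List

# From the talk: scores below roughly 0.4 suggest post-editing will pay off.
THRESHOLD = 0.4

def share_below(scores: List[float], threshold: float = THRESHOLD) -> float:
    """Return the fraction of segments whose score is below the threshold."""
    if not scores:
        raise ValueError("need at least one segment score")
    return sum(1 for s in scores if s < threshold) / len(scores)

# Hypothetical per-segment scores (lower means less editing required).
before = [0.9, 0.7, 0.35, 0.8, 0.6]
after = [0.3, 0.0, 0.35, 0.5, 0.2]

print(share_below(before), share_below(after))  # 0.2 0.8
```

A system is a plausible candidate for post-editing only when a large share of segments sits below the line, which is the shift the customization is described as producing.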