Build your own statistical engines

Rubén Rodríguez de la Fuente

Build Your Own
Statistical Machine
Translation Engines
Ruben de la Fuente

About Me

• 4-year degree in translation
• Worked as translator for 10+ years
• Working full time in MT for the past
year

Agenda

• Quick comparison with RbMT
• Fundamentals of SMT
• Requirements and preparation
• Using DoMY

Disclaimer

• I’m not saying SMT is better
• I’m not saying SMT is right for you

Statistical Machine Translation

The computer learns to translate through
statistical analysis of alignments in
bilingual corpora

Rule-based Machine Translation

User Dictionaries + Grammar and
translation rules

SMT: Pros and Cons
Pros              Cons
Quick to build    Unpredictable
Cheap             Quick improvements not easy
Fluent

Features of an SMT system

• Translation Model: table containing
source and target phrases, together
with a probability score (accuracy)
• Language Model: list of n-grams
(sequences of n words) in the target
language, together with a probability
score (fluency)
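
To make the language model idea concrete, here is a minimal sketch that estimates bigram probabilities by relative frequency from a toy corpus. The function name and the toy sentences are illustrative assumptions, and real toolkits add smoothing, which is left out here.

```python
from collections import Counter

def train_ngram_lm(sentences, n=2):
    """Estimate P(word | previous n-1 words) by relative frequency (no smoothing)."""
    ngram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # Pad with sentence-boundary markers so the first word has a context.
        tokens = ["<s>"] * (n - 1) + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            ngram_counts[ngram] += 1
            context_counts[ngram[:-1]] += 1
    return {ng: count / context_counts[ng[:-1]] for ng, count in ngram_counts.items()}

# Toy target-language corpus (hypothetical data, for illustration only).
toy_corpus = ["haga clic en enviar", "haga clic en guardar"]
lm = train_ngram_lm(toy_corpus, n=2)
print(lm[("haga", "clic")])  # 1.0: "clic" always follows "haga" in this corpus
print(lm[("en", "enviar")])  # 0.5: "en" is followed by "enviar" in one of two cases
```

A translation model is conceptually similar, except that the table maps source phrases to target phrases instead of a context to its next word.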

Language and Translation Models
• LM (fluency) • TM (accuracy)

Tokenization and recasing
Tokenization: breaking up text into meaningful units (tokens)
Lowercasing: lowercase all words
File > file
file? > file ?
file. > file .
File! > file !
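
The four examples above can be reproduced with a short sketch. This is a simplified regex-based tokenizer written for illustration, not the tokenizer that ships with any SMT toolkit; production tokenizers also handle abbreviations, numbers and URLs.

```python
import re

def tokenize_and_lowercase(text):
    """Split punctuation off as separate tokens, then lowercase every token."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(token.lower() for token in tokens)

for sample in ["File", "file?", "file.", "File!"]:
    print(sample, ">", tokenize_and_lowercase(sample))
```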

Requirements: Computing

• 4 GB RAM PC needed
• Ubuntu 10.04 64-bit OS
• Virtual Machine OK

Requirements: size

MS Translator Hub recommends at least
10k segments
I have gotten good results with 100-200k
segments
Roughly a corpus of over 1 million words

Publicly Available Corpora

• Opus (ECB, EMA, OpenOffice)
• Acquis Communautaire
• Europarl
• Hansard
• Multilingual websites: Bitextor

Bitextor is Cunning

www.mywebsite.com/en/overview.html
www.mywebsite.com/es/overview.html
<title>My source text</title>
<title>My target text</title>
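
The URL pattern on this slide suggests one way candidate page pairs could be found: two addresses that differ only in the language folder. The sketch below illustrates that idea only. It is not Bitextor's actual algorithm, and the function name and the fixed en/es language codes are assumptions.

```python
import re
from collections import defaultdict

# Assumption: the language appears as a two-letter folder in the URL path.
LANG_SEGMENT = re.compile(r"/(en|es)/")

def pair_urls(urls, source_lang="en", target_lang="es"):
    """Pair URLs that are identical except for the language folder."""
    by_template = defaultdict(dict)
    for url in urls:
        match = LANG_SEGMENT.search(url)
        if match:
            # Replace the language folder with a placeholder to get a shared key.
            template = url[:match.start()] + "/{lang}/" + url[match.end():]
            by_template[template][match.group(1)] = url
    return [
        (versions[source_lang], versions[target_lang])
        for versions in by_template.values()
        if source_lang in versions and target_lang in versions
    ]

pairs = pair_urls([
    "www.mywebsite.com/en/overview.html",
    "www.mywebsite.com/es/overview.html",
    "www.mywebsite.com/en/contact.html",  # no Spanish counterpart, so it is skipped
])
print(pairs)
```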

Requirements: relevance

Data needs to be in-domain

Requirements: quality

Garbage in, garbage out
Diagnose your TMs with automated QA
checks (e.g. glossary adherence, length)
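
As a rough illustration of what such checks do, the sketch below flags empty segments, suspicious length ratios and glossary violations in a single translation unit. The function name, the 2.0 ratio threshold and the sample glossary are assumptions made for this example; a dedicated QA tool offers far more checks.

```python
def check_segment(source, target, glossary, max_length_ratio=2.0):
    """Return a list of QA issues found in one source/target pair."""
    issues = []
    if not source.strip() or not target.strip():
        issues.append("empty segment")
        return issues
    # Length check: very different lengths often signal a misalignment.
    ratio = max(len(source), len(target)) / min(len(source), len(target))
    if ratio > max_length_ratio:
        issues.append("length ratio")
    # Glossary check: if the source term occurs, the approved target term must too.
    for source_term, target_term in glossary.items():
        if source_term.lower() in source.lower() and target_term.lower() not in target.lower():
            issues.append(f"glossary: {source_term!r} should be {target_term!r}")
    return issues

glossary = {"Send": "Enviar"}  # hypothetical glossary entry
print(check_segment("Click Send", "Haga clic en Enviar", glossary))  # []
print(check_segment("Click Send", "Haga clic en Mandar", glossary))
```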

Remove Markup

Markup brings noise to the learning
process
Click <strong>Send</strong>
Haga clic en <strong>Enviar</strong>
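
A minimal sketch of markup removal applied to the two segments above. The regular expression is a simplification that assumes well-formed inline tags; it does not decode HTML entities or handle other placeholder formats.

```python
import re

TAG = re.compile(r"<[^>]+>")

def strip_markup(segment):
    """Remove inline tags and collapse any leftover whitespace."""
    without_tags = TAG.sub("", segment)
    return re.sub(r"\s+", " ", without_tags).strip()

print(strip_markup("Click <strong>Send</strong>"))          # Click Send
print(strip_markup("Haga clic en <strong>Enviar</strong>"))  # Haga clic en Enviar
```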

Do-Moses-Yourself (DoMY)

Moses: state-of-the-art, widely used
open-source SMT toolkit
DoMY: extension of Moses that makes
installation and configuration easier

Online SMT Portals
Portals         Cons
letsmt.eu       NDA-compliance
smartmate.co    Availability
                Speed

DoMY (Basics)

Graphs: import-tmx, clean-LM/TM, build
LM/TM, train, translate.
Ini files: configuration (language pairs,
paths for input and output).
Folder structure: always include
superdomain, domain and subdomain

Graphs
Graph       Function                                        Input            Output
Import-tmx  Extract data from tmx files                     Raw              Corpora/sa
Clean-tm    Clean data                                      Corpora/sa       Corpora/ready
Build-lm    Prepares training set for LM                    Corpora/ready    builds
Build-tm    Prepares training set for TM                    Corpora/ready    builds
Train       Trains MT engine                                Builds           engines
Translate   Translates input files and produces tmx output  Translations/in  Translations/out

Tips for settings

LM: 7-gram
TM: 9-gram
Aligner: Berkeley for distant languages

Troubleshoot

Error message in terminal
Log file in graph folder
DoMT QA

Is Your Engine Good?

A set is excluded from training to be used
for evaluation (598 segments)
With a BLEU score of 0.5 or higher (on a
0-1 scale), the engine is likely to
perform well
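
For readers who have not met BLEU before, the sketch below shows roughly how the score is computed: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It is a simplified single-reference version written for illustration, with whitespace tokenization and no smoothing, so its figures will not match an established scoring tool exactly. The sample sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-1 scale, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)

reference = ["haga clic en el botón guardar ahora"]
print(corpus_bleu(reference, reference))  # identical output scores 1.0
print(corpus_bleu(["haga clic en el botón enviar ahora"], reference))  # one wrong word lowers the score
```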

Keep Improving

Retrain the engine periodically as more
translation data becomes available
Gather feedback on what needs to be
improved

Statistical PE

• Keep a corpus of raw MT output vs. post-edited (PE) output
• Treat them as a separate language pair
• Run them through DoMY
• Create a raw > PE engine
• 2 engines: source > target, raw > PE
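
The two-engine setup can be pictured as a simple chain: the output of the source > target engine becomes the input of the raw > PE engine. In the sketch below the two toy functions are hypothetical stand-ins for trained engines, used only to show how the steps connect.

```python
from typing import Callable

Engine = Callable[[str], str]

def chain(mt_engine: Engine, pe_engine: Engine) -> Engine:
    """Combine a source > target engine with a raw > PE engine."""
    def translate(source: str) -> str:
        raw = mt_engine(source)  # first pass: raw machine translation
        return pe_engine(raw)    # second pass: automatic post-editing
    return translate

# Toy stand-ins for the two trained engines (assumptions for illustration).
def toy_mt(source: str) -> str:
    return {"Click Send": "Haga click en Enviar"}.get(source, source)

def toy_pe(raw: str) -> str:
    return raw.replace("Haga click", "Haga clic")

translate = chain(toy_mt, toy_pe)
print(translate("Click Send"))  # Haga clic en Enviar
```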

Questions?
Speak now…
Or reach me at:
www.facebook.com/xlation
www.wordbonds.es
@rubendelafuente
http://www.linkedin.com/in/rubendelafuente

17. CheckMate: General

18. CheckMate: Length

19. CheckMate: Terminology

20. Remove Repetitions

25. Folder structure corpus graphs

26. Run from terminal Edit ini Command line

27. Running from GUI

Editor's notes

Why? SMT is based on probability, calculated as the number of occurrences of a given token divided by the total number of tokens. Case and punctuation can disrupt the calculation.
To get good results with SMT, you need at least around 10,000 segments
Using Olifant from Okapi Framework
Clean data: remove empty sentences and sentences that are too long or too short
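
The first note's point about case and punctuation can be checked with a quick count. The sketch below computes the relative frequency of "file" before and after normalization; the sample text and the function name are illustrative assumptions.

```python
import re
from collections import Counter

def relative_frequency(token, tokens):
    """Occurrences of a token divided by the total number of tokens."""
    return Counter(tokens)[token] / len(tokens)

text = "File file? file. File!"

# Without normalization, every variant counts as a different token.
raw_tokens = text.split()
print(relative_frequency("file", raw_tokens))  # 0.0: "file" never appears on its own

# After lowercasing and splitting off punctuation, the four variants collapse into one.
normalized_tokens = re.findall(r"\w+|[^\w\s]", text.lower())
print(relative_frequency("file", normalized_tokens))  # 4 of 7 tokens
```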
