SlideShare ist ein Scribd-Unternehmen logo
1 von 34
Build Your Own
 Statistical Machine
Translation Engines
            Ruben de la Fuente
About Me

• 4-year degree in translation
• Worked as translator for 10+ years
• Working full time in MT for the past
  year
Agenda


•   Quick comparison with RbMT
•   Fundamentals of SMT
•   Requirements and preparation
•   Using DoMY
Disclaimer



• I’m not saying SMT is better
• I’m not saying SMT is right for you
Statistical Machine Translation

Computer learns to translate through
statistical analysis of alignment in
bilingual corpora
Rule-based Machine Translation

User Dictionaries + Grammar and
translation rules
SMT: Pros and Cons
Pros              Cons

Quick to build    Unpredictable
Cheap             Quick
Fluent            improvements not
                  easy
Features of an SMT system

• Translation Model: table containing
  source and target phrases, together
  with a probability score (accuracy)
• Language Model: list of sequences of
  n-words in target language together
  with a probability score (fluency)
Language and Translation Models
• LM (fluency)     • TM (accuracy)
Tokenization and recasing
Breaking up text in        Lowercase all words
meaningul units (tokens)
                           File > file
                           file? > file ?
                           file. > file .
                           File! > file !
Requirements: Computing


•4 GB RAM PC needed
•Ubuntu 10.04 64-bit OS
•Virtual Machine OK
Requirements: size

MS Translator Hub recommends at least
10k segments
I have gotten good results with 100-200k
segments
Roughly over 1 million words corpus
Publicly Available Corpora

• Opus (ECB, EMA, OpenOffice)
• Acquis Communautaire
• Europarl
• Hansard
• Multilingual websites: Bitextor
Bitextor is Cunning

www.mywebsite.com/en/overview.html
www.mywebsite.com/es/overview.html
<title>My source text</title>
<title>My target text</title>
Requirements: relevance


Data needs to be in-domain
Requirements: quality

Garbage in, garbage out
Diagnose your TMs with automated QA
checks (e.g. glossary adherence, length)
CheckMate: General
CheckMate: Length
CheckMate: Terminology
Remove Repetitions
Remove Markup

Markup brings noise to the learning
process
Click <strong>Send</strong>
Haga clic en <strong>Enviar</strong>
Do-Moses-Yourself (DoMY)

Moses: state-of-the-art extensively used
open source SMT toolkit
DoMY: extension of Moses making
installation and configuration easier
Online SMT Portals
                  Cons
letsmt.eu
                  NDA-compliance
smartmate.co      Availability
                  Speed
DoMY (Basics)

Graphs: import-tmx, clean-LM/TM, build
LM/TM, train, translate.
Ini files: configuration (language pairs,
paths for input and output).
Folder structure: always include
superdomain, domain and subdomain
Folder structure
corpus           graphs
Run from terminal
Edit ini            Command line
Running from GUI
Graphs
Graph        Function             Input       Output
Import-tmx   Extract data from    Raw         Corpora/sa
             tmx files
Clean-tm     Clean data           Corpora/sa Corpora/re
                                             ady
Build-lm     Prepares training    Corpora/re builds
             set for LM           ady
Build-tm     Prepares training    Corpora/re builds
             set for TM           ady

Train        Trains MT engine     Builds      engines
Translate    Translates input     Translation Translation
             files and produces   s/in        s/out
             tmx output
Tips for settings

LM: 7-gram
TM: 9-gram
Aligner: Berkeley for distant languages
Troubleshoot

Error message in terminal
Log file in graph folder
DoMT QA
Is Your Engine Good?

A set is excluded from training to be used
for evaluation (598 segments)
From 0.5 BLEU points, engine is likely to
perform well
Keep Improving

Retrain the engine periodically as more
translation corpus become available
Gather feedback on what needs to be
improved
Statistical PE

• Keep a corpus of raw vs. PE
• Treat them as separate language pairs
• Run them thru DoMY
• Create raw vs. PE engine
• 2 engines: source > target, raw > PE
Questions?
Speak now…
Or reach me at:
www.facebook.com/xlation
www.wordbonds.es
@rubendelafuente
http://www.linkedin.com/in/rubendelafuente

Weitere ähnliche Inhalte

Ähnlich wie Build your own statistical engines

SDL BeGlobal The SDL Platform for Automated Translation
SDL BeGlobal The SDL Platform for Automated TranslationSDL BeGlobal The SDL Platform for Automated Translation
SDL BeGlobal The SDL Platform for Automated TranslationSDL Trados
 
Tms days 04 2012 manuel herranz pangea mt
Tms days 04 2012 manuel herranz pangea mtTms days 04 2012 manuel herranz pangea mt
Tms days 04 2012 manuel herranz pangea mtManuel Herranz
 
Putting Compilers to Work
Putting Compilers to WorkPutting Compilers to Work
Putting Compilers to WorkSingleStore
 
New Breakthroughs in Machine Transation Technology
New Breakthroughs in Machine Transation TechnologyNew Breakthroughs in Machine Transation Technology
New Breakthroughs in Machine Transation Technologykantanmt
 
Lexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLoriThicke
 
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...TAUS - The Language Data Network
 
System Programing Unit 1
System Programing Unit 1System Programing Unit 1
System Programing Unit 1Manoj Patil
 
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Jose Luis Bonilla Sánchez
 
Alchemy Catalyst Automation
Alchemy Catalyst AutomationAlchemy Catalyst Automation
Alchemy Catalyst AutomationShamusd
 
5 challenges of scaling l10n workflows KantanMT/bmmt webinar
5 challenges of scaling l10n workflows KantanMT/bmmt webinar5 challenges of scaling l10n workflows KantanMT/bmmt webinar
5 challenges of scaling l10n workflows KantanMT/bmmt webinarkantanmt
 
computer architecture and organization.ppt
computer architecture and organization.pptcomputer architecture and organization.ppt
computer architecture and organization.pptmuhammadosama0121
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTkantanmt
 
unit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdfunit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdfDrIsikoIsaac
 
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerA Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerSafe Software
 
Computer organization basics
Computer organization  basicsComputer organization  basics
Computer organization basicsDeepak John
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesInductive Automation
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesInductive Automation
 
Compiler Design Introduction
Compiler Design Introduction Compiler Design Introduction
Compiler Design Introduction Thapar Institute
 
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...TAUS - The Language Data Network
 
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerA Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerSafe Software
 

Ähnlich wie Build your own statistical engines (20)

SDL BeGlobal The SDL Platform for Automated Translation
SDL BeGlobal The SDL Platform for Automated TranslationSDL BeGlobal The SDL Platform for Automated Translation
SDL BeGlobal The SDL Platform for Automated Translation
 
Tms days 04 2012 manuel herranz pangea mt
Tms days 04 2012 manuel herranz pangea mtTms days 04 2012 manuel herranz pangea mt
Tms days 04 2012 manuel herranz pangea mt
 
Putting Compilers to Work
Putting Compilers to WorkPutting Compilers to Work
Putting Compilers to Work
 
New Breakthroughs in Machine Transation Technology
New Breakthroughs in Machine Transation TechnologyNew Breakthroughs in Machine Transation Technology
New Breakthroughs in Machine Transation Technology
 
Lexcelera MT Breaking Compromises
Lexcelera MT Breaking CompromisesLexcelera MT Breaking Compromises
Lexcelera MT Breaking Compromises
 
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
TAUS MT SHOWCASE, The WeMT Program, Olga Beregovaya, Welocalize, 10 October 2...
 
System Programing Unit 1
System Programing Unit 1System Programing Unit 1
System Programing Unit 1
 
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018Evaluation of MT Quality/Productivity at eBay - AMTA 2018
Evaluation of MT Quality/Productivity at eBay - AMTA 2018
 
Alchemy Catalyst Automation
Alchemy Catalyst AutomationAlchemy Catalyst Automation
Alchemy Catalyst Automation
 
5 challenges of scaling l10n workflows KantanMT/bmmt webinar
5 challenges of scaling l10n workflows KantanMT/bmmt webinar5 challenges of scaling l10n workflows KantanMT/bmmt webinar
5 challenges of scaling l10n workflows KantanMT/bmmt webinar
 
computer architecture and organization.ppt
computer architecture and organization.pptcomputer architecture and organization.ppt
computer architecture and organization.ppt
 
EAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMTEAMT Workshop 2015 - KantanMT
EAMT Workshop 2015 - KantanMT
 
unit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdfunit1pdf__2021_12_14_12_37_34.pdf
unit1pdf__2021_12_14_12_37_34.pdf
 
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerA Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
 
Computer organization basics
Computer organization  basicsComputer organization  basics
Computer organization basics
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best Practices
 
Design Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best PracticesDesign Like a Pro: Scripting Best Practices
Design Like a Pro: Scripting Best Practices
 
Compiler Design Introduction
Compiler Design Introduction Compiler Design Introduction
Compiler Design Introduction
 
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...
TAUS MT SHOWCASE, Creating Competitive Advantage with Rapid Customization & D...
 
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME ServerA Safer's Guide to Best Practices for Optimizing Jobs on FME Server
A Safer's Guide to Best Practices for Optimizing Jobs on FME Server
 

Mehr von Rubén Rodríguez de la Fuente (13)

¿Me entiende el ordenador cuando hablo?
¿Me entiende el ordenador cuando hablo?¿Me entiende el ordenador cuando hablo?
¿Me entiende el ordenador cuando hablo?
 
Tips and tricks for PE
Tips and tricks for PETips and tricks for PE
Tips and tricks for PE
 
Trados studio 09 gestores
Trados studio 09 gestoresTrados studio 09 gestores
Trados studio 09 gestores
 
Trados studio 09 traductores
Trados studio 09 traductoresTrados studio 09 traductores
Trados studio 09 traductores
 
Presencia internet
Presencia internetPresencia internet
Presencia internet
 
Resources for translators
Resources for translatorsResources for translators
Resources for translators
 
L10 n case study
L10 n case studyL10 n case study
L10 n case study
 
Trayectoria ruben
Trayectoria rubenTrayectoria ruben
Trayectoria ruben
 
El traductor en plantilla
El traductor en plantillaEl traductor en plantilla
El traductor en plantilla
 
Presencia internet
Presencia internetPresencia internet
Presencia internet
 
Translators on the go
Translators on the go Translators on the go
Translators on the go
 
Taller de traducción automática
Taller de traducción automáticaTaller de traducción automática
Taller de traducción automática
 
FOSS4XL8Rs
FOSS4XL8RsFOSS4XL8Rs
FOSS4XL8Rs
 

Kürzlich hochgeladen

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Kürzlich hochgeladen (20)

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Build your own statistical engines

Hinweis der Redaktion

  1. Why? SMT is based in probability, calculated as # of a given token / total amount of tokens. Case and punctuation can disrupt the calculation.
  2. To get good results with SMT, you need around 10.000 segments at least
  3. Using Olifant from Okapi Framework
  4. Clean data: remove too long/short, empty sentences