How to build your own translator in 15 minutes
Neural Machine Translation in practice
Bartek Rozkrut
2040.io
Why so important?
40 billion USD / year industry
Huge barrier for many people
Provide unlimited access to knowledge
Scale NLP problems
Why your own translator?
1. Private / sensitive data
2. Huge amounts of data, e.g. e-mail translation (cost)
3. Off-line / off-cloud / on-premise
4. Custom domain-specific translation / vocabulary
Neural Machine Translation – example workflow
1. Download Parallel Corpus files
2. Append all corpus files (source + target) in same order
3. Split TRAIN / VAL set
4. Tokenization
5. Preprocess
6. Train
7. Release model (CPU compatible)
8. Translate!
9. REPEAT! 
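Step 2 of the workflow is easy to get wrong: the source and target sides only stay parallel if the corpora are appended in the same order on both sides. The following is a minimal sketch of that step, not the exact procedure used in the talk. The corpus file names (corpusA.pl, corpusB.en and so on) and their contents are made up for illustration; src.txt and tgt.txt match the names used in the later split commands.

```shell
#!/usr/bin/env bash
# Sketch of workflow step 2: concatenate several parallel corpora so that
# line N of src.txt still pairs with line N of tgt.txt.
# corpusA.* and corpusB.* are hypothetical stand-ins for downloaded files.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Two tiny fake corpora (PL source, EN target), one sentence per line.
printf 'Dziękuję.\nDzień dobry.\n' > corpusA.pl
printf 'Thank you.\nGood morning.\n' > corpusA.en
printf 'Dobranoc.\n' > corpusB.pl
printf 'Good night.\n' > corpusB.en

# Append source files and target files in the SAME corpus order (A, then B).
cat corpusA.pl corpusB.pl > src.txt
cat corpusA.en corpusB.en > tgt.txt

# Sanity check: both sides must have the same number of lines.
# The $(( ... )) wrapper strips padding that some wc versions print.
src_lines=$(( $(wc -l < src.txt) ))
tgt_lines=$(( $(wc -l < tgt.txt) ))
if [ "$src_lines" -ne "$tgt_lines" ]; then
    echo "line count mismatch: $src_lines vs $tgt_lines" >&2
    exit 1
fi

# Show the aligned pairs side by side (tab-separated) for a visual check.
paste src.txt tgt.txt
```

An equal line count does not prove the pairs are aligned, so spot-checking a few lines of the paste output is still worthwhile.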
Parallel Corpus – public data
HTTP://OPUS.LINGFIL.UU.SE
Parallel Corpus (source file – PL, EUROPARL)
1. Tytuł: Admirał NATO potrzebuje przyjaciół.
2. Dziękuję.
3. Naprawdę potrzebuję...
4. Ten program stał się katalizatorem. Następnego dnia setki osób chciały mnie dodać do znajomych. Indonezyjczycy i Finowie Pisali: "Admirale, słyszeliśmy, że potrzebuje pan znajomych, a tak przy okazji, co to jest NATO?"
Parallel Corpus (target file – EN, EUROPARL)
1. The headline was: NATO Admiral Needs Friends.
2. Thank you.
3. Which I do.
4. And the story was a catalyst, and the next morning I had hundreds of Facebook friend requests from Indonesians and Finns, mostly saying, "Admiral, we heard you need a friend, and oh, by the way, what is NATO?"
Vocabulary
1. Word level
2. Sub-word level (e.g. Byte Pair Encoding)
3. Character level
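The choice of level trades vocabulary size against sequence length. The toy sketch below counts both for the word and character levels on a made-up string chosen so that it has more distinct words than distinct letters, as a real corpus does. Sub-word units such as BPE fall between the two but need learned merge rules, which are not shown here. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Toy comparison of vocabulary size and sequence length at two granularities.
# The text is made up and ASCII only, because some fold implementations
# split by byte rather than by character.
set -euo pipefail

text='eat tea ate at tee'

# Word level: one token per space-separated word.
word_vocab=$(( $(printf '%s\n' "$text" | tr ' ' '\n' | sort -u | wc -l) ))
word_len=$(( $(printf '%s\n' "$text" | tr ' ' '\n' | wc -l) ))

# Character level: one token per character (spaces dropped for simplicity).
char_vocab=$(( $(printf '%s\n' "$text" | tr -d ' ' | fold -w1 | sort -u | wc -l) ))
char_len=$(( $(printf '%s\n' "$text" | tr -d ' ' | fold -w1 | wc -l) ))

echo "word level: vocab=$word_vocab, sequence length=$word_len"
echo "char level: vocab=$char_vocab, sequence length=$char_len"
```

In this example the word level has the larger vocabulary and the shorter sequence, while the character level has the smaller vocabulary and the longer sequence.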
BLEU
HTTP://OPENNMT.NET/
OPENNMT – DECEMBER 2016
HTTPS://GOOGLE.GITHUB.IO/SEQ2SEQ/
GOOGLE’S SEQ2SEQ – MARCH 2017
Our experience from PL=>EN training
1. 100k vocabulary (word-level)
2. Bidirectional LSTM, 2 layers, RNN size 500
3. 5M sentences from public data sources
4. ~20 BLEU
OpenNMT – run Docker container
Run CPU-based interactive session with command:
sudo docker run -it 2040/opennmt bash
Run GPU-based interactive session with command:
sudo nvidia-docker run -it 2040/opennmt bash
OpenNMT – split parallel corpus
# First 90% of the lines go to xaa (train), the remaining 10% to xab (validation)
split -l $(( $(wc -l < src.txt) * 9 / 10 )) src.txt
mv xaa train-src.txt
mv xab val-src.txt
split -l $(( $(wc -l < tgt.txt) * 9 / 10 )) tgt.txt
mv xaa train-tgt.txt
mv xab val-tgt.txt
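The same 90/10 split can also be written with head and tail, which avoids depending on split's default xaa/xab output names. This is an alternative sketch under the assumption that src.txt and tgt.txt have equal line counts; the 20-line toy files are generated only to make it self-contained.

```shell
#!/usr/bin/env bash
# Alternative 90/10 split using head/tail on toy data.
# The generated sentences are placeholders, not real corpus content.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Build two aligned 20-line toy files.
seq 1 20 | sed 's/^/src sentence /' > src.txt
seq 1 20 | sed 's/^/tgt sentence /' > tgt.txt

# Number of training lines: 90% of the total, rounded down.
total=$(( $(wc -l < src.txt) ))
train_n=$(( total * 9 / 10 ))

# Cut both sides at the same line so the pairs stay aligned.
for side in src tgt; do
    head -n "$train_n" "$side.txt" > "train-$side.txt"
    tail -n "+$(( train_n + 1 ))" "$side.txt" > "val-$side.txt"
done

wc -l train-src.txt val-src.txt train-tgt.txt val-tgt.txt
```

Both variants take the validation set from the end of the file. If the corpora were appended one after another, shuffling source and target together before splitting may give a more representative validation set.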
OpenNMT – preprocess parallel corpus
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-src.txt > train-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < train-tgt.txt > train-tgt.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-src.txt > val-src.txt.tok
th tools/tokenize.lua -joiner_annotate -mode aggressive < val-tgt.txt > val-tgt.txt.tok
th preprocess.lua -train_src train-src.txt.tok -train_tgt train-tgt.txt.tok -valid_src val-src.txt.tok -valid_tgt val-tgt.txt.tok -save_data _data
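The four tokenize calls differ only in the file name, so they could be driven by a loop. The sketch below is a dry run: it prints the commands rather than executing them, so it works without Torch or OpenNMT installed. The function name tokenize_cmds is made up; the flags are copied from the commands above.

```shell
#!/usr/bin/env bash
# Dry run: print the four tokenize commands instead of executing them.
# tokenize_cmds is a hypothetical helper name, not part of OpenNMT.
set -euo pipefail

tokenize_cmds() {
    local name
    for name in train-src train-tgt val-src val-tgt; do
        # Same flags as the hand-written commands; only the file name varies.
        printf 'th tools/tokenize.lua -joiner_annotate -mode aggressive < %s.txt > %s.txt.tok\n' \
            "$name" "$name"
    done
}

tokenize_cmds
```

Inside an OpenNMT checkout that contains the four input files, piping the output into bash (tokenize_cmds | bash) would presumably run the commands for real.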
OpenNMT – train && release && translate
th train.lua -data _data-train.t7 -layers 2 -rnn_size 500 -brnn -save_model model -gpuid 1
th tools/release_model.lua -model model.t7 -gpuid 1
th translate.lua -model model.t7 -src val-src.txt.tok -output file-tgt.tok -gpuid 1
Best hyperparams from 250k GPU hours (thx Google)
HTTPS://ARXIV.ORG/ABS/1703.03906
Other applications
1. Image to text
2. OCR (e.g. Tesseract OCR v4.0 – LSTM)
3. Lip reading
4. Simple Q&A
5. Chatbots
HTTP://WEB.STANFORD.EDU/CLASS/CS224N/
SLIDES USED WITH PERMISSION FROM RICHARD SOCHER
Thanks!
Bartek Rozkrut
bartek@2040.io

AIMeetup #4: Neural-machine-translation
