What makes learning deep?
Schedule
• Kevin Duh - Fundamentals for DL4MT I & II
• Lab 1 - Prep and setup. Compare logistic regression, MLP, and stacked auto-encoders on data
• Lab 2 - word embedding (SGNS), visualization.
• Hermann Ney - Neural LMs and TMs for SMT I & II
• Lab 1 - Rescore n-best lists using RNN LM
• Lab 2 - n-best rescoring using uni/bidirectional translation and joint models
• Kyunghyun Cho - Neural MT I & II
• Lab 1 - Data Preparation: Basic preprocessing; Encoder-Decoder with Theano (without attention)
• Lab 2 - Attention-based Encoder-Decoder with Theano
Neural Networks (NN) – brief introduction
• An NN consists of:
• Multiple layers (an input layer, zero or more hidden layers, and an output layer), each consisting of a set of neurons (x_i, h_j, and y)
• Interconnections between the nodes of different layers, with weights assigned to them (w_ij and w_j)
• Activation functions for each neuron that convert the weighted input of the neuron into an output value
• Deep neural networks – NNs with one or more hidden layers
Activation functions
• Many different types
• Most common functions in NLP – the logistic sigmoid and the hyperbolic tangent (because of their non-linearity); a small sketch of both is shown below
http://blog.sciencenet.cn/blog-457187-878461.html
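As an illustration, here is a minimal NumPy sketch of the two activation functions and of a forward pass through one hidden layer. All names and dimensions are made up for illustration; they are not from the slides.

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_in, W_out):
    # x: input vector; W_in, W_out: weight matrices of the two layers
    h = np.tanh(W_in @ x)    # hidden layer with tanh activation
    y = sigmoid(W_out @ h)   # output layer with sigmoid activation
    return h, y

# toy example: 3 inputs, 4 hidden units, 1 output
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
h, y = forward(np.array([0.5, -1.0, 2.0]), W_in, W_out)
```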
Training
• Neural networks are trained with the backpropagation algorithm
• Aim – reduce the cost/error of the model by iteratively performing a forward pass, calculating the error, and adjusting the weights based on the “direction” (read: derivative) of the error. The error is “backpropagated” from the output layer back to the input layer (a minimal sketch of one such update follows below)
• A long description with a lot of theory: http://neuralnetworksanddeeplearning.com/chap2.html
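A minimal sketch of one backpropagation update for the tiny network sketched above, using a squared-error cost. This is a generic illustration under those assumptions, not the training code used in the labs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_in, W_out, lr=0.1):
    # forward pass
    h = np.tanh(W_in @ x)
    y = sigmoid(W_out @ h)
    # error signal at the output (squared error against the target t)
    delta_out = (y - t) * y * (1 - y)               # derivative at the output layer
    delta_hid = (W_out.T @ delta_out) * (1 - h**2)  # error "backpropagated" to the hidden layer
    # adjust the weights in the direction that reduces the error
    W_out -= lr * np.outer(delta_out, h)
    W_in -= lr * np.outer(delta_hid, x)
    return W_in, W_out
```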
Word Embeddings
• A continuous space representation of words
• Multi-dimensional vectors
• Decimal values
• By-product of neural networks where words are vectorised
• Skip-gram models
• neural network language models
• neural machine translation models
• etc.
Word Embeddings – Skip-Gram model
• Input – a word (a one-of-N, i.e. one-hot, vector)
• Output – the context of the word
• Embeddings – the trained weight matrix W between the input layer and the hidden layer
• Each row represents the embedding of a single word (see the lookup sketch below)
• Implementation: word2vec
http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
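A minimal sketch of why the embedding of a word is simply the corresponding row of the input-to-hidden weight matrix W; the vocabulary size, dimensionality, and index are made up for illustration.

```python
import numpy as np

V, d = 10000, 100                  # vocabulary size, embedding dimensionality
W = np.random.randn(V, d) * 0.01   # input-to-hidden weights learned by the Skip-gram model

word_index = 42                    # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# multiplying the one-hot input by W simply selects row 42 of W,
# i.e. the embedding of that word
embedding = one_hot @ W
assert np.allclose(embedding, W[word_index])
```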
Word embeddings are different
• The distributional representation of words has to be trained for specific tasks (e.g., Skip-gram word embeddings are not good for translation)
• The most similar words (according to cosine similarity) of a given word differ between word embedding models; a cosine-similarity sketch follows below
http://arxiv.org/pdf/1412.6448.pdf
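Cosine similarity, as used above to find the most similar words, is a short computation; a minimal illustrative sketch (the helper names are made up):

```python
import numpy as np

def cosine_similarity(u, v):
    # 1.0 for identical directions, 0.0 for orthogonal vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(query, W, k=5):
    # nearest neighbours of a query embedding within an embedding matrix W (V x d)
    scores = W @ query / (np.linalg.norm(W, axis=1) * np.linalg.norm(query))
    return np.argsort(-scores)[:k]   # indices of the k most similar words
```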
Feedforward Neural Net Language Model (NNLM)
• Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
Feedforward Neural Net Language Model (NNLM)
• The same model, but with a simpler figure (a minimal forward-pass sketch follows below)
/ http://www.cs.cmu.edu/~mfaruqui/talks/nn-clab.pdf /
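A minimal sketch of the feedforward NNLM forward pass (in the spirit of Bengio et al., 2003): embed the n−1 previous words, concatenate the embeddings, apply a hidden layer, and predict the next word with a softmax. All sizes and names are illustrative, and the direct input-to-output connections of the original model are omitted.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nnlm_forward(context_ids, C, H, U):
    # context_ids: indices of the n-1 previous words
    # C: V x d embedding matrix, H: hidden-layer weights, U: output weights (V x hidden)
    x = np.concatenate([C[i] for i in context_ids])  # concatenated context embeddings
    h = np.tanh(H @ x)                               # hidden layer
    return softmax(U @ h)                            # distribution over the next word

V, d, hid, n = 10000, 64, 128, 4
C = np.random.randn(V, d) * 0.01
H = np.random.randn(hid, (n - 1) * d) * 0.01
U = np.random.randn(V, hid) * 0.01
p_next = nnlm_forward([12, 7, 345], C, H, U)   # P(w | three previous words)
```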
What NNLMs are (supposedly) good at
(… and what n-gram models never will be)?
Language Modelling and Machine Translation using Neural Networks
Hermann Ney
http://ej.uz/NNLM
Language Modelling
• Conventional Language Modelling
• Measure the quality of an LM with perplexity (computed as in the sketch below)
• Problem: most of the events are never seen in training data
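Perplexity is the exponentiated average negative log-probability the LM assigns to the test words; a small sketch:

```python
import math

def perplexity(log_probs):
    # log_probs: natural-log probabilities the LM assigns to each test word
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# e.g. a model that assigns probability 0.1 to every word has perplexity 10
print(perplexity([math.log(0.1)] * 1000))  # ≈ 10.0
```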
Language Modelling Using Neural Networks
• non-ANN: count models (Markov chains):
• limited history of predecessor words
• smooth relative frequencies
• feedforward multi-layer perceptron (FF MLP):
• limited history too
• use predecessor words as input to the MLP
• recurrent neural networks (RNN):
• advantage: unlimited history (see the recurrence sketch below)
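The “unlimited history” of an RNN LM comes from its recurrent hidden state, which is updated at every position; a minimal sketch of the recurrence (simple tanh recurrence, made-up names, not the rwthlm implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnlm_step(word_id, h_prev, E, W, V_out):
    # E: input embeddings, W: recurrent weights, V_out: output weights
    h = np.tanh(E[word_id] + W @ h_prev)   # hidden state summarises the *whole* history
    return h, softmax(V_out @ h)           # distribution over the next word

def sentence_logprob(word_ids, E, W, V_out):
    # scoring a sentence: the state h is threaded through all positions
    h = np.zeros(W.shape[0])
    logp = 0.0
    for prev, nxt in zip(word_ids[:-1], word_ids[1:]):
        h, p = rnnlm_step(prev, h, E, W, V_out)
        logp += np.log(p[nxt])
    return logp
```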
Recurrent Neural Net Language Model (RNNLM)
Experiment Results with JRC-ACQUIS EN-LV
Approach | Perplexity | CPU time | Size
5-gram count language model | 48.0376 | 5 minutes + 2 minutes (binarize) | 1118 MB
4-gram Feedforward Neural Net Language Model with 2 layers (1000 word classes; batch size 64; learning rate 5e-3; 200 nodes per input word, and a subsequent layer of 200 nodes with a sigmoid activation function) | 126.9841 | 1 week | 43 MB
Practicalities of ANN LM Training
(Implementation and Software)
• no regularization, no momentum term, no drop-out (so far!)
• no pre-training (so far!)
• vocabulary reduction: remove singletons, or keep most frequent words
• random initialization of weights: Gaussian of mean 0, variance 0.01
• training criterion: cross-entropy (perplexity)
• stopping: cross-validation, perplexity on a development text
• initial learning rate: typically between 1e−3 and 1e−2
• learning rate schedule: halved when the dev perplexity is worse than the best of the previous epochs (see the sketch below)
• use of mini-batches: 4 to 64
• low level implementation in C++
• GPUs (typically) not used for the results presented in this talk
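A sketch of the learning-rate schedule and stopping setup described above: halve the rate whenever the dev perplexity fails to improve. The epoch and perplexity helpers are hypothetical stubs, not part of any named toolkit.

```python
def train(model, train_data, dev_data, lr=5e-3, max_epochs=30):
    best_ppl = float("inf")
    for epoch in range(max_epochs):
        run_one_epoch(model, train_data, lr)    # hypothetical helper: one pass over the data with mini-batches
        ppl = dev_perplexity(model, dev_data)   # hypothetical helper: perplexity on the development text
        if ppl < best_ppl:
            best_ppl = ppl                      # improved: keep the current learning rate
        else:
            lr *= 0.5                           # dev perplexity got worse: halve the learning rate
        print(f"epoch {epoch}: dev ppl {ppl:.2f}, lr {lr:g}")
    return model
```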
Language Modelling Using Neural Networks lab
Exercise Sheet: http://ej.uz/NNLMlab
https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
• Data preparation
• Training on small data
• Tuning of hyper-parameters
• Modifying the network architecture
Conventional SMT
Translation Model based on FF MLP
Joint Language and Translation Model based on Feedforward MLP
Experiment Results
Syntax-based Multi-System Hybrid Translator + NNLM on JRC-ACQUIS EN-LV
Approach | BLEU
MHyT with 5-gram count language model | 22.69
SyMHyT with 5-gram count language model | 24.72
SyMHyT with 4-gram Feedforward Neural Net Language Model with 2 layers | 23.71
Machine Translation using Neural Networks lab
• Exercise Sheet: http://ej.uz/NNMTlab
• Part 1: N-best Reranking using Neural Network Language Models
• Obtain the new 1-best hypotheses
• Measure the translation quality
• Do reranking and compare to the results obtained before reranking.
• Part 2: Neural Network Translation Models
• Train a unidirectional translation model
• Train a unidirectional joint model
• Train a bidirectional translation model
• Train a bidirectional joint model
• Try to obtain better perplexity values by changing the batch size and learning rate
• Part 3: N-best Reranking using Neural Network Translation Models
• Apply rescoring using each of the unidirectional and bidirectional translation and joint models (a minimal rescoring sketch follows after this list)
• Optimize the model weights with MERT to achieve the best BLEU score on the dev dataset
• Evaluate the translation hypotheses for each of the rescoring experiments
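The core of the reranking in Parts 1 and 3 is a weighted (log-linear) combination of the baseline feature scores and the new neural-model scores, with the weights tuned by MERT on the dev set; the new 1-best is the hypothesis with the highest combined score. A minimal illustrative sketch with made-up feature names and values:

```python
def rerank(nbest, weights):
    """nbest: list of (hypothesis, feature_dict) for one source sentence.
    weights: feature name -> weight (e.g. tuned with MERT on the dev set)."""
    def total(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    # the new 1-best is the hypothesis with the highest combined score
    return max(nbest, key=lambda hyp_feats: total(hyp_feats[1]))[0]

# example: an RNN LM (or NN translation model) score added as an extra feature
nbest = [("hyp A", {"tm": -4.1, "lm": -10.2, "rnnlm": -9.0}),
         ("hyp B", {"tm": -4.5, "lm": -11.0, "rnnlm": -7.5})]
print(rerank(nbest, {"tm": 1.0, "lm": 0.5, "rnnlm": 0.8}))   # -> "hyp B"
```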
Neural Machine Translation
• Encoder-decoder model
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
Neural Machine Translation
• Encoder-decoder model
• (a) – encoder
• (b) – decoder
(a minimal sketch of the encoder and decoder steps follows below)
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
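A minimal sketch of the plain (attention-less) encoder-decoder idea from the figure: the encoder compresses the source sentence into a single vector, which then conditions every decoder step. Simple tanh recurrences stand in for the GRUs used in the lab code; all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_ids, E_src, W_enc):
    h = np.zeros(W_enc.shape[0])
    for i in src_ids:                       # read the source sentence left to right
        h = np.tanh(E_src[i] + W_enc @ h)
    return h                                # (a) fixed-length summary of the whole sentence

def decode_step(prev_id, s_prev, context, E_trg, W_dec, C_dec, V_out):
    # (b) every target word is predicted from the previous word,
    # the decoder state, and the single context vector
    s = np.tanh(E_trg[prev_id] + W_dec @ s_prev + C_dec @ context)
    return s, softmax(V_out @ s)            # distribution over the next target word
```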
Neural Machine Translation
• Bi-directional recurrent neural networks for attention-based models
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Neural Machine Translation
• Attention-based models
• Due to its sequential nature, a recurrent neural network tends to remember recent symbols better
• The attention mechanism allows the model to focus, at each time step, on the relevant symbols by selecting the appropriate vectors that summarise the input sentence (a minimal sketch follows below)
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
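A sketch of the attention mechanism described above: at every decoder step, score each encoder state against the current decoder state, normalise the scores with a softmax, and use them to build a step-specific context vector (additive, Bahdanau-style scoring; the weight names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, enc_states, W_a, U_a, v_a):
    # enc_states: one (bidirectional) encoder vector per source position, shape (T, d)
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in enc_states])
    alpha = softmax(scores)       # how much to "focus" on each source symbol at this step
    context = alpha @ enc_states  # weighted sum = summary relevant to this time step
    return context, alpha         # alpha is also what the word-alignment plots visualise
```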
Attention-based encoder-decoder NMT
• English-Latvian
• Vocabulary – 30K
• Embedding dimensions – 400
• Hidden layer dimensions – 1,024
• Batch size - 14
• Training corpus – DGT-TM (2,401,815 unique sentences)
• Trained on an NVIDIA GeForce GTX 960 (2GB) GPU
• Training time - ~4 days and 2 hours (training crashed due to an out-of-memory exception)
• But ... luckily it saves models iteratively
• During training it uses ~40GB of virtual memory
• Translation time for the 512-sentence test set (note that this also includes the model loading time)
• NMT: 19 minutes and 3 seconds (translation with CPU on 6 cores)
• LetsMT: 1 minute and 39 seconds (translation with CPU on 1 core)
https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2
Attention-based encoder-decoder NMT
• Comparison with LetsMT (LetsMT – 13.93 BLEU, NMT – 12.42 BLEU)
• Translations are more fluent (even if not always correct «according to the reference»)
Attention-based encoder-decoder NMT
• Unknown words are a problem
Attention-based encoder-decoder NMT
• Sometimes the context around unknown words is surprisingly good
Attention-based encoder-decoder NMT
• Sometimes the NMT creates a translation that is (probably) just as good as the reference
Attention-based encoder-decoder NMT
• However, sometimes the translation is also bad (total nonsense)
Attention-based encoder-decoder NMT
• Named entities that are listed with commas can cause issues
Attention-based encoder-decoder NMT
• Word alignments
• What is different from Giza++?
• LetsMT translation: šīs paradigma pamatelements ir jaunā struktūra informācijas apstrādes sistēmu .
Attention-based encoder-decoder NMT
• Model trained with a 50K vocabulary and a batch size of 12
• After 300,000 updates (or 3,600,000 observed sentences)
• I.e., the model is not yet fully trained
• Results:
• LetsMT – 13.93 BLEU, NMT – 12.48 BLEU (+0.06)
• Not good, but it may improve since the model has not finished training…
• After 520,000 updates (6,240,000 observed sentences)
• LetsMT – 13.93 BLEU, NMT – 11.88 BLEU (-0.54)
Attention-based encoder-decoder NMT
• Lessons learned (from the tiny but long experiments):
• You need a «good» GPU (6GB of GDDR may not be enough for systems with a decent vocabulary size)
• A 2GB card will not allow building models with a vocabulary larger than 30-50K
• The «good» GPUs are expensive (>1K€)
• Only Nvidia GPUs are currently usable (the existing libraries are built/tuned for CUDA); OpenCL is an under-supported alternative
• If training with a GPU takes up to a week, training with a CPU is a no-go
• 30K is miles away from a decent vocabulary
• You need means to handle unknown words
• The translation quality shows positive tendencies, but an experiment with a more decent data set and a larger vocabulary is necessary to make better-justified judgements
TensorFlow
https://www.tensorflow.org/
• Deep Flexibility
• True Portability
• Connect Research and Production
• Auto-Differentiation
• Language Options
• Maximize Performance
Theano
http://deeplearning.net/software/theano/
• Built for Python
• Tight integration with NumPy – use numpy.ndarray in Theano-compiled functions
• Transparent use of a GPU – perform data-intensive calculations up to 140x faster than with a CPU (float32 only)
• Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs (see the sketch below)
• Speed and stability optimizations – get the right answer for log(1+x) even when x is really tiny
• Dynamic C code generation – evaluate expressions faster
• Extensive unit-testing and self-verification – detect and diagnose many types of mistakes
• The EN-LV NMT model was trained using Theano
• Speed comparison of different NN libraries: https://github.com/soumith/convnet-benchmarks
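A tiny Theano example of the symbolic differentiation mentioned above (the classic derivative-of-x² example from the Theano documentation, not code from the EN-LV NMT system):

```python
import theano
import theano.tensor as T

x = T.dscalar('x')            # symbolic scalar
y = x ** 2                    # symbolic expression
gy = T.grad(y, x)             # Theano derives dy/dx symbolically
f = theano.function([x], gy)  # compile the expression (optionally to GPU code)

print(f(4.0))                 # -> 8.0
```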