4. Schedule
• Kevin Duh - Fundamentals for DL4MT I & II
• Lab 1 - Prep and setup. Compare logistic regression, MLP, and stacked auto-encoders on data
• Lab 2 - word embedding (SGNS), visualization.
• Hermann Ney - Neural LMs and TMs for SMT I & II
• Lab 1 - Rescore n-best lists using RNN LM
• Lab 2 - n-best rescoring using uni/bidirectional translation and joint models
• Kyunghyun Cho - Neural MT I & II
• Lab 1 - Data Preparation: Basic preprocessing; Encoder-Decoder with Theano (without attention)
• Lab 2 - Attention-based Encoder-Decoder with Theano
5. Neural Networks (NN) – brief introduction
• An NN consists of:
• Multiple layers (an input layer, zero or more hidden layers, and an output layer) that consist of sets of neurons (xi, hj, and y)
• Interconnections between nodes of different layers that have weights assigned (wij and wj)
• Activation functions for each neuron that convert the weighted input of the neuron into an output value
• Deep neural networks – NNs with one or more hidden layers
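• For illustration, a minimal NumPy sketch of the forward pass of such a network (layer sizes, weights, and the sigmoid activation are illustrative assumptions, not taken from the slides):

    # Forward pass of a tiny feedforward network: input -> hidden -> output
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.RandomState(0)
    x = np.array([0.5, -1.2, 3.0])        # input neurons x_i
    W1 = rng.normal(0, 0.01, (3, 4))      # weights w_ij between input and hidden layer
    W2 = rng.normal(0, 0.01, (4, 1))      # weights w_j between hidden and output layer

    h = sigmoid(x @ W1)                   # hidden neurons h_j
    y = sigmoid(h @ W2)                   # output neuron y
    print(y)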
6. Activation functions
• Many different types
• Most common functions in NLP – logistic sigmoid and hyperbolic tangent (because of their non-linearity)
http://blog.sciencenet.cn/blog-457187-878461.html
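• A small sketch of the two activations mentioned above (not from the slides), showing their non-linear squashing behaviour:

    # Logistic sigmoid maps to (0, 1); hyperbolic tangent maps to (-1, 1)
    import numpy as np

    def logistic_sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-4, 4, 9)
    print(logistic_sigmoid(z))
    print(np.tanh(z))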
7. Training
• Neural networks are trained with the backpropagation algorithm
• Aim – reduce the cost/error of the model by iteratively performing a forward pass, calculating the error, and adjusting the weights based on the “direction” (read - derivative) of the error. The error is “backpropagated” from the output layer back to the input layer
• A long description with a lot of theory: http://neuralnetworksanddeeplearning.com/chap2.html
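• A toy sketch of one such iteration for a single sigmoid neuron (data, target, and learning rate are made up for illustration):

    # One gradient-descent step: forward pass, error, derivative, weight update
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.2, 0.7])    # input
    t = 1.0                     # target output
    w = np.array([0.1, -0.3])   # weights
    lr = 0.5                    # learning rate (illustrative)

    y = sigmoid(w @ x)                  # forward pass
    error = 0.5 * (y - t) ** 2          # cost
    grad = (y - t) * y * (1 - y) * x    # derivative of the cost w.r.t. each weight
    w -= lr * grad                      # adjust weights against the error's "direction"
    print(error, w)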
8. Word Embeddings
• A continuous-space representation of words
• Multi-dimensional vectors
• Real (decimal) values
• A by-product of neural networks in which words are vectorised
• Skip-gram models
• neural network language models
• neural machine translation models
• etc.
9. Word Embeddings – Skip-Gram model
• Input – a word (a one-of-N, i.e. one-hot, vector)
• Output – the context of the word
• Embeddings – the trained weight matrix W between the input layer and the hidden layer
• Each row represents the embedding of a single word
• Implementation: word2vec
http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
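• A hedged usage example with the gensim implementation of word2vec (the toy corpus, hyper-parameters, and gensim itself are assumptions, not part of the slides; older gensim versions call the vector_size parameter size):

    # Train Skip-Gram (sg=1) embeddings; model.wv[word] is one row of the trained matrix W
    from gensim.models import Word2Vec

    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["cat"])                 # the embedding (one row of W)
    print(model.wv.most_similar("cat"))    # nearest words by cosine similarity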
10. Word embeddings are different
• The distributional representation of words has to be trained for specific tasks (e.g., Skip-gram word embeddings are not good for translation)
• Similar words (according to cosine similarity) of the given words using different word embedding models
http://arxiv.org/pdf/1412.6448.pdf
11. Feedforward Neural Net Language Model (NNLM)
• Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137-1155, 2003.
12. Feedforward Neural Net Language Model (NNLM)
• The same model, but with a simpler figure
/ http://www.cs.cmu.edu/~mfaruqui/talks/nn-clab.pdf /
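• A compact NumPy sketch of the Bengio-style feedforward NNLM (forward pass only; all sizes and weights are illustrative assumptions):

    # Embed the n-1 predecessor words, concatenate, tanh hidden layer, softmax over the vocabulary
    import numpy as np

    V, d, h, context = 1000, 30, 60, 3           # vocab size, embedding dim, hidden dim, history length
    rng = np.random.RandomState(0)
    C = rng.normal(0, 0.01, (V, d))              # embedding matrix (one row per word)
    H = rng.normal(0, 0.01, (context * d, h))    # projection -> hidden weights
    U = rng.normal(0, 0.01, (h, V))              # hidden -> output weights

    history = [12, 7, 42]                        # indices of the n-1 predecessor words
    x = np.concatenate([C[w] for w in history])  # concatenated embeddings
    a = np.tanh(x @ H)                           # hidden layer
    scores = a @ U
    p = np.exp(scores - scores.max())
    p /= p.sum()                                 # softmax: P(next word | history)
    print(p.argmax(), p.max())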
13. What NNLMs are (supposedly) good at (… and what n-gram models never will be)?
14. Language Modelling and Machine Translation using Neural Networks
Hermann Ney
http://ej.uz/NNLM
15. Language Modelling
• Conventional Language Modelling
• Measure the quality of an LM with perplexity
• Problem: most of the events are never seen in training data
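• For reference, perplexity can be computed directly from the probabilities the LM assigns to the words of a test text (a small sketch with made-up numbers):

    # Perplexity = exp of the average negative log-probability per word
    import math

    word_probs = [0.1, 0.05, 0.2, 0.01, 0.3]   # P(word | history) for each test word (made up)
    ppl = math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
    print(ppl)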
16. Language Modelling Using Neural Networks
• non-ANN: count models (Markov chain):
• limited history of predecessor words
• smooth relative frequencies
• feedforward multi-layer perceptron (FF MLP):
• limited history too
• use predecessor words as input to MLP
• recurrent neural networks (RNN):
• advantage: unlimited history
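• The “unlimited history” of an RNN comes from its recurrent hidden state, updated word by word (a minimal sketch with made-up sizes, not any particular toolkit's code):

    # One recurrence step per word: the hidden state carries the whole history so far
    import numpy as np

    V, h = 1000, 64
    rng = np.random.RandomState(0)
    W_in = rng.normal(0, 0.01, (V, h))
    W_rec = rng.normal(0, 0.01, (h, h))

    state = np.zeros(h)
    for word_id in [5, 17, 256, 3]:              # the sentence so far (word indices)
        one_hot = np.zeros(V)
        one_hot[word_id] = 1
        state = np.tanh(one_hot @ W_in + state @ W_rec)   # history is not truncated to n words
    print(state[:5])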
19. Experiment Results with JRC-ACQUIS EN-LV
Approach | Perplexity | CPU time | Size
5-gram count language model | 48.0376 | 5 minutes + 2 minutes (binarize) | 1118MB
4-gram Feedforward Neural Net Language Model with 2 layers (1000 word classes; batch size 64; learning rate 5e-3; 200 nodes per input word, and a subsequent layer of 200 nodes with a sigmoid activation function) | 126.9841 | 1 week | 43MB
21. Practicalities of ANN LM Training (Implementation and Software)
• no regularization, no momentum term, no drop-out (so far!)
• no pre-training (so far!)
• vocabulary reduction: remove singletons, or keep most frequent words
• random initialization of weights: Gaussian of mean 0, variance 0.01
• training criterion: cross-entropy (perplexity)
• stopping: cross-validation, perplexity on a development text
• initial learning rate: typically between 1e-3 and 1e-2
• learning rate: halved when the dev perplexity is worse than the best of previous epochs
• use of mini-batches: 4 to 64
• low level implementation in C++
• GPUs (typically) not used for the results presented in this talk
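• A schematic of the stopping criterion and learning-rate schedule described above; train_one_epoch and dev_perplexity are hypothetical placeholders, not real rwthlm calls:

    # Halve the learning rate whenever dev perplexity is worse than the best previous epoch
    def train(model, train_data, dev_data, lr=5e-3, max_epochs=50):
        best_ppl = float("inf")
        for epoch in range(max_epochs):
            train_one_epoch(model, train_data, lr, batch_size=64)   # placeholder
            ppl = dev_perplexity(model, dev_data)                   # placeholder
            if ppl < best_ppl:
                best_ppl = ppl
            else:
                lr /= 2.0        # halve when dev perplexity stops improving
            if lr < 1e-6:        # illustrative lower bound for stopping
                break
        return model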
22. Language Modelling Using Neural Networks lab
Exercise Sheet: http://ej.uz/NNLMlab
https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
• Data preparation
• Training on small data
• Tuning of hyper-parameters
• Modifying the network architecture
28. Experiment Results
Syntax-based Multi-System Hybrid Translator + NNLM on JRC-ACQUIS EN-LV
Approach | BLEU
MHyT with 5-gram count language model | 22.69
SyMHyT with 5-gram count language model | 24.72
SyMHyT with 4-gram Feedforward Neural Net Language Model with 2 layers | 23.71
29. Machine Translation using Neural Networks lab
• Exercise Sheet: http://ej.uz/NNMTlab
• Part 1: N-best Reranking using Neural Network Language Models
• Obtain the new 1-best hypotheses
• Measure the translation quality
• Do reranking and compare to the results obtained before reranking.
• Part 2: Neural Network Translation Models
• Train a unidirectional translation model
• Train a unidirectional joint model
• Train a bidirectional translation model
• Train a bidirectional joint model
• Try to obtain better perplexity values by changing the batch size and learning rate
• Part 3: N-best Reranking using Neural Network Translation Models
• Apply rescoring using each of the unidirectional and bidirectional translation and joint models
• Optimize the model weights with MERT to achieve the best BLEU score on the dev dataset
• Evaluate the translation hypotheses for each of the rescoring experiments
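• A sketch of the reranking idea behind Parts 1 and 3: linearly combine the decoder features with the neural model scores using tuned weights and re-pick the 1-best (data structures, feature names, and weights are illustrative):

    # Rerank an n-best list: weighted sum of feature scores, keep the highest-scoring hypothesis
    def rerank(nbest, weights):
        # nbest: list of (hypothesis_text, {feature_name: score}) for one source sentence
        def total(entry):
            _, feats = entry
            return sum(weights.get(name, 0.0) * value for name, value in feats.items())
        return max(nbest, key=total)[0]

    nbest = [
        ("translation A", {"tm": -4.2, "count_lm": -7.1, "nn_lm": -6.0}),
        ("translation B", {"tm": -4.5, "count_lm": -6.8, "nn_lm": -5.1}),
    ]
    weights = {"tm": 1.0, "count_lm": 0.5, "nn_lm": 0.8}   # in practice tuned with MERT
    print(rerank(nbest, weights))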
30. Neural Machine Translation
• Encoder-decoder model
https://github.com/nyu-dl/NLP_DL_Lecture_Note/blob/master/lecture_note.pdf
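• A very small NumPy sketch of the encoder-decoder idea without attention: the source sentence is compressed into a single fixed-length vector from which the decoder is unrolled (sizes, weights, and the greedy decoding are illustrative assumptions):

    # Encoder-decoder without attention: the final encoder state summarises the whole source sentence
    import numpy as np

    V_src, V_trg, h = 100, 100, 32
    rng = np.random.RandomState(0)
    E_src = rng.normal(0, 0.01, (V_src, h))   # source word embeddings
    W_enc = rng.normal(0, 0.01, (h, h))       # encoder recurrence
    W_dec = rng.normal(0, 0.01, (h, h))       # decoder recurrence
    W_out = rng.normal(0, 0.01, (h, V_trg))   # decoder output projection

    s = np.zeros(h)
    for w in [3, 14, 9]:                      # source word indices
        s = np.tanh(E_src[w] + s @ W_enc)
    context = s                               # fixed-length summary of the source

    state, output = context, []
    for _ in range(3):                        # greedily emit a few target words
        state = np.tanh(state @ W_dec + context)
        output.append(int((state @ W_out).argmax()))
    print(output)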
34. Neural Machine Translation
• Attention-based models
• Due to its sequential nature, a recurrent neural network tends to remember recent symbols better
• The attention mechanism allows the model to focus at each time step on the relevant symbols by selecting the appropriate vector that summarises the input sentence
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
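• The selection step can be sketched as a softmax-weighted sum over the encoder states (a minimal illustration using dot-product scoring; the attention model in the referenced materials uses a small feedforward network for the scores):

    # Score every encoder state against the decoder state, normalise, take the weighted sum
    import numpy as np

    rng = np.random.RandomState(0)
    encoder_states = rng.normal(size=(6, 32))   # one vector per source position
    decoder_state = rng.normal(size=32)

    scores = encoder_states @ decoder_state     # relevance of each source position
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                      # attention weights (sum to 1)
    context = alphas @ encoder_states           # summary focused on the relevant symbols
    print(alphas.round(2), context.shape)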
35. Attention-based encoder-decoder NMT
• English-Latvian
• Vocabulary – 30K
• Embedding dimensions – 400
• Hidden layer dimensions – 1,024
• Batch size - 14
• Training corpus – DGT-TM (2,401,815 unique sentences)
• Trained on an NVIDIA GeForce GTX 960 (2GB) GPU
• Training time - ~4 days and 2 hours (crashed due to an out-of-memory exception)
• But ... luckily it saves models iteratively
• During training it uses ~40GB of virtual memory
• Translation time of the 512-sentence test set (however, it also includes the model loading time)
• NMT: 19 minutes and 3 seconds (translation with CPU on 6 cores)
• LetsMT: 1 minute and 39 seconds (translation with CPU on 1 core)
https://github.com/kyunghyuncho/dl4mt-material/tree/master/session2
36. Attention-based encoder-decoder NMT
• Comparison with LetsMT (LetsMT – 13.93 BLEU, NMT – 12.42 BLEU)
• Translations are more fluent (even if not always correct «according to the reference»)
43. Attention-based encoder-decoder NMT
• Model trained with a 50K vocabulary and a batch size of 12
• After 300,000 updates (or 3,600,000 observed sentences)
• I.e., the model is not yet fully trained
• Results:
• LetsMT – 13.93 BLEU, NMT – 12.48 BLEU (+0.06)
• Not good, but it may improve since the model has not finished training…
• After 520,000 updates (6,240,000 observed sentences)
• LetsMT – 13.93 BLEU, NMT – 11.88 BLEU (-0.54)
44. Attention-based encoder-decoder NMT
• Lessons learned (from the tiny but long experiments):
• You need to have a «good» GPU (6GB GDDR may not be enough for systems with a decent vocabulary size)
• A 2GB card will not allow building models with a vocabulary that is larger than 30-50K
• The «good» GPUs are expensive (>1K€)
• Only Nvidia GPUs are currently usable (the existing libraries are built/tuned for CUDA); OpenCL is an alternative that is under-supported
• If training with a GPU takes up to a week, training with a CPU is a no-go
• 30K is miles away from a decent vocabulary
• You need to have means to handle unknown words
• Positive tendencies in translation quality are evident, but an experiment with a more decent data set and a larger vocabulary is necessary to make better-justified judgements
47. Theano
http://deeplearning.net/software/theano/
• Built for python
• Tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
• Transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with a CPU (float32 only).
• Efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
• Speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
• Dynamic C code generation – Evaluate expressions faster.
• Extensive unit-testing and self-verification – Detect and diagnose many types of mistake.
• The EN-LV NMT model was trained using Theano
• Speed comparison of different NN libraries: https://github.com/soumith/convnet-benchmarks
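• A tiny example of the symbolic differentiation and code generation features (standard Theano usage; the expression is illustrative):

    # Theano builds a symbolic graph, derives the gradient, and compiles it to fast code
    import theano
    import theano.tensor as T

    x = T.dscalar('x')
    y = x ** 2 + T.log(1 + x)        # symbolic expression
    gy = T.grad(y, x)                # symbolic derivative: 2x + 1/(1+x)
    f = theano.function([x], gy)     # compiled function (optionally runs on a GPU)
    print(f(3.0))                    # 6.25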