The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Attention Is All You Need
1. Attention Is All You Need
Presenter: Illia Polosukhin, NEAR.ai
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Work performed while at Google
2. ● RNNs have transformed NLP
● State-of-the-art across many tasks
● Translation has been a recent example of a large win
Deep Learning for NLP
https://research.googleblog.com/2017/04/introducing-tf-seq2seq-open-source.html
4. ● Hard to parallelize efficiently
● Backpropagation through the sequence
● Transmitting local and global information through one bottleneck [hidden state]
Problem with RNNs
5. ● Trying to solve the problems with sequence models
● Notable work:
○ Neural GPU
○ ByteNet
○ ConvS2S
● Limited by the size of the convolution
Convolutional Models
Neural Machine Translation in Linear Time, Kalchbrenner et al.
6. ● Removes the bottleneck of the encoder-decoder model
● Provides context for a given decoder step (see the sketch after this slide)
Attention Mechanics
“Neural Machine Translation by Jointly Learning to Align and Translate”, Bahdanau et al.
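A minimal NumPy sketch of what such an attention layer computes for one decoder step, assuming the encoder states and the current decoder state are already given as arrays; dot-product scoring is used here for brevity, whereas Bahdanau et al. use an additive (MLP) score, and the function name is illustrative:

    import numpy as np

    def attention_context(decoder_state, encoder_states):
        """decoder_state: (d,), encoder_states: (n, d) -> context vector of shape (d,)."""
        # Score every encoder state against the current decoder state.
        scores = encoder_states @ decoder_state
        # Softmax over source positions -> attention weights.
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()
        # The context is a weighted sum of encoder states: every source position
        # can contribute directly, so there is no single fixed-size bottleneck.
        return weights @ encoder_states

The key point of the slide: the decoder no longer has to read the whole source sentence out of one hidden vector; it recomputes a fresh context at every step.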
7. ● “Inner Attention based Recurrent Neural Networks for Answer Selection”, ACL 2016, Wang et al.
● “Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention”, 2016, Liu et al.
● “Long Short-Term Memory-Networks for Machine Reading”, EMNLP 2016, Cheng et al.
● “A Decomposable Attention Model for Natural Language Inference”, EMNLP 2016, Parikh et al.
Self/Intra/Inner Attention in Literature
13. ● Positional encoding provides the relative or absolute position of a given token
● Many options for choosing the positional encoding; in this work, sinusoids are used so that for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos) (see the sketch below)
● Alternative: learn positional embeddings
Positional Encoding
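For reference, the fixed encoding in the paper uses sinusoids of geometrically spaced wavelengths: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (the function name and shapes are illustrative, not from the authors' code):

    import numpy as np

    def sinusoidal_positional_encoding(max_len, d_model):
        """Returns a (max_len, d_model) matrix of fixed sinusoidal encodings (d_model even)."""
        pos = np.arange(max_len)[:, None]              # positions 0 .. max_len-1
        two_i = np.arange(0, d_model, 2)[None, :]      # even dimension indices 0, 2, 4, ...
        angles = pos / np.power(10000.0, two_i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
        return pe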
Multiplicative interaction [Hinton]: attention has all of that, with long-range connections and no bottleneck.
Self-attention is a learned pooling with multiplicative interaction.
In all of these works except “A Decomposable Attention Model”, self-attention is used in conjunction with an RNN.
The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values. The scaling by 1/√dk is there because for large dk the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients.
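A minimal NumPy sketch of this scaled dot-product attention, assuming queries, keys, and values are already projected matrices (shapes and the function name are illustrative):

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
        d_k = Q.shape[-1]
        # Dot products of every query with every key, scaled by sqrt(d_k)
        # so large dimensions do not push the softmax into the small-gradient regime.
        scores = Q @ K.T / np.sqrt(d_k)
        # Softmax over the keys gives the weights placed on the values.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ V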
Combines pieces from different parts of the representation subspace.
Multiple attention distributions can focus on different positions, and each head operates on a smaller encoding to keep the computational cost down.
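A rough sketch of multi-head self-attention under the same assumptions, reusing scaled_dot_product_attention from the sketch above; the random matrices stand in for learned projection weights, and the name and defaults are illustrative:

    import numpy as np

    def multi_head_attention(X, num_heads=8, d_model=512, seed=0):
        """Self-attention over X of shape (n, d_model), split into num_heads smaller heads."""
        rng = np.random.default_rng(seed)
        d_head = d_model // num_heads
        heads = []
        for _ in range(num_heads):
            # Random stand-ins for the learned per-head projections W_Q, W_K, W_V.
            W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
            heads.append(scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v))
        # Concatenate the per-head outputs and apply a final learned projection W_O.
        W_o = rng.standard_normal((num_heads * d_head, d_model))
        return np.concatenate(heads, axis=-1) @ W_o

Because each head attends in a d_model/num_heads-dimensional subspace, the total cost is similar to single-head attention with full dimensionality.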
Learned positional embeddings work about as well as the fixed sinusoids.
An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for the word ‘making’. Different colors represent different heads.