Attention Is All You Need
Reading Seminar
Kyoto University, Kashima lab
Daiki Tanaka
Today’s Paper : ‘Attention Is All You Need‘
• Conference : NIPS 2017
• Cited 966 times.
• Authors :
• Ashish Vaswani (Google Brain)
• Noam Shazeer (Google Brain)
• Niki Parmar (Google Research)
• Jakob Uszkoreit (Google Research)
• Llion Jones (Google Research)
• Aidan N. Gomez (University of Toronto)
• Łukasz Kaiser (Google Brain)
• Illia Polosukhin
Background : Natural language processing is an important
application of ML
• There are many tasks that are solved by NLP techniques.
• Sentence classification
• Sentence to sentence
• Sentence to tag …
Background : former techniques are not good at parallelization
• RNNs, LSTMs, and gated recurrent networks have been SOTA in sequence
modeling problems.
• An RNN generates a sequence of hidden states, each a function of the
previous hidden state and the current input.
• This sequential nature prevents parallelization within
training samples, especially when the sequence length is long.
• Attention mechanisms have become an integral part of
recent models, but such attention mechanisms are usually used together with
a recurrent network.
Background – challenge of computational cost
• Reducing sequential computation can be achieved by using CNNs,
which compute hidden states for all positions in parallel.
• But in these methods, the number of operations needed to relate two
input and output positions grows with the distance between
those positions. → It is difficult to learn dependencies between
distant positions.
Summary
• Using attention mechanisms allows us to draw global
dependencies between input and output with a constant
number of operations.
• In this work, they propose the Transformer, which uses neither a
recurrent nor a convolutional architecture, and
reaches state-of-the-art translation quality.
Problem Setting : Sequence to sequence
• Input :
• Sequence of symbol representations (𝑥1, 𝑥2, … , 𝑥 𝑛)
• “This is a pen.”
• Output
• Sequence of symbol representations (𝑦1, 𝑦2, … , 𝑦𝑛)
• 「これ は ペン です。」
Entire Model Architecture
• Left side : Encoder
• Right side : Decoder
• Constituent layers:
• Multi-Head Attention layer
• Position-wise Feed-Forward layer
• Positional Encoding
• (Residual Adding and Normalization layer)
(Figure labels: Self attention in Encoder, Self attention in Decoder, Encoder-Decoder attention)
1. Attentions
What is Attention?
• Attention : (vector, matrix, matrix) → vector
$\mathrm{Attention}(query, Key, Value) = \mathrm{Softmax}(query \cdot Key^{T}) \cdot Value$
What is Attention?
• $\mathrm{Attention}(query, Key, Value) = \mathrm{Softmax}(query \cdot Key^{T}) \cdot Value$
• The softmax term $\mathrm{Softmax}(query \cdot Key^{T})$ is the attention weight.
• Intuition: search for keys that are similar to a query, and
return the corresponding values.
• An attention function can be described as a dictionary-like lookup.
What is Attention?
• Single-query attention : (vector, matrix, matrix) → vector
$\mathrm{Attention}(query, Key, Value) = \mathrm{Softmax}(query \cdot Key^{T}) \cdot Value$
• We can also input multiple queries at one time :
(matrix, matrix, matrix) → matrix
$\mathrm{Attention}(Query, Key, Value) = \mathrm{Softmax}(Query \cdot Key^{T}) \cdot Value$
Scaled Dot-Product Attention
• The two most commonly used attention
functions are additive attention and dot-product
attention.
• Additive attention : $\mathrm{Softmax}(\sigma(W[Q, K] + b))$
• Dot-product attention : $\mathrm{Softmax}(QK^{T})$
• Dot-product attention is much faster and more
space-efficient.
• They compute the attention as:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
where $\sqrt{d_k}$ is a scaling factor that prevents the
softmax function from being pushed into regions where it
has extremely small gradients.
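To make the shapes concrete, here is a minimal NumPy sketch of this formula (the function and variable names are mine, not from the paper):

```python
# A minimal NumPy sketch of scaled dot-product attention (not the authors' code).
# Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_queries, n_keys): similarity of each query to each key
    weights = softmax(scores, axis=-1)           # attention weights sum to 1 over the keys
    return weights @ V                           # weighted sum of values, (n_queries, d_v)

# Example: 3 queries, 5 keys/values, d_k = d_v = 4
Q = np.random.randn(3, 4); K = np.random.randn(5, 4); V = np.random.randn(5, 4)
out = scaled_dot_product_attention(Q, K, V)      # shape (3, 4)
```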
Multi-Head Attention
• Instead of calculating a single dot-product attention,
they calculate multiple attentions. (for example, h = 8)
• They linearly project Q, K, and V h times with
different projections to $d_k$, $d_k$, and $d_v$ dimensions.
(They use $d_k = d_v = d_{model}/h$.)
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h)W^{O}$
$\mathrm{where}\ head_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$
($W_i^{Q}, W_i^{K}, W_i^{V}$, and $W^{O}$ are parameter weight matrices.)
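A minimal NumPy sketch of this computation, reusing the scaled_dot_product_attention function from the sketch above; the random matrices stand in for the learned projections $W_i^{Q}, W_i^{K}, W_i^{V}, W^{O}$:

```python
# A minimal NumPy sketch of multi-head attention (not the authors' code).
import numpy as np

def multi_head_attention(Q, K, V, h=8, d_model=512):
    d_k = d_v = d_model // h
    heads = []
    for _ in range(h):
        # per-head projection matrices (learned parameters in the real model, random here)
        W_q = np.random.randn(d_model, d_k)
        W_k = np.random.randn(d_model, d_k)
        W_v = np.random.randn(d_model, d_v)
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = np.random.randn(h * d_v, d_model)          # output projection
    return np.concatenate(heads, axis=-1) @ W_o      # (n_queries, d_model)
```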
Applications of Multi-Head Attention in model
• Transformer uses multi-head attention in three ways.
1. encoder-decoder attention : The queries(Q) come from
the previous decoder layer, and keys(K) and values(V)
come from the output of the encoder. (traditional
attention mechanisms)
2. Self-attention layers in the encoder : All of the keys(K),
values(V) and queries(Q) come from the same place, in
this case, the output of the previous layer in the
encoder.
3. Self-attention layers in the decoder : K,V,Q come from the
output of the previous layer in the decoder. We need to
prevent leftward information flow in the decoder to
preserve the auto-regressive property.
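As a rough illustration of how these three uses differ only in where Q, K, and V come from, here is a sketch built on the multi_head_attention function above; enc_out, enc_prev, and dec_prev are hypothetical placeholders for real layer activations:

```python
import numpy as np

# Hypothetical activations standing in for real layer outputs (shape: (length, d_model)).
enc_out  = np.random.randn(6, 512)   # encoder output (source length 6)
enc_prev = np.random.randn(6, 512)   # previous encoder layer
dec_prev = np.random.randn(4, 512)   # previous decoder layer (target length 4)

# 1. Encoder-decoder attention: queries from the decoder, keys/values from the encoder output.
ctx = multi_head_attention(Q=dec_prev, K=enc_out, V=enc_out)

# 2. Encoder self-attention: everything comes from the previous encoder layer.
enc_sa = multi_head_attention(Q=enc_prev, K=enc_prev, V=enc_prev)

# 3. Decoder self-attention: everything comes from the previous decoder layer,
#    with future positions masked (see the masking sketch further below).
dec_sa = multi_head_attention(Q=dec_prev, K=dec_prev, V=dec_prev)
```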
Data Flow in Attention (Multi-Head)
(Figure: data flow from the input, through the multi-head attention block, to the output.)
Cited from : https://jalammar.github.io/illustrated-transformer/
Self-Attention in Decoder
• In the decoder, the self-attention layer is only allowed to
attend to earlier positions in the output sequence. This is
done by masking future positions (setting them to -inf)
before the softmax step in the self-attention calculation.
• For example, if the sentence “I play soccer in the morning.” is
given to the decoder, and we want to apply self-attention for
“soccer” (let “soccer” be the query), we can only attend to “I” and
“play”, but cannot attend to “in”, “the”, and “morning”.
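A minimal sketch of this masking, reusing the softmax helper from the attention sketch above (function names are mine):

```python
# A minimal sketch of the decoder's causal mask: position i may only attend to positions <= i.
import numpy as np

def causal_mask(n):
    # future positions (strict upper triangle) get -inf, everything else 0
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = softmax(scores, axis=-1)   # -inf scores become weight 0 after the softmax
    return weights @ V

# In the example above, "soccer" would get zero weight on "in", "the", and "morning".
```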
Why Self-Attention?
1. Total computational complexity per layer
2. The amount of computation that can be parallelized
3. The path length between long-range dependencies
4. (As side benefit, self-attention could yield more interpretable
models.)
2. Feed Forward Layer
Position-wise Feed-Forward Networks
• A feed-forward network is applied to each position
separately and identically.
$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2$
• (Figure: each 512-dimensional position vector $x_1, \ldots, x_7$ (word1 … word7) is passed
through the same network, i.e. $\mathrm{ReLU}(x_i W_1 + b_1)W_2 + b_2$, producing a
512-dimensional output per position.)
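A minimal NumPy sketch of this layer; the paper's base model uses d_model = 512 with an inner dimension of 2048, and the random matrices here stand in for the learned weights:

```python
# A minimal NumPy sketch of the position-wise feed-forward network.
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)

def ffn(x):
    # x: (sequence_length, d_model); the same W1, b1, W2, b2 are applied to every position (row)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU, then the second linear layer

out = ffn(np.random.randn(7, d_model))            # 7 positions -> output shape (7, 512)
```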
3. Positional Encoding
Positional Encoding
• Since their model contains no recurrence and no convolution,
they need to inject some information about the position
of the tokens in the sequence.
• So they add a “positional encoding” to the input embedding.
• Each element of PE is as follows :
$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
• $pos$ is the position of the word, and $i$ is the index of the dimension in the word
embedding.
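A minimal NumPy sketch of this encoding (the function name is mine):

```python
# A minimal NumPy sketch of the sinusoidal positional encoding defined above.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2): one value per sin/cos pair
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# x = token_embeddings + pe[:sequence_length]          # added to the input embeddings
```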
Positional Encoding
(Figure: heatmap of the positional encoding values, ranging from -1.0 to +1.0; rows are the word index in the sequence, columns are the dimension in the embedding vector.)
Cited from :
https://github.com/soskek/attention_is_all_you_need/blob/master/Observe_Position_Encoding.ipynb
Positional Encoding
Cited from : https://jalammar.github.io/illustrated-transformer/
4. FC and Softmax layer
Final FC and softmax layer
Cited from : https://jalammar.github.io/illustrated-transformer/
Using Beam-search in selecting model prediction
• When selecting the model output, we can take the word with the
highest probability at each step and discard the remaining
candidates : greedy decoding.
• Another way to select the model output is beam-search.
• Example next-word probabilities : I (0.1), you (0.2), he (0.1), she (0.6)
Beam-search
• beam-search
• Instead of only predicting the token with the best score,
we keep track of k hypotheses (for example k=5, we refer
to k as the beam size).
• At each new time step, each of these k hypotheses can be
extended by V possible next tokens, giving a total of kV new
hypotheses. Then we keep only the top k hypotheses, and so on.
• The maximum length of the hypotheses to keep is also a parameter.
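A toy sketch of this procedure, under the assumption of a hypothetical next_token_log_probs(prefix) function that returns a log-probability for every vocabulary token given the current prefix:

```python
# A toy beam-search sketch (not the authors' decoder).
import heapq

def beam_search(next_token_log_probs, vocab, eos, k=5, max_len=20):
    beams = [(0.0, [])]                                   # (cumulative log-prob, token list)
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq and seq[-1] == eos:                    # finished hypotheses are carried over as-is
                candidates.append((score, seq))
                continue
            log_probs = next_token_log_probs(seq)         # one score per vocabulary token
            for tok, lp in zip(vocab, log_probs):
                candidates.append((score + lp, seq + [tok]))
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])  # keep the top k of the ~kV expansions
    return beams[0][1]                                    # best-scoring hypothesis
```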
Experiment – sequence to sequence task
• Data
• WMT2014 English-German : 4.5 million sentence pairs
• WMT2014 English-French : 36 million sentences
• Hardware and Schedule
• 8 NVIDIA P100 GPUs
• Base model : 100,000 steps or 12 hours
• Big model : 300,000 steps (3.5 days)
• Optimizer : Adam
• Warm up, and then decrease learning rate.
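The paper's schedule increases the learning rate linearly for the first warmup_steps steps (4,000 in their setup) and then decays it proportionally to the inverse square root of the step number; a minimal sketch:

```python
# Warm-up-then-decay learning-rate schedule used with Adam.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)                                   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# e.g. transformer_lr(4000) is the peak rate; transformer_lr(100000) is much smaller.
```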
Experiment – evaluation metrics
• BLEU (Bilingual Evaluation Understudy )
• An evaluation metric measuring how similar a machine translation (MT)
and the ground truth (GT) translation are.
$BLEU = BP_{BLEU} \cdot \exp\!\left(\sum_{n=1}^{N} \frac{1}{N} \log p_n\right)$ (usually, N = 4)
• $BP_{BLEU}$ : a brevity penalty that is multiplied in when len(MT) < len(GT)
• $p_n = \dfrac{\sum_i (\text{number of } n\text{-grams matched between } MT_i \text{ and } GT_i)}{\sum_i (\text{number of } n\text{-grams in } MT_i)}$
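A simplified single-sentence sketch of this computation (real BLEU is a corpus-level metric; this only illustrates the clipped n-gram precision and the brevity penalty):

```python
# A simplified single-sentence BLEU sketch, not a reference implementation.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(mt, gt, N=4):
    log_p = 0.0
    for n in range(1, N + 1):
        mt_counts, gt_counts = ngrams(mt, n), ngrams(gt, n)
        matched = sum(min(c, gt_counts[g]) for g, c in mt_counts.items())  # clipped n-gram matches
        total = max(sum(mt_counts.values()), 1)
        log_p += math.log(max(matched, 1e-9) / total) / N
    bp = 1.0 if len(mt) > len(gt) else math.exp(1 - len(gt) / len(mt))     # brevity penalty
    return bp * math.exp(log_p)

print(bleu("this is a pen".split(), "this is a pen".split()))  # 1.0 for a perfect match
```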
Experiment - Result
Experiment – Changing the parameters in Transformer
(Table: performance of Transformer variants with different parameters, e.g. the number of heads, for the base model and the big model.)
Experiment – Self-attention visualization
Completing the phrase “making … more difficult”
Experiment – Self-attention visualization
Two different heads have
different attention weights.
Conclusion
• They presented the Transformer, the first sequence
transduction model based entirely on attention, replacing the
recurrent layers most commonly used in encoder-decoder
architectures with multi-headed self-attention.
• The Transformer can be trained significantly faster than
architectures based on recurrent or convolutional layers.
See details at:
• https://jalammar.github.io/illustrated-transformer/
• http://deeplearning.hatenablog.com/entry/transformer
Editor's Notes
  1. By separating the memory into keys and values, the non-trivial transformation between keys and values is said to give higher expressive power.
  2. Same as note 1.
  3. Self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d.
  4. Different colors show different heads.
  5. Different colors show different heads.
  6. Different colors show different heads.