Paper Review
Attention Is All You Need
(Vaswani et al. 2017) [arXiv pre-print link]
Strong reference: http://nlp.seas.harvard.edu/2018/04/03/attention.html
Santiago Pascual de la Puente
June 07, 2018
TALP UPC, Barcelona
Table of contents
1. Introduction
2. The Transformer
A Myriad of Attentions
Point-Wise Feed Forward Networks
The Transformer Block
3. Interfacing Token Sequences
Embeddings
Positional Encoding
4. Results
5. Conclusions
1/37
Introduction
Introduction
Recurrent neural networks (RNNs) and their cell variants are firmly
established as state of the art in sequence modeling and transduction
(e.g. machine translation).
In transduction we map a sequence X = {x_1, ..., x_T} to another one
Y = {y_1, ..., y_M}, where T and M can be different, x_t ∈ R^{d_e} and y_m ∈ R^{d_d}.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
2/37
Introduction
1. The encoder RNN will encode the source symbols X = {x_1, ..., x_T}
into useful abstractions that mix in contextual content →
H = {h_1, ..., h_T}, where h_t = tanh(W x_t + U h_{t−1} + b).
2. The last encoder state h_T is typically taken as the summary of the
input, and it is injected into the decoder initial state: h_0^d = h_T.
3. The decoder RNN will generate the target sequence one element at a time
(autoregressively) by feeding back its previous prediction y_{m−1} as
input, also conditioned on the encoder summary through h_0^d = h_T.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
3/37
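To make the recurrence concrete, here is a minimal sketch of the three steps above with toy tensors and hypothetical dimensions (a plain tanh RNN cell, not the linked notebook's implementation):

```python
import torch

T, M, d_e, d_h = 5, 4, 8, 16          # source length, target length, input dim, hidden dim
W, U, b = torch.randn(d_h, d_e), torch.randn(d_h, d_h), torch.zeros(d_h)

# Encoder: h_t = tanh(W x_t + U h_{t-1} + b), consuming X = {x_1, ..., x_T}
X = torch.randn(T, d_e)
h = torch.zeros(d_h)
H = []
for t in range(T):
    h = torch.tanh(W @ X[t] + U @ h + b)
    H.append(h)
h_T = H[-1]                            # summary of the whole input

# Decoder: initial state h^d_0 = h_T, feeding back the previous prediction
W_d, U_d, b_d = torch.randn(d_h, d_e), torch.randn(d_h, d_h), torch.zeros(d_h)
h_dec, y_prev = h_T, torch.zeros(d_e)
outputs = []
for m in range(M):
    h_dec = torch.tanh(W_d @ y_prev + U_d @ h_dec + b_d)
    y_prev = torch.randn(d_e)          # stand-in for embedding the previous prediction
    outputs.append(h_dec)
```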
Introduction
Encoding a whole sentence into one vector would be super amazing, but it is
infeasible. In the real world we need a mechanism that gives the decoder
hints on where to look in the encoder, weighting the source vectors instead of
just taking the last one → ATTENTION MECHANISM.
• c_m = Σ_{t=0}^{T−1} α_t^m · h_t
• Each c_m is a row (and an additional input to the decoder), and each α_t^m is an orange square in the attention-matrix figure.
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb
4/37
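A minimal sketch of these context vectors, assuming dot-product scores between hypothetical decoder states and the encoder states (the actual scoring function varies by attention flavor):

```python
import torch

T, M, d_h = 5, 4, 16
H = torch.randn(T, d_h)                # encoder states h_1 .. h_T
S = torch.randn(M, d_h)                # decoder states used as queries (hypothetical)

scores = S @ H.T                       # (M, T): compatibility of each decoder step with each h_t
alpha = torch.softmax(scores, dim=-1)  # alpha[m, t] = attention weight on h_t at decoder step m
C = alpha @ H                          # row m is c_m = sum_t alpha[m, t] * h_t
```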
Introduction
• RNNs factor computation along symbol time positions, generating
h_t out of h_{t−1} → cannot parallelize in training:
h_t = tanh(W x_t + U h_{t−1} + b)
• Attention is used with SOTA transduction RNNs → model
dependencies without regard to their distance in the input or output
sequences.
5/37
Introduction
• Let's get rid of recurrence and rely entirely on attention to draw
global dependencies between input and output.
• The Transformer is born, significantly boosting parallelization and
reaching new SOTA in translation.
6/37
The Transformer
The Transformer
We will have a new encoder-decoder structure, without any recurrence:
only fully connected layers (independent at every time-step) and
self-attention to merge global info in the sequences.
• The encoder will map X = {x_1, ..., x_T} to a sequence of continuous
representations Z = {z_1, ..., z_T}.
• Given Z, the decoder will generate Y = {y_1, ..., y_M}.
• Still auto-regressive! But no recurrent connections at all.
7/37
The Transformer
8/37
Attention Generic Formulation
• An attention function maps a query and a set of key-value pairs to an
output; query, keys, values, and output are all vectors:
o = f(q, k, v)
• The output is computed as a weighted sum of the values.
• The weight assigned to each value is computed by a compatibility
function g of the query with the corresponding key:
o_i = Σ_{t=0}^{T−1} g(q_i, k_t) · v_t
9/37
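A sketch of this generic formulation, with the compatibility function g passed in as an argument (the dot product used below is just one possible choice):

```python
import torch

def attend(q, K, V, g):
    """o = sum_t g(q, k_t) * v_t  (in practice g also folds in a normalization such as a softmax)."""
    weights = torch.stack([g(q, K[t]) for t in range(K.size(0))])   # one scalar weight per key
    return (weights.unsqueeze(-1) * V).sum(dim=0)                   # weighted sum of the values

d_k, d_v, T = 8, 10, 5
q, K, V = torch.randn(d_k), torch.randn(T, d_k), torch.randn(T, d_v)
o = attend(q, K, V, g=lambda q, k: q @ k)                           # output is a d_v-dimensional vector
```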
Scaled Dot-Product Attention
• Input: queries and keys of dimension d_k and values of dimension d_v.
• Compute the dot products of the query with all keys, divide each by
√d_k and apply a Softmax → obtain the weights on the values.
• FAST TRICK: compute the attention on a set of queries simultaneously,
packing them into matrices Q, K, V:
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
11/37
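A minimal implementation of this matrix form (no masking or dropout):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (..., T_q, T_k) compatibility scores
    weights = torch.softmax(scores, dim=-1)             # attention weights over the keys
    return weights @ V                                   # (..., T_q, d_v) weighted sum of values

Q, K, V = torch.randn(4, 8), torch.randn(5, 8), torch.randn(5, 10)
out = scaled_dot_product_attention(Q, K, V)              # shape (4, 10)
```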
The Fault In Our Scale
Wait... why do we scale the output of the matching function between
query and key by √d_k?
12/37
The Fault In Our Scale
The two most commonly used attention functions (to combine q and k):
• Additive: an MLP with one hidden layer, with the two vectors
concatenated at its input.
• Multiplicative: the dot product seen here → MUCH faster and more
space-efficient.
For small values of d_k both behave similarly, but additive attention outperforms
dot-product attention for larger d_k.
Suspicion: for large values of d_k, dot products grow large in magnitude,
pushing the Softmax into regions with extremely small gradients.
Assume the components of q and k are independent random variables with
μ = 0 and σ = 1 ⇒ q · k = Σ_{i=1}^{d_k} q_i · k_i has μ = 0 and σ = √d_k.
We counteract this effect by scaling by 1/√d_k.
13/37
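A quick numeric check of this argument (hypothetical d_k = 512 and random unit-variance vectors): the unscaled dot products have standard deviation ≈ √d_k, and dividing by √d_k brings it back to ≈ 1:

```python
import math
import torch

d_k, n = 512, 10_000
q, k = torch.randn(n, d_k), torch.randn(n, d_k)
dots = (q * k).sum(dim=-1)                       # n sample dot products q · k

print(dots.std().item(), math.sqrt(d_k))         # both ~ 22.6
print((dots / math.sqrt(d_k)).std().item())      # ~ 1.0 after scaling
```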
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information
from different representation subspaces at different positions. With a
single attention head, averaging inhibits this.
14/37
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v}, W^O ∈ R^{h·d_v×d_model}
In this work h = 8 and d_k = d_v = d_model/h = 64.
15/37
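A sketch of multi-head attention following these equations, with the h projections packed into single linear layers (freshly initialized weights, no dropout):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)   # all h query projections packed together
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # the output projection W^O

    def forward(self, Q, K, V, mask=None):
        B, T_q, _ = Q.shape
        T_k = K.size(1)
        # Project, then split into h heads of size d_k = d_v = d_model / h
        def split(x, W, T):
            return W(x).view(B, T, self.h, self.d_k).transpose(1, 2)      # (B, h, T, d_k)
        q, k, v = split(Q, self.W_q, T_q), split(K, self.W_k, T_k), split(V, self.W_v, T_k)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5                # (B, h, T_q, T_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = torch.softmax(scores, dim=-1) @ v                           # (B, h, T_q, d_k)
        out = out.transpose(1, 2).contiguous().view(B, T_q, self.h * self.d_k)
        return self.W_o(out)                                              # Concat(heads) W^O

mha = MultiHeadAttention()
x = torch.randn(2, 7, 512)
y = mha(x, x, x)        # self-attention: queries, keys and values from the same place
```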
Multi-Head Attention
Transformer uses multi-head attention in three different ways:
1. Encoder-decoder attention layers: queries come from previous
decoder layer, and keys and values come from output of the encoder.
Every position in the decoder attends over all positions in the input
sequence. (Same type of attention as classical seq2seq).
2. Encoder contains self-attention layers: all keys, values and queries
come from same place, the previous encoder layer output. Thus
each position in the encoder can attend to all positions in the
encoder’s previous layer.
3. The decoder has the same self-attention mechanism, BUT we must
prevent leftward information flow (it must stay autoregressive).
18/37
Decoder Attention Mask
Prevent leftward information flow inside the scaled dot-product attention
by masking out (setting to −∞) all values in the input of the Softmax
which correspond to "illegal" connections.
19/37
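A minimal sketch of that mask: a lower-triangular matrix applied to the raw scores before the Softmax, so each position only attends to itself and to earlier positions:

```python
import torch

T = 5
mask = torch.tril(torch.ones(T, T))                  # 1 = allowed, 0 = "illegal" (future) position
scores = torch.randn(T, T)                           # raw q·k scores for one head
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)              # row t puts zero weight on positions > t
```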
Point-Wise Feed Forward Networks
Simply an MLP applied to each time position with the same parameters:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
These can be seen as two 1D convolutions with kernel width 1. The
dimensionality of input and output is d_model = 512 and the inner layer has
dimensionality d_ff = 2048.
20/37
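The position-wise FFN is short enough to write out directly (a sketch with the dimensions quoted above):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                     # max(0, x W1 + b1)
    nn.Linear(d_ff, d_model),      # ... W2 + b2
)

x = torch.randn(2, 7, d_model)     # (batch, time, d_model)
y = ffn(x)                         # applied point-wise: same weights at each of the 7 positions
```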
Point-Wise Feed Forward Networks
21/37
The Transformer Block
If we mix a spoonful of Multi-Head Attention, another of Point-Wise FFN,
a pinch of residual connections and a spoonful of Add & LayerNorm ops, we obtain
the Transformer block:
22/37
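Putting the pieces together, a sketch of one encoder-style block (post-norm, using PyTorch's built-in nn.MultiheadAttention; dropout and masking omitted):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x)          # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + a)              # Add & LayerNorm (residual connection)
        x = self.norm2(x + self.ffn(x))    # Add & LayerNorm around the point-wise FFN
        return x

block = TransformerBlock()
x = torch.randn(2, 7, 512)
y = block(x)                               # same shape as the input: (2, 7, 512)
```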
The Transformer Block
We can see how N stacks of these blocks form the whole Transformer
END-TO-END network. Note the extra enc-dec-attention in the
decoder blocks.
23/37
Interfacing Token Sequences
Embeddings
As in seq2seq models, we use learned embeddings to convert input tokens
and output tokens to dense vectors of dimension d_model. There is also (of
course) an output linear transformation from d_model to the number of
classes, followed by a Softmax.
In the Transformer, these 3 matrices are tied (the same parameters apply),
and in the embedding layers the weights are multiplied by √d_model.
24/37
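A sketch of the tied embeddings and the √d_model scaling, with a hypothetical vocabulary size:

```python
import math
import torch
import torch.nn as nn

vocab, d_model = 10000, 512
emb = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab, bias=False)
out_proj.weight = emb.weight                    # weight tying: one matrix shared by input/output

tokens = torch.randint(0, vocab, (2, 7))
x = emb(tokens) * math.sqrt(d_model)            # embeddings scaled by sqrt(d_model)
logits = out_proj(torch.randn(2, 7, d_model))   # pre-softmax scores over the vocabulary
```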
Embeddings
26/37
Positional Encoding
• Are we processing sequences? YES.
• Are we taking care of this fact? NO.
So let’s work it out.
28/37
Positional Encoding
• In order for the model to make use of the order of the sequence, we
must inject some information about the relative or absolute position
of the tokens in the sequence.
• Add positional encodings to the embeddings, summing them up so that
the positional info is merged into the input:
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Where i is the dimension and pos the position (time-step). Each
dimension corresponds to a sinusoid, with wavelengths forming a
geometric progression. The frequency and offset of the wave are different
for each dimension.
29/37
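A direct transcription of these formulas (sketch):

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1) positions
    two_i = torch.arange(0, d_model, 2, dtype=torch.float)        # even dimension indices 2i
    angle = pos / (10000 ** (two_i / d_model))                    # pos / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                # even dims: sin
    pe[:, 1::2] = torch.cos(angle)                                # odd dims: cos
    return pe

pe = positional_encoding(max_len=50, d_model=512)                 # added to the (scaled) embeddings
```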
Positional Encoding
At every time-step we will have a combination of sinusoids telling us
where we are relative to the beginning (through the combination of phases).
Advantage of these codes: generalization to any sequence length at test time
(the cyclic nature of sinusoids, rather than values growing indefinitely).
30/37
Results
Results
31/37
Results
• On the WMT 2014 English-to-German translation task, the big
transformer model (Transformer (big)) outperforms the best
previously reported models (including ensembles) by more than 2.0
BLEU! (new SOTA of 28.4).
• Training took 3.5 days on 8 P100 GPUs. Even their base model
surpasses all previously published models and ensembles, at a
fraction of the training cost of any of the competitive models.
• On the WMT 2014 English-to-French translation task, the big
model achieves a BLEU score of 41.0, outperforming all of the
previously published single models, at less than 1/4 of the training cost.
32/37
Results
Enc Layer2
33/37
Results
Enc Layer6
34/37
Results
Dec Layer2
35/37
Results
Dec-SRC Layer2
36/37
Conclusions
Conclusions
• The Transformer is the first sequence transduction model based
entirely on attention (replacing the recurrent layers most commonly
used in encoder-decoder architectures with multi-headed
self-attention).
• For translation tasks, the Transformer can be trained significantly
faster than architectures based on recurrent or convolutional layers.
• New SOTA on the WMT 2014 English-to-German and WMT 2014
English-to-French translation tasks.
• The code used to train and evaluate the original models is available at
https://github.com/tensorflow/tensor2tensor.
37/37
Thanks!
@santty128
37/37