Introduction of Transformer
Lab Meeting Material
Yuta Niki
Master 1st
Izumi Lab. UTokyo
This Material’s Objective
◼The Transformer and the advanced models built on it (e.g., BERT) show high performance!
◼Experiments with these models are necessary in NLP×Deep Learning research.
◼First Step (in this slide)
• Learn basic knowledge of Attention
• Understand the architecture of Transformer
◼Next Step (in the future)
• Fine-Tuning for Sentiment Analysis, etc.
• Learn BERT, etc.
※Reference materials are collected on the last slide; you should read them.
※This material is written in English because an international student has joined the Lab.
What is “Transformer”?
◼Paper
• “Attention Is All You Need”[1]
◼Motivation
• Build a model with sufficient representational power for a difficult task (the translation task in the paper)
• Train the model efficiently in parallel (RNNs cannot be trained in parallel)
◼Methods and Results
• An architecture built on attention mechanisms, without any RNN
• Less time to train
• Achieves a high BLEU score on the translation task
◼Application
• Use the Encoder, which has acquired strong representational power, for other tasks via fine-tuning.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
Transformer’s structure
◼Encoder(Left)
• Stack of 6 layers
• self-attention + feed-forward network(FFN)
◼Decoder(Right)
• Stack of 6 layers
• self-attention + Source-Target attention + FFN
◼Components
• Positional Encoding
• Multi-Head Attention
• Position-wise Feed-Forward Network
• Residual Connection
◼Regularization
• Residual Dropout
• Label Smoothing
• Attention Dropout
Positional Encoding
◼Proposed in “End-To-End Memory Network”[1]
◼Motivation
• Add information about the position of the words in the sentence (the Transformer contains no RNN or CNN)
$d_{model}$: the dim. of the word embedding

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $pos$ is the position and $i$ is the dimension.
[1] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
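To make the formulas concrete, here is a minimal NumPy sketch of the positional-encoding matrix (the function and variable names are my own, not from the paper; $d_{model}$ is assumed to be even):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]                # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions:  cos
    return pe

# The encoding is simply added to the word embeddings:
# x = word_embeddings + positional_encoding(seq_len, d_model)
```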
Scaled Dot-Product Attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where
$Q \in \mathbb{R}^{n \times d_k}$: query matrix
$K \in \mathbb{R}^{n \times d_k}$: key matrix
$V \in \mathbb{R}^{n \times d_v}$: value matrix
$n$: length of the sentence
$d_k$: dim. of queries and keys
$d_v$: dim. of values
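A minimal NumPy sketch of this formula (the helper names are my own; the shapes allow the number of queries and of keys/values to differ, with both equal to $n$ in the slide above):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n_q, n_k) compatibility scores
    weights = softmax(scores, axis=-1)           # attention weights per query
    return weights @ V                           # weighted sum of the values

# Example: n = 5 tokens, d_k = d_v = 64
n, d_k, d_v = 5, 64, 64
Q, K, V = np.random.randn(n, d_k), np.random.randn(n, d_k), np.random.randn(n, d_v)
out = scaled_dot_product_attention(Q, K, V)      # (5, 64)
```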
2 Types of Attention
• Additive Attention [1]
$$\mathrm{Att}(H) = \mathrm{softmax}(WH + b)$$
• Dot-Product Attention [2,3]
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}(QK^T)\,V$$
[1] Bahdanau, Dzmitry, et al. “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR, 2015.
[2] Miller, Alexander, et al. “Key-Value Memory Networks for Directly Reading Documents.” EMNLP, 2016.
[3] Daniluk, Michal, et al. “Frustratingly Short Attention Spans in Neural Language Modeling.” ICLR, 2017.
In the Transformer, Dot-Product Attention is used.
Why Use Scaled Dot-Product Attention?
◼Dot-Product Attention is faster and more
efficient than Additive Attention.
• Additive Attention uses a feed-forward network as the compatibility function.
• Dot-Product Attention can be implemented using highly
optimized matrix multiplication code.
◼Use the scaling term $\frac{1}{\sqrt{d_k}}$ so that Dot-Product Attention keeps performing well for large $d_k$ (see the numeric check below)
• Without scaling, Additive Attention outperforms Dot-Product Attention for larger values of $d_k$ [1]
[1] Britz, Denny, et al. “Massive Exploration of Neural Machine Translation Architectures." EMNLP, 2017.
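A quick numeric check of why the scaling matters (a toy demo of my own, not from the paper): with unit-variance random vectors, the unscaled dot products grow like $\sqrt{d_k}$, pushing the softmax toward a near one-hot distribution where gradients become very small.

```python
import numpy as np

rng = np.random.default_rng(0)
softmax = lambda s: np.exp(s - s.max()) / np.exp(s - s.max()).sum()

for d_k in (4, 64, 512):
    q = rng.standard_normal((1, d_k))            # one query
    K = rng.standard_normal((10, d_k))           # ten keys
    raw = (q @ K.T).ravel()                      # unscaled scores
    scaled = raw / np.sqrt(d_k)                  # scaled scores
    print(d_k, softmax(raw).max().round(3), softmax(scaled).max().round(3))
# As d_k grows, the unscaled softmax approaches a one-hot distribution,
# while the scaled version stays comparatively flat.
```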
Source-Target or Self Attention
◼2 types of Dot-Product Attention
• Source-Target Attention
➢Used in the 2nd Multi-Head Attention Layer of Transformer
Decoder Layer
• Self-Attention
➢Used in the Multi-Head Attention Layer of Transformer
Encoder Layer and the 1st one of Transformer Decoder Layer
◼What is the difference?
• It depends on where the query comes from (K and V come from the Encoder).
➢query from the Encoder → Self-Attention
➢query from the Decoder → Source-Target Attention
[Diagram: K and V come from the Encoder; the query comes either from the Encoder (→ Self-Attention) or from the Decoder (→ Source-Target Attention).]
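A short sketch of the difference, reusing the scaled_dot_product_attention helper from the earlier slide (enc_out and dec_state are placeholder activations, not names from the paper):

```python
import numpy as np

enc_out = np.random.randn(7, 64)     # encoder output: 7 source tokens
dec_state = np.random.randn(5, 64)   # decoder states: 5 target tokens

# Self-Attention (e.g., in the Encoder): the query, keys and values all
# come from the Encoder.
self_att = scaled_dot_product_attention(enc_out, enc_out, enc_out)       # (7, 64)

# Source-Target Attention (2nd attention layer of the Decoder):
# the query comes from the Decoder, keys and values from the Encoder.
src_tgt_att = scaled_dot_product_attention(dec_state, enc_out, enc_out)  # (5, 64)
```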
Multi-Head Attention
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathit{head}_1, \ldots, \mathit{head}_h)\,W^O$$
$$\text{where } \mathit{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.

$h$: # of parallel attention layers (heads)
$d_k = d_v = d_{model}/h$

⇒Attention with Dropout
$$\mathrm{Attention}(Q, K, V) = \mathrm{dropout}\!\left(\mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)\right)V$$
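A minimal NumPy sketch of these equations, reusing the scaled_dot_product_attention helper from the earlier slide (dropout is omitted for brevity; all names are my own):

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Q, K, V: (n, d_model); W_q, W_k, W_v: lists of h projection matrices
    of shape (d_model, d_k) / (d_model, d_v); W_o: (h*d_v, d_model)."""
    heads = [scaled_dot_product_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
             for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o    # Concat(head_1..head_h) W^O

# Example with d_model = 512 and h = 8, so d_k = d_v = 64:
n, d_model, h = 5, 512, 8
d_k = d_model // h
X = np.random.randn(n, d_model)
W_q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model)
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # (5, 512)
```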
Why Multi-Head Attention?
The experiments (Table 3 (a) in the paper) show that the multi-head attention model outperforms single-head attention.

“Multi-Head Attention allows the model to jointly attend to information from different representation subspaces at different positions.”[1]

Multi-Head Attention can be seen as an ensemble of attention mechanisms.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
What Multi-Head Attention Learns
◼Learns the importance of the relationships between words regardless of their distance
• In the figure below, the relationship between “making” and “difficult” is strong in many attention heads.
[Figure cited from http://deeplearning.hatenablog.com/entry/transformer]
FFN and Residual Connection
◼Position-wise Feed-Forward Network
$$\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2$$
where
$d_{ff}(=2048)$: dim. of the inner layer
◼Residual Connection
$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$
⇒Residual Dropout
$$\mathrm{LayerNorm}(x + \mathrm{Dropout}(\mathrm{Sublayer}(x), \mathit{droprate}))$$
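A minimal NumPy sketch of the position-wise FFN and the residual sub-layer wrapper (the LayerNorm here is simplified, without learned gain and bias; all names are my own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def residual_sublayer(x, sublayer, drop_rate=0.1, training=False):
    """LayerNorm(x + Dropout(Sublayer(x), drop_rate))."""
    y = sublayer(x)
    if training:                                   # inverted dropout
        mask = (np.random.rand(*y.shape) > drop_rate) / (1.0 - drop_rate)
        y = y * mask
    return layer_norm(x + y)

# d_model = 512 and d_ff = 2048 as in the paper:
d_model, d_ff, n = 512, 2048, 5
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)
x = np.random.randn(n, d_model)
out = residual_sublayer(x, lambda t: ffn(t, W1, b1, W2, b2))   # (5, 512)
```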
Many Thanks to the Great Predecessors
◼Summary blogs helped my understanding m(_ _)m
• 論文解説 Attention Is All You Need (Transformer)
➢Commentary including background knowledge necessary for
full understanding
• 論文読み "Attention Is All You Need"
➢Helps in understanding the flow of data in the Transformer
• The Annotated Transformer(harvardnlp)
➢PyTorch implementation and corresponding parts of the paper
are explained simply.
• 作って理解する Transformer / Attention
➢I could not understand from the paper alone how to calculate $Q$, $K$ and $V$ in Dot-Product Attention; this page shows one solution.