This is material prepared for our lab seminar on "Transformer", which is the basis of recent NLP x Deep Learning research. I have tried to cite the references accurately, but please point out any errors you find.
2. This Material’s Objective
◼Transformer and its advanced models (e.g., BERT) show
high performance!
◼Experiments with those models are necessary in
NLP×Deep Learning research.
◼First Step (in this slide)
• Learn basic knowledge of Attention
• Understand the architecture of Transformer
◼Next Step (in the future)
• Fine-Tuning for Sentiment Analysis, etc.
• Learn BERT, etc.
※Reference materials are collected in the last slide. I recommend reading them.
※This is written in English because an international student has joined the lab.
3. What is “Transformer”?
◼Paper
• “Attention Is All You Need”[1]
◼Motivation
• Build a model with sufficient representation power for difficult
tasks (the translation task in the paper)
• Train a model efficiently in parallel (RNNs cannot be trained in parallel)
◼Methods and Results
• An architecture with attention mechanisms and without RNNs
• Less time to train
• Achieves a great BLEU score on the translation task
◼Application
• Use the Encoder, which has acquired strong representation power,
for other tasks by fine-tuning.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
5. Positional Encoding
◼Proposed in “End-To-End Memory Network”[1]
◼Motivation
• Add information about the position of the words in the
sentence (Transformer contains neither RNN nor CNN)
d_model: the dim. of word embedding

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension.
[1] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
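A minimal NumPy sketch of this encoding (my own illustration, not code from the paper; max_len is a hypothetical parameter and d_model is assumed even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as in the formulas above.

    Returns an array of shape (max_len, d_model): even dimensions use sin,
    odd dimensions use cos.
    """
    pos = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]        # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)   # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

Each row of the returned matrix is added to the embedding of the word at that position.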
6. Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where
Q ∈ ℝ^(n×d_k): query matrix
K ∈ ℝ^(n×d_k): key matrix
V ∈ ℝ^(n×d_v): value matrix
n: length of sentence
d_k: dim. of queries and keys
d_v: dim. of values
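As a sketch of the formula above (NumPy, with random matrices standing in for real queries, keys, and values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v)

n, d_k, d_v = 5, 64, 64
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```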
7. 2 Types of Attention
• Additive Attention[1]
  Att(H) = softmax(WH + b)
• Dot-Product Attention[2,3]
  Att(Q, K, V) = softmax(QK^T) V
[1] Bahdanau, Dzmitry, et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR, 2015.
[2] Miller, Alexander, et al. “Key-Value Memory Networks for Directly Reading Documents.” EMNLP, 2016.
[3] Daniluk, Michal, et al. “Frustratingly Short Attention Spans in Neural Language Modeling.” ICLR, 2017.
In Transformer, Dot-Product Attention is Used.
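To contrast the two compatibility functions, here is a rough sketch (NumPy; the additive scorer follows the Bahdanau-style feed-forward form, written more explicitly than the simplified softmax(WH + b) above, and all weights are random placeholders, not learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 64
q = np.random.randn(d)            # one query vector
K = np.random.randn(10, d)        # 10 key vectors

# Additive attention: a small feed-forward net scores each (query, key) pair.
W_q, W_k = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)
additive_scores = np.tanh(q @ W_q + K @ W_k) @ v      # (10,)

# Dot-product attention: scores are just inner products (one matmul).
dot_scores = K @ q                                     # (10,)

print(softmax(additive_scores).shape, softmax(dot_scores).shape)
```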
8. Why Use Scaled Dot-Product Attention?
◼Dot-Product Attention is faster and more
efficient than Additive Attention.
• Additive Attention uses a feed-forward network as the
compatibility function.
• Dot-Product Attention can be implemented using highly
optimized matrix multiplication code.
◼Use the scaling term 1/√d_k so that Dot-Product
Attention still performs well with large d_k
• Additive Attention outperforms Dot-Product Attention
without scaling for larger values of d_k [1]
[1] Britz, Denny, et al. “Massive Exploration of Neural Machine Translation Architectures." EMNLP, 2017.
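A quick numeric check of this point (my own sketch, not an experiment from the paper): for random vectors with unit-variance components, the dot product q·k spreads out roughly as √d_k, so without scaling the softmax input grows with the dimension and saturates into regions with very small gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    dots = (q * k).sum(axis=1)              # 1000 sample dot products
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
# Raw dot products spread out as sqrt(d_k) grows;
# dividing by sqrt(d_k) keeps the spread around 1 for every dimension.
```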
9. Source-Target or Self Attention
◼2 types of Dot-Product Attention
• Source-Target Attention
➢Used in the 2nd Multi-Head Attention Layer of Transformer
Decoder Layer
• Self-Attention
➢Used in the Multi-Head Attention Layer of Transformer
Encoder Layer and the 1st one of Transformer Decoder Layer
◼What is the difference?
• Depends on where the query comes from (here K and V come from the Encoder):
➢query from Encoder → Self-Att.
➢query from Decoder → Source-Target Att.
[Figure: K and V come from the Encoder; a query from the Encoder gives Self-Attention, a query from the Decoder gives Source-Target Attention.]
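A minimal sketch of the difference (NumPy; the learned projections W_Q, W_K, W_V that would normally produce Q, K, and V are omitted, and the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d = 64
enc_out = np.random.randn(7, d)   # encoder output (source length 7)
dec_in  = np.random.randn(5, d)   # decoder hidden states (target length 5)

# Self-Attention: Q, K, V all come from the same sequence.
self_att = attention(enc_out, enc_out, enc_out)       # (7, d)

# Source-Target Attention: Q from the Decoder, K and V from the Encoder.
src_tgt_att = attention(dec_in, enc_out, enc_out)     # (5, d)

print(self_att.shape, src_tgt_att.shape)
```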
11. Why Multi-Head Attention?
Experiments (Table 3 (a) in [1]) show that the multi-head
attention model outperforms single-head attention.
“Multi-Head Attention allows the model to jointly
attend to information from different representation
subspaces at difference positions.”[1]
Multi-Head Attention can be seen as
an ensemble of attention.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
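As a rough sketch of how the heads split the representation into subspaces (NumPy, single self-attention case; random matrices stand in for the learned projections W_i^Q, W_i^K, W_i^V, W^O):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend in each subspace, then concatenate."""
    n, d_model = X.shape
    d_k = d_model // h
    outputs = []
    for i in range(h):
        # Per-head projections pick out one representation subspace.
        Q = X @ W_Q[i]                       # (n, d_k)
        K = X @ W_K[i]
        V = X @ W_V[i]
        w = softmax(Q @ K.T / np.sqrt(d_k))
        outputs.append(w @ V)                # (n, d_k)
    return np.concatenate(outputs, axis=-1) @ W_O    # (n, d_model)

n, d_model, h = 5, 512, 8
d_k = d_model // h
X = np.random.randn(n, d_model)
W_Q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_K = [np.random.randn(d_model, d_k) for _ in range(h)]
W_V = [np.random.randn(d_model, d_k) for _ in range(h)]
W_O = np.random.randn(d_model, d_model)
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (5, 512)
```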
12. What Multi-Head Attention Learns
◼Learn the importance of the relationship
between words regardless of their distance
• In the figure below, the relationship between
"making" and "difficult" is strong in many attention heads.
Cited from http://deeplearning.hatenablog.com/entry/transformer
13. FFN and Residual Connection
◼Point-wise Feed-Forward Network
FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2
where
d_ff (= 2048): dim. of the inner-layer
◼Residual Connection
LayerNorm(x + Sublayer(x))
⇒Residual Dropout
LayerNorm(x + Dropout(Sublayer(x), droprate))
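A small sketch of this sub-layer pattern (NumPy; the layer normalization and inverted dropout here are simplified stand-ins for the actual implementation, and the weights are random placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    """Point-wise FFN: applied to every position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2    # ReLU(xW1 + b1)W2 + b2

def residual_sublayer(x, sublayer, droprate=0.1):
    """LayerNorm(x + Dropout(Sublayer(x), droprate)) with inverted dropout."""
    out = sublayer(x)
    mask = (np.random.rand(*out.shape) > droprate) / (1 - droprate)
    return layer_norm(x + out * mask)

n, d_model, d_ff = 5, 512, 2048
x = np.random.randn(n, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(residual_sublayer(x, lambda h: ffn(h, W1, b1, W2, b2)).shape)  # (5, 512)
```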
14. Many Thanks to the Great Predecessors
◼Summary blogs helped my understanding m(_ _)m
• 論文解説 Attention Is All You Need (Transformer)
➢Commentary including background knowledge necessary for
full understanding
• 論文読み "Attention Is All You Need"
➢Helps understand the flow of data in Transformer
• The Annotated Transformer(harvardnlp)
➢PyTorch implementation and corresponding parts of the paper
are explained simply.
• 作って理解する Transformer / Attention
➢I could not understand from the paper alone how to calculate Q, K, and V
in Dot-Product Attention. This page shows one solution.