This is material prepared for our lab seminar on "Transformer", which is the basis of recent NLP x Deep Learning research. I have tried to cite the references accurately, but please point out any errors you find.
2. This Material’s Objective
◼Transformer and its advanced models (e.g., BERT) show
high performance!
◼Experiments with those models are necessary in
NLP×Deep Learning research.
◼First Step (in this slide)
• Learn basic knowledge of Attention
• Understand the architecture of Transformer
◼Next Step (in the future)
• Fine-Tuning for Sentiment Analysis, etc.
• Learn BERT, etc.
※Reference materials are collected in the last slide. I recommend reading them.
※This is written in English because an international student has joined the lab.
3. What is “Transformer”?
◼Paper
• “Attention Is All You Need”[1]
◼Motivation
• Build a model with sufficient representation power for difficult
tasks (the translation task in the paper)
• Train a model efficiently in parallel (RNNs cannot be trained in parallel)
◼Methods and Results
• An architecture with attention mechanisms and without RNNs
• Less time to train
• Achieves a great BLEU score on the translation task
◼Application
• Use the Encoder, which has acquired strong representation power,
for other tasks by fine-tuning.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
5. Positional Encoding
◼Proposed in “End-To-End Memory Network”[1]
◼Motivation
• Add information about the position of the words in the
sentence (Transformer contains neither RNN nor CNN)
d_model: the dim. of word embedding

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position and i is the dimension.
[1] Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
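A minimal NumPy sketch of this encoding (my own illustration, not code from the paper; max_len is a hypothetical parameter and d_model is assumed even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as in the formulas above.

    Returns an array of shape (max_len, d_model): even dimensions use sin,
    odd dimensions use cos.
    """
    pos = np.arange(max_len)[:, np.newaxis]           # (max_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]        # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)   # PE(pos, 2i+1)
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```

Each row of the returned matrix is added to the embedding of the word at that position.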
6. Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V

where
Q ∈ ℝ^(n×d_k): query matrix
K ∈ ℝ^(n×d_k): key matrix
V ∈ ℝ^(n×d_v): value matrix
n: length of sentence
d_k: dim. of queries and keys
d_v: dim. of values
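As a sketch of the formula above (NumPy, with random matrices standing in for real queries, keys, and values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v)

n, d_k, d_v = 5, 64, 64
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```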
7. 2 Types of Attention
• Additive Attention[1]
  Att(H) = softmax(WH + b)
• Dot-Product Attention[2,3]
  Att(Q, K, V) = softmax(QK^T) V
[1] Bahdanau, Dzmitry, et al. "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR, 2015.
[2] Miller, Alexander, et al. “Key-Value Memory Networks for Directly Reading Documents.” EMNLP, 2016.
[3] Daniluk, Michal, et al. “Frustratingly Short Attention Spans in Neural Language Modeling.” ICLR, 2017.
In Transformer, Dot-Product Attention is Used.
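To contrast the two compatibility functions, here is a rough sketch (NumPy; the additive scorer follows the Bahdanau-style feed-forward form, written more explicitly than the simplified softmax(WH + b) above, and all weights are random placeholders, not learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 64
q = np.random.randn(d)            # one query vector
K = np.random.randn(10, d)        # 10 key vectors

# Additive attention: a small feed-forward net scores each (query, key) pair.
W_q, W_k = np.random.randn(d, d), np.random.randn(d, d)
v = np.random.randn(d)
additive_scores = np.tanh(q @ W_q + K @ W_k) @ v      # (10,)

# Dot-product attention: scores are just inner products (one matmul).
dot_scores = K @ q                                     # (10,)

print(softmax(additive_scores).shape, softmax(dot_scores).shape)
```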
8. Why Use Scaled Dot-Product Attention?
◼Dot-Product Attention is faster and more
efficient than Additive Attention.
• Additive Attention uses a feed-forward network as the
compatibility function.
• Dot-Product Attention can be implemented using highly
optimized matrix multiplication code.
◼Use the scaling term 1/√d_k so that Dot-Product
Attention still performs well with large d_k
• Additive Attention outperforms Dot-Product Attention
without scaling for larger values of d_k [1]
[1] Britz, Denny, et al. “Massive Exploration of Neural Machine Translation Architectures." EMNLP, 2017.
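A quick numeric check of this point (my own sketch, not an experiment from the paper): for random vectors with unit-variance components, the dot product q·k spreads out roughly as √d_k, so without scaling the softmax input grows with the dimension and saturates into regions with very small gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    q = rng.standard_normal((1000, d_k))
    k = rng.standard_normal((1000, d_k))
    dots = (q * k).sum(axis=1)              # 1000 sample dot products
    print(d_k, dots.std(), (dots / np.sqrt(d_k)).std())
# Raw dot products spread out as sqrt(d_k) grows;
# dividing by sqrt(d_k) keeps the spread around 1 for every dimension.
```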
9. Source-Target or Self Attention
◼2 types of Dot-Product Attention
• Source-Target Attention
➢Used in the 2nd Multi-Head Attention Layer of Transformer
Decoder Layer
• Self-Attention
➢Used in the Multi-Head Attention Layer of Transformer
Encoder Layer and the 1st one of Transformer Decoder Layer
◼What is the difference?
• Depends on where the query comes from (here K and V come from the Encoder):
➢query from Encoder → Self-Att.
➢query from Decoder → Source-Target Att.
[Figure: K and V come from the Encoder; a query from the Encoder gives Self-Attention, a query from the Decoder gives Source-Target Attention.]
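A minimal sketch of the difference (NumPy; the learned projections W_Q, W_K, W_V that would normally produce Q, K, and V are omitted, and the shapes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_k)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

d = 64
enc_out = np.random.randn(7, d)   # encoder output (source length 7)
dec_in  = np.random.randn(5, d)   # decoder hidden states (target length 5)

# Self-Attention: Q, K, V all come from the same sequence.
self_att = attention(enc_out, enc_out, enc_out)       # (7, d)

# Source-Target Attention: Q from the Decoder, K and V from the Encoder.
src_tgt_att = attention(dec_in, enc_out, enc_out)     # (5, d)

print(self_att.shape, src_tgt_att.shape)
```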
11. Why Multi-Head Attention?
Experiments (Table 3 (a) in [1]) show that the multi-head
attention model outperforms single-head attention.
“Multi-Head Attention allows the model to jointly
attend to information from different representation
subspaces at difference positions.”[1]
Multi-Head Attention can be seen as
an ensemble of attention.
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
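As a rough sketch of how the heads split the representation into subspaces (NumPy, single self-attention case; random matrices stand in for the learned projections W_i^Q, W_i^K, W_i^V, W^O):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Split d_model into h heads, attend in each subspace, then concatenate."""
    n, d_model = X.shape
    d_k = d_model // h
    outputs = []
    for i in range(h):
        # Per-head projections pick out one representation subspace.
        Q = X @ W_Q[i]                       # (n, d_k)
        K = X @ W_K[i]
        V = X @ W_V[i]
        w = softmax(Q @ K.T / np.sqrt(d_k))
        outputs.append(w @ V)                # (n, d_k)
    return np.concatenate(outputs, axis=-1) @ W_O    # (n, d_model)

n, d_model, h = 5, 512, 8
d_k = d_model // h
X = np.random.randn(n, d_model)
W_Q = [np.random.randn(d_model, d_k) for _ in range(h)]
W_K = [np.random.randn(d_model, d_k) for _ in range(h)]
W_V = [np.random.randn(d_model, d_k) for _ in range(h)]
W_O = np.random.randn(d_model, d_model)
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (5, 512)
```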
12. What Multi-Head Attention Learns
◼Learn the importance of the relationship
between words regardless of their distance
• In the figure below, the relationship between
"making" and "difficult" is strong in many attention heads.
Cited from http://deeplearning.hatenablog.com/entry/transformer
13. FFN and Residual Connection
◼Point-wise Feed-Forward Network
FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2
where
d_ff (= 2048): dim. of the inner-layer
◼Residual Connection
LayerNorm(x + Sublayer(x))
⇒Residual Dropout
LayerNorm(x + Dropout(Sublayer(x), droprate))
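A small sketch of this sub-layer pattern (NumPy; the layer normalization and inverted dropout here are simplified stand-ins for the actual implementation, and the weights are random placeholders):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def ffn(x, W1, b1, W2, b2):
    """Point-wise FFN: applied to every position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2    # ReLU(xW1 + b1)W2 + b2

def residual_sublayer(x, sublayer, droprate=0.1):
    """LayerNorm(x + Dropout(Sublayer(x), droprate)) with inverted dropout."""
    out = sublayer(x)
    mask = (np.random.rand(*out.shape) > droprate) / (1 - droprate)
    return layer_norm(x + out * mask)

n, d_model, d_ff = 5, 512, 2048
x = np.random.randn(n, d_model)
W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
print(residual_sublayer(x, lambda h: ffn(h, W1, b1, W2, b2)).shape)  # (5, 512)
```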
14. Many Thanks to the Great Predecessors
◼Summary blogs helped my understanding m(_ _)m
• 論文解説 Attention Is All You Need (Transformer)
➢Commentary including background knowledge necessary for
full understanding
• 論文読み "Attention Is All You Need"
➢Helps understand the flow of data in Transformer
• The Annotated Transformer(harvardnlp)
➢PyTorch implementation and corresponding parts of the paper
are explained simply.
• 作って理解する Transformer / Attention
➢I could not understand from the paper alone how to calculate Q, K, and V
in Dot-Product Attention. This page shows one solution.