Hello, this is the Deep Learning Paper Reading Group. Today's uploaded paper review video covers the paper titled 'Transformer Interpretability Beyond Attention Visualization'.
The Transformer is one of the models we have mentioned most often across our paper review videos. Beyond NLP, it has been used as the SOTA network in a great many areas, including image processing. This paper studies in depth, through a variety of methods, how a Transformer in the image domain reaches its decisions, with a particular focus on the self-attention module!
For today's paper review, 김채현 of the Fundamental Team kindly prepared a detailed walkthrough!
Thank you in advance for your interest!
https://youtu.be/XCED5bd2WT0
1. Fundamental Team
김동현, 김채현, 박종익, 송헌, 양현모, 오대환, 이근배, 이재윤, 조남경
Transformer Interpretability Beyond Attention Visualization
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
2. CONTENTS
1 Introduction
2 Related Work
2.1 Explainability in computer vision
2.2 Explainability for Transformers
3 Method
3.1 Relevance and gradients
3.2 Non-parametric relevance propagation
3.3 Relevance and gradient diffusion
3.4 Obtaining the image relevance map
4 Performance Evaluation
4.1 Setting
4.2 Results
5 Conclusion
3. 1 Introduction Transformer Interpretability Beyond Attention Visualization
Motivation
Transformers are currently the SOTA methods in almost all NLP benchmarks
The power of these methods has led to their adoption in the fields of CV and RS
The importance of Transformer networks necessitates tools for visualizing their decision process
1. Aid in debugging the models
2. Help verify that the models are fair and unbiased
3. Enable downstream tasks
3. Enable downstream tasks
Main building block of Transformer network : self-attention layers
assign pairwise attention value between every two tokens
NLP : token is typically a word
CV : each token can be associated with a patch
5. 1 Introduction Neural Machine Translation by Jointly Learning to Align and Translate
Attention
Visualization
- Using the attention weights, we can see what information each output referred to
- For each output word, we can see which of the input words it weighted more heavily when performing attention
6. 1 Introduction Transformer Interpretability Beyond Attention Visualization
Rollout Method
Attention = relevancy score
A common practice when trying to visualize a Transformer model is to consider these attentions as a relevancy score, usually done for a single attention layer
This follows the line of work that 1. assigns relevancy and 2. propagates it, such that the sum of relevancy is maintained through the layers
Attention is all you need (2017)
Show, attend and tell: Neural image caption generation with visual attention (2015)
End-to-End Object Detection with Transformers (2020)
Quantifying Attention Flow in Transformers (2020)
Rollout reassigns all attention scores by considering the pairwise attentions and assuming that attentions are combined linearly into subsequent contexts
It improves results over using a single attention layer, but because it relies on simplistic assumptions, irrelevant tokens often become highlighted
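To make the rollout procedure concrete, below is a minimal numpy sketch of attention rollout in the spirit of "Quantifying Attention Flow in Transformers"; the head-averaged attention matrices, the toy shapes, and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention_rollout(attentions):
    """Attention rollout: combine the per-layer attention maps by matrix product.

    attentions : list of head-averaged attention matrices, each of shape (s, s)
                 with rows summing to 1 (s = number of tokens, incl. [CLS]).
    """
    s = attentions[0].shape[0]
    rollout = np.eye(s)
    for A in attentions:
        A_hat = A + np.eye(s)                              # identity models the skip connection
        A_hat = A_hat / A_hat.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = A_hat @ rollout                          # combine linearly into later contexts
    return rollout

# toy usage: 4 layers of random row-stochastic attention over 5 tokens
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(5), size=5) for _ in range(4)]
print(attention_rollout(layers)[0])   # row 0 ([CLS]): contribution of every token to [CLS]
```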
7. 1 Introduction Transformer Interpretability Beyond Attention Visualization
Challenges
Transformer networks rely heavily on skip connections and attention operators
(both involve mixing two activation maps, and each leads to unique challenges)
Transformers apply non-linearities other than ReLU → result in both positive and negative features
(because of the non-positive values, skip connections lead, if not carefully handled, to numerical instabilities)
Self-attention layers form a challenge since a naïve propagation through them would not maintain the total amount of relevancy
Contribution
1. Introduce a relevancy propagation rule that is applicable to both positive and negative attributions
2. Present a normalization term for non-parametric layers, such as “add”(e.g. skip-connection) and mat mul
3. Integrate the attention and relevancy score, and combine the integrated result for multiple attention blocks
8. 2 Related Work
Explainability in computer vision
Learning Deep Features for Discriminative Localization
Gradient based method
GradCAM
As an extension of CAM (Class Activation Map), GradCAM uses gradients from the backpropagation process to solve CAM's problem that it cannot be used with a model that has no GAP (Global Average Pooling)
CAM (Class Activation Map)
$L^c_{CAM}(i, j)$ : the CAM for class $c$
$f_k(i, j)$ : the $k$-th feature map
$w^c_k$ : the weight from the $k$-th feature map $f_k(i, j)$ to class $c$

$$L^c_{CAM}(i, j) = \sum_k w^c_k\, f_k(i, j)$$
In a general CNN structure, a fully-connected layer is attached after the last convolutional layer
For the CAM calculation, a GAP layer is inserted before the fully-connected layer
After the GAP layer, a fully-connected layer connected to each class is attached and fine-tuned
GAP : averages all values of each channel (depth) of the input feature map (an extreme dimensionality reduction)
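As a quick illustration of the CAM formula above, here is a minimal numpy sketch; the array shapes and the random stand-ins for $f_k$ and $w^c_k$ are hypothetical.

```python
import numpy as np

# Hypothetical stand-ins for the last-conv feature maps f_k(i, j) and the
# FC weights w^c_k learned after GAP (shapes chosen for illustration only).
K, H, W, num_classes = 512, 7, 7, 1000
features = np.random.rand(K, H, W)            # f_k(i, j)
fc_weights = np.random.rand(num_classes, K)   # w^c_k : weight from feature map k to class c

def class_activation_map(features, fc_weights, c):
    # L^c_CAM(i, j) = sum_k w^c_k * f_k(i, j)
    return np.tensordot(fc_weights[c], features, axes=([0], [0]))   # (H, W)

cam = class_activation_map(features, fc_weights, c=281)
print(cam.shape)   # (7, 7) — upsample to the input resolution to overlay as a heat-map
```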
9. 2 Related Work
Explainability in computer vision
Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization
Gradient based method
GradCAM (Gradient-weighted CAM)
In GradCAM, the weight of each feature map is given by the gradient
To build GradCAM, the gradient flowing into the convolutional layer is used
The gradient is multiplied with the feature map of the convolutional layer, and the value passed through the ReLU activation function is displayed as a heat-map on the input image
$a^c_k$ : the average influence of each element $(i, j)$ of the $k$-th feature map $f_k(i, j)$ on the output score $S^c$ for class $c$

$$L^c_{Grad\text{-}CAM}(i, j) = \mathrm{ReLU}\Big(\sum_k a^c_k\, f_k(i, j)\Big)$$

$$a^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial S^c}{\partial f_k(i, j)}$$
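The same computation can be sketched with autograd; this is a minimal PyTorch illustration of the two Grad-CAM equations above, where the tiny conv-plus-linear "network" is purely a hypothetical stand-in.

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, class_score):
    """feature_maps : (K, H, W) conv activations kept in the autograd graph
       class_score  : scalar S^c, the pre-softmax score for class c"""
    # a^c_k = (1/Z) * sum_{i,j} dS^c / df_k(i, j)
    grads, = torch.autograd.grad(class_score, feature_maps)
    weights = grads.mean(dim=(1, 2))                                  # (K,)
    # L^c_GradCAM(i, j) = ReLU( sum_k a^c_k * f_k(i, j) )
    cam = F.relu((weights[:, None, None] * feature_maps).sum(dim=0))
    return cam / (cam.max() + 1e-8)                                   # normalize for display

# toy stand-in "network": one conv layer + GAP + linear classifier head
x = torch.randn(1, 3, 32, 32)
conv = torch.nn.Conv2d(3, 8, 3, padding=1)
head = torch.nn.Linear(8, 10)
feats = conv(x)[0]                          # (8, 32, 32)
score = head(feats.mean(dim=(1, 2)))[3]     # S^c for class c = 3
print(grad_cam(feats, score).shape)         # (32, 32) heat-map over the feature grid
```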
10. 2 Related Work
Explainability in computer vision
Gradient based method
GradCAM (Gradient-weighted CAM)
Different heat-maps are created depending on which class GradCAM is computed for
The example contains both a dog and a cat, and as shown in (c) and (i), the result differs depending on which class the heat-map is drawn for
Red indicates strong influence, and blue indicates less influence
In (e) and (k), blue corresponds to the evidence of the class
Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization
11. 2 Related Work
Explainability in computer vision
Explaining Decisions of Neural Networks by LRP
LRP (Layer-wise Relevance Propagation)
Understanding why the Neural Network model made such a decision through decomposition
If the input x consists of d dimensions, each of the d features is assumed to have a different influence in driving the final output; this relevance score is calculated and analyzed
Attribution propagation method
$x$ : sample image
$f(x)$ : the prediction for image $x$
$R_i$ : the degree to which each pixel of image $x$ contributes to obtaining the prediction $f(x)$; the relevance score of each dimension
LRP result heat-map
Displays the contribution (relevance score) of each pixel of image x
The model outputs the prediction "Rooster" for x by looking at the top-right region (the rooster's beak and head)
12. 2 Related Work
Explainability in computer vision
Explaining Decisions of Neural Networks by LRP
LRP (Layer-wise Relevance Propagation)
Redistribute the contribution of Relevance Score from output to input in a top-down manner
Basic assumptions of LRP
- Each neuron has a certain relevance
- Contribution is redistributed from the output of each neuron to the input in a top-down manner
- Contribution is preserved when redistributed
It is not clear how to apply LRP directly to the Transformer, and it is not good in terms of computational complexity and performance
Attribution propagation method
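To make the top-down redistribution and the conservation property concrete, here is a minimal numpy sketch of an LRP-style rule for a single linear layer; the epsilon stabilizer and the shapes are illustrative assumptions, not the exact rule used later for Transformers.

```python
import numpy as np

def lrp_linear(x, W, R_out, eps=1e-9):
    """Redistribute the relevance R_out of a linear layer's outputs to its inputs.

    x     : (d_in,)        layer input
    W     : (d_in, d_out)  layer weights
    R_out : (d_out,)       relevance of each output neuron
    """
    z = x @ W                          # total contribution arriving at each output
    z = z + eps * np.sign(z)           # stabilizer against division by ~0
    s = R_out / z                      # relevance per unit of contribution
    return x * (W @ s)                 # each input receives its proportional share

rng = np.random.default_rng(1)
x, W = rng.random(4), rng.random((4, 3))
R_out = np.array([0.2, 0.5, 0.3])
R_in = lrp_linear(x, W, R_out)
print(R_in.sum(), R_out.sum())         # approximately equal: relevance is conserved
```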
13. 2 Related Work
Explainability in Transformers
Transformer Interpretability Beyond Attention Visualization
There are not many contributions that explore the field of visualization for Transformers: most contributions employ the attention scores themselves
This practice ignores most of the attention components, as well as the parts of the network that perform other types of computation
Self-attention head : involves the computation of queries, keys and values
Reducing it only to the obtained attention scores is myopic
Other layers are not even considered
Proposed Method : propagates through all layers from the decision back to input
15. 3 Method
Proposed Method
Transformer Interpretability Beyond Attention Visualization
LRP-based relevance is used to calculate the score of the attention heads in each layer of the Transformer
→ These scores are then integrated through the attention graph
(negative contributions are repeatedly erased; both relevancy and gradient information are considered)
→ Result : a class-specific visualization for self-attention models
1. Relevance and gradients
2. Non parametric relevance propagation
3. Relevance and gradient diffusion
4. Obtaining the image relevance map
16. 3 Method
Relevance and Gradients
Transformer Interpretability Beyond Attention Visualization
Recalling the chain rule, we propagate gradients with respect to the classifier's output $y$ at class $t$, namely $y_t$:

$$\nabla x_j^{(n)} := \frac{\partial y_t}{\partial x_j^{(n)}} = \sum_i \frac{\partial y_t}{\partial x_i^{(n-1)}} \cdot \frac{\partial x_i^{(n-1)}}{\partial x_j^{(n)}}$$

$t$ : the class to visualize, $t \in \{1, \dots, |C|\}$, where $C$ is the set of output classes
$x^{(n)}$ : the input of layer $L^{(n)}$
$n$ : the layer index, $n \in [1 \dots N]$, of a network composed of $N$ layers
$x^{(N)}$ : the input of the network
$x^{(1)}$ : the output of the network
$y_t$ : the classification output for class $t$
$j$ : index into the input $x^{(n)}$
$i$ : index into the input $x^{(n-1)}$

Relevance propagation follows the generic Deep Taylor Decomposition:

$$R_j^{(n)} = \mathcal{G}(X, Y, R^{(n-1)}) = \sum_i X_j \frac{\partial L_i^{(n)}(X, Y)}{\partial X_j} \frac{R_i^{(n-1)}}{L_i^{(n)}(X, Y)}$$

$L^{(n)}(X, Y)$ : the layer operation on the two tensors $X$ and $Y$ (typically, these two tensors are the input feature map and the weights of layer $n$)
$j$ : index into $R^{(n)}$
$i$ : index into $R^{(n-1)}$

This propagation satisfies the conservation rule:

$$\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$$
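The generic propagation rule $\mathcal{G}$ above can be written almost verbatim with autograd. The sketch below is a minimal PyTorch illustration under the assumption of a simple positive linear layer; the layer function and shapes are hypothetical.

```python
import torch

def generic_relevance(layer_fn, X, Y, R_prev):
    """R_j^(n) = sum_i X_j * dL_i(X, Y)/dX_j * R_i^(n-1) / L_i(X, Y),
    computed with autograd instead of a hand-written derivative."""
    X = X.clone().requires_grad_(True)
    L = layer_fn(X, Y)                                     # layer outputs L_i(X, Y)
    # autograd gives grad_j = sum_i dL_i/dX_j * (R_i / L_i) in a single call
    grad, = torch.autograd.grad(L, X, grad_outputs=R_prev / L)
    return X.detach() * grad                               # multiply by X_j element-wise

# toy check of the conservation rule on a positive linear layer
torch.manual_seed(0)
X, Y = torch.rand(4), torch.rand(4, 3)
R_prev = torch.tensor([0.2, 0.5, 0.3])
R = generic_relevance(lambda x, w: x @ w, X, Y, R_prev)
print(R.sum().item(), R_prev.sum().item())                 # ~equal: relevance is conserved
```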
17. 3 Method
Relevance and Gradients
Transformer Interpretability Beyond Attention Visualization
LRP assumes ReLU non-linearity activations, resulting in non-negative feature maps, for which the relevance propagation rule can be defined as follows:

$$R_j^{(n)} = \mathcal{G}(x^+, w^+, R^{(n-1)}) = \sum_i \frac{x_j^+ w_{ji}^+}{\sum_{j'} x_{j'}^+ w_{j'i}^+}\, R_i^{(n-1)}$$

$X = x$ : the input of the layer
$Y = w$ : the weights of the layer
$v^+ = \max(0, v)$
Non-positive values can be omitted because they are set to 0 by ReLU
The notation of the previous relevance equation is here broken down into elements

Non-linearities other than ReLU, such as GELU (Gaussian Error Linear Units), output both positive and negative values
To address this, the LRP propagation above can be modified by constructing the subset of indices $q = \{(i, j) \mid x_j w_{ji} \geq 0\}$, so that only the elements with a positive weighted relevance are considered:

$$R_j^{(n)} = \mathcal{G}_q(x, w, q, R^{(n-1)}) = \sum_{\{i \mid (i,j) \in q\}} \frac{x_j w_{ji}}{\sum_{\{j' \mid (j',i) \in q\}} x_{j'} w_{j'i}}\, R_i^{(n-1)}$$
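Below is a minimal numpy sketch of the modified rule, restricted to the subset $q$ of positively weighted contributions so that inputs with GELU-style negative values can be handled; shapes and the random values are illustrative.

```python
import numpy as np

def relevance_positive_subset(x, W, R_prev, eps=1e-9):
    """R_j = sum over {i | (i, j) in q} of x_j*w_ji / (sum over {j' | (j', i) in q}
    of x_j'*w_j'i) * R_i, with q = {(i, j) | x_j * w_ji >= 0}."""
    contrib = x[:, None] * W                           # contrib[j, i] = x_j * w_ji
    contrib = np.where(contrib >= 0, contrib, 0.0)     # keep only the pairs (i, j) in q
    denom = contrib.sum(axis=0) + eps                  # positive contributions reaching i
    return (contrib / denom) @ R_prev                  # redistribute each R_i to the inputs j

rng = np.random.default_rng(2)
x = rng.standard_normal(4)                             # inputs may be negative (e.g. after GELU)
W = rng.standard_normal((4, 3))
R_prev = np.array([0.2, 0.5, 0.3])
R = relevance_positive_subset(x, W, R_prev)
print(R.sum())   # ≈ R_prev.sum() whenever every output has at least one positive contribution
```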
18. 3 Method Transformer Interpretability Beyond Attention Visualization
Non-parametric relevance propagation
There are two operators in Transformer model that involve mixing of two feature map tensors
1. skip connection
2. matrix multiplications (e.g. in attention modules)
The two operators require the propagation of relevance through both input tensors
Given two tensors u and v, we compute the relevance propagation of these binary operators as follows:
$$R_j^{u^{(n)}} = \mathcal{G}(u, v, R^{(n-1)}), \qquad R_k^{v^{(n)}} = \mathcal{G}(v, u, R^{(n-1)})$$

$R_j^{u^{(n)}}, R_k^{v^{(n)}}$ : the relevances for $u$ and $v$

In the case of addition (the skip connection), the sum of relevance scores is constant, so the conservation rule is preserved:

$$\sum_j R_j^{u^{(n)}} + \sum_k R_k^{v^{(n)}} = \sum_i R_i^{(n-1)}$$

When propagating the relevance of skip connections, however, we encounter numerical instabilities
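For the "add" operator of a skip connection, the generic rule reduces to splitting each output's relevance between $u$ and $v$ in proportion to their contributions. The numpy sketch below (an illustration, not the paper's code) also shows the numerical instability mentioned above when $u + v$ is close to zero.

```python
import numpy as np

def prop_add(u, v, R_prev, eps=1e-9):
    """Apply the generic rule G to L(u, v) = u + v: split the relevance of each
    output element between the two inputs in proportion to their contribution."""
    out = u + v
    denom = out + eps * np.sign(out)
    return u * R_prev / denom, v * R_prev / denom      # R^u, R^v

u = np.array([1.0, -0.9, 0.3])
v = np.array([-0.999, 1.0, 0.7])                       # u + v ≈ 0 in the first element
R_prev = np.array([0.4, 0.4, 0.2])
R_u, R_v = prop_add(u, v, R_prev)
print(R_u + R_v)                    # equals R_prev element-wise: conservation holds
print(R_u[0], R_v[0])               # ±400-ish: individual relevances blow up (instability)
```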
19. 3 Method Transformer Interpretability Beyond Attention Visualization
Non-parametric relevance propagation
To address the lack of conservation in the attention mechanism due to matrix multiplication, and the numerical issues of skip connections, the proposed method applies a normalization to $R_j^{u^{(n)}}$ and $R_k^{v^{(n)}}$ that:

1. maintains the conservation rule : $\sum_j R_j^{u^{(n)}} + \sum_k R_k^{v^{(n)}} = \sum_i R_i^{(n-1)}$
2. bounds the relevance sum of each tensor such that : $0 \leq \sum_j R_j^{u^{(n)}},\ \sum_k R_k^{v^{(n)}} \leq \sum_i R_i^{(n-1)}$

Following the conservation rule ($\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$) and the initial relevance, we obtain $\sum_i R_i^{(n)} = 1$ for each layer $n$
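Below is one possible normalization consistent with the two properties listed above: a hedged sketch of the idea (rescaling each tensor's relevance by its absolute share of the total), not necessarily the exact formula used in the paper.

```python
import numpy as np

def normalize_binary_relevance(R_u, R_v, total, eps=1e-9):
    """Rescale the two relevance tensors so that (1) their sums add up to the
    incoming relevance `total` and (2) each sum lies within [0, total]."""
    share_u = abs(R_u.sum()) / (abs(R_u.sum()) + abs(R_v.sum()) + eps)
    R_u_bar = R_u * (share_u * total) / (R_u.sum() + eps)
    R_v_bar = R_v * ((1.0 - share_u) * total) / (R_v.sum() + eps)
    return R_u_bar, R_v_bar

# unstable skip-connection relevances like those in the previous sketch
R_u = np.array([400.0, -3.6, 0.06])
R_v = np.array([-399.6, 4.0, 0.14])
R_u_bar, R_v_bar = normalize_binary_relevance(R_u, R_v, total=1.0)
print(R_u_bar.sum(), R_v_bar.sum())            # each now lies in [0, 1]
print(R_u_bar.sum() + R_v_bar.sum())           # 1.0: conservation restored
```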
20. 3 Method
Relevance and Gradient Diffusion
Transformer Interpretability Beyond Attention Visualization
Transformer model $M$
input : a sequence of $s$ tokens, each of dimension $d$ (with a special classification token [CLS])
output : a classification probability vector $y$ of length $|C|$ (computed using the classification token)
The self-attention modules operate on a small subspace of dimension $d_h$ of the embedding dimension $d$, such that $h \cdot d_h = d$

$M$ : the Transformer model
$B$ : the number of blocks in the Transformer model
$b \in \{1, \dots, B\}$ : each block $b$ consists of self-attention, skip connections, additional linear layers, and normalization layers
$s$ : the number of tokens received by the Transformer model
$d$ : the dimension of the tokens received by the Transformer model
$h$ : the number of "heads"
$O^{(b)}$ : the output of the attention module of block $b$ (of dimension $h \times s \times d_h$)
$Q^{(b)}, K^{(b)}, V^{(b)}$ : the query, key, and value inputs of block $b$, i.e. the input $x^{(n)}$ projected for self-attention (of dimension $h \times s \times d_h$)
$A^{(b)}$ : the attention map of block $b$ (each row $i$ holds the attention coefficients of every token with respect to token $i$); because softmax is applied, each row of every attention head of $A^{(b)}$ sums to 1 (dimension $h \times s \times s$)
21. 3 Method
Relevance and Gradient Diffusion
Transformer Interpretability Beyond Attention Visualization
In $A^{(b)}$, each head plays a different role; row $i$ of a head's attention map can be read as the pairwise probability distribution over all tokens with respect to the $i$-th token (pixel)
Based on this, the relevance $R^{(n_b)}$ and the gradients $\nabla A^{(b)}$ of each attention map $A^{(b)}$ can be obtained
The final output $C \in \mathbb{R}^{s \times s}$ is then defined by the weighted attention relevance, accumulated across the $B$ blocks:

$$\bar{A}^{(b)} = I + \mathbb{E}_h\big[(\nabla A^{(b)} \odot R^{(n_b)})^+\big], \qquad C = \bar{A}^{(1)} \cdot \bar{A}^{(2)} \cdots \bar{A}^{(B)}$$

$\nabla A^{(b)}$ : the gradients of attention map $A^{(b)}$
$R^{(n_b)}$ : the relevance of attention map $A^{(b)}$, i.e. the relevance of layer $n_b$
$n_b$ : the layer corresponding to the softmax operation in block $b$
$\odot$ : Hadamard product
$\mathbb{E}_h$ : the mean across the "heads" dimension

In order to compute the weighted attention relevance, only the positive values of the gradient-relevance multiplication are considered
To account for the skip connections in the Transformer block, an identity matrix is added to avoid self-inhibition for each token
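A minimal numpy sketch of this accumulation step: per block, the gradients and relevance of the attention map are multiplied, clipped to positive values, averaged over heads, an identity is added, and the per-block matrices are multiplied together. Shapes and the random inputs are illustrative; this is not the authors' implementation.

```python
import numpy as np

def relevance_rollout(grads, relevances):
    """grads, relevances : per-block arrays of shape (h, s, s) for grad(A^(b)) and R^(n_b).
    Returns C = A_bar^(1) · A_bar^(2) · ... · A_bar^(B) with
    A_bar^(b) = I + E_h[(grad(A^(b)) ⊙ R^(n_b))^+]."""
    s = grads[0].shape[-1]
    C = np.eye(s)
    for dA, R in zip(grads, relevances):
        weighted = np.maximum(dA * R, 0.0).mean(axis=0)   # positive values only, mean over heads
        C = C @ (np.eye(s) + weighted)                    # identity accounts for skip connections
    return C

# toy usage: 2 blocks, 3 heads, 5 tokens
rng = np.random.default_rng(3)
grads = [rng.standard_normal((3, 5, 5)) for _ in range(2)]
rels  = [rng.standard_normal((3, 5, 5)) for _ in range(2)]
print(relevance_rollout(grads, rels).shape)   # (5, 5); the [CLS] row yields the final map
```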
22. 3 Method Transformer Interpretability Beyond Attention Visualization
Obtaining the image relevance map
Result matrix of the proposed method 𝐶 : size 𝑠 × 𝑠 (𝑠 : the sequence length of input fed to the Transformer)
Each row corresponds to a relevance map for each token given the other tokens
Since this study deals with classification, only the [CLS] token, which carries the classification information, is considered
So the relevance map is extracted from the row $C_{[CLS]} \in \mathbb{R}^s$ corresponding to the [CLS] token
This row contains a score for the influence of each token on the classification token
* CLS : special classification token
Usually, after passing through the Transformer, the CLS token carries the combined meaning of the token sequence, so it can easily be classified by attaching a classifier
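A small numpy sketch of this final step, assuming token 0 is [CLS] and the remaining tokens are image patches on a 14 × 14 grid (a typical ViT setting; the grid size and the normalization are assumptions for illustration).

```python
import numpy as np

def image_relevance_map(C, patch_grid=(14, 14)):
    """Extract the [CLS] row of the s x s relevance matrix C and reshape the
    patch-token scores into a 2-D map over the image patches."""
    cls_row = C[0]                             # C_[CLS] in R^s : influence of each token on [CLS]
    heatmap = cls_row[1:].reshape(patch_grid)  # drop the [CLS]-to-[CLS] entry, keep the patches
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-9)
    return heatmap                             # upsample (e.g. bilinearly) to overlay on the image

C = np.random.rand(197, 197)                   # toy matrix: 196 patch tokens + [CLS]
print(image_relevance_map(C).shape)            # (14, 14)
```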
24. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Setting
Datasets :
ImageNet (ILSVRC) 2012 (50K images from 1000 classes)
ImageNet-Segmentation (an annotated subset of ImageNet)
Evaluation metrics :
positive and negative perturbation → AUC results (percent)
segmentation performance → pixel accuracy, mAP (mean Average Precision), mIoU (mean Intersection-over-Union)
Compared baselines :
attention-map based → rollout, raw attention
class-specific → GradCAM (Gradient-weighted Class Activation Map)
relevance propagation → LRP (Layer-wise Relevance Propagation)
25. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Qualitative Evaluation
Result : Qualitative Evaluation
Sample results. As can be seen, our method produces more accurate visualization
Class-specific visualization
For each image we present results for two different classes
Visual comparison of the various baselines and the proposed method
The baselines produce inconsistent visualizations, while the proposed method provides clearer and more consistent visualizations
Image with 2 classes
All methods except GradCAM create similar visualizations for each class, while the proposed method provides two different, accurate visualizations
26. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Result : Perturbation Test, Segmentation
Segmentation
An experiment on the ImageNet-Segmentation dataset was conducted using segmentation metrics (pixel accuracy, mAP, and mIoU)
The proposed method surpasses all baselines by a significant margin
Segmentation performance on the ImageNet-segmentation dataset (percent)
Higher is better
Perturbation Tests
AUC results obtained through negative and positive perturbation tests for both the predicted and the target class are shown
The proposed method achieves better results by a large margin in both experiments
(rollout and raw attention produce a fixed visualization for a given input image regardless of class, so they are excluded from the target-class test)
Positive and Negative perturbation AUC results (percents) for the predicted and
target classes, on ImageNet validation set. For positive perturbation lower is better,
and for negative perturbation higher is better
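For reference, below is a minimal sketch of how a perturbation AUC can be computed: pixels (or patches) are removed in order of relevance, the model's accuracy is recorded at each removal fraction, and the area under that curve is reported. The callback standing in for the model and all values are hypothetical.

```python
import numpy as np

def perturbation_auc(relevance, accuracy_fn, steps=10, positive=True):
    """relevance   : flat per-pixel relevance scores
       accuracy_fn : callback(mask) -> model accuracy with the masked pixels removed
       positive    : remove most-relevant first (True) or least-relevant first (False)"""
    order = np.argsort(-relevance if positive else relevance)
    accs = []
    for k in range(steps + 1):
        mask = np.zeros(relevance.shape, dtype=bool)
        mask[order[: int(len(relevance) * k / steps)]] = True   # pixels removed so far
        accs.append(accuracy_fn(mask))
    return np.trapz(accs, dx=1.0 / steps)                       # area under the accuracy curve

# toy stand-in "model" whose accuracy drops with the relevance mass removed
rel = np.random.rand(196)
fake_acc = lambda mask: 1.0 - rel[mask].sum() / rel.sum()
print(perturbation_auc(rel, fake_acc, positive=True))    # positive test: lower is better
print(perturbation_auc(rel, fake_acc, positive=False))   # negative test: higher is better
```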
27. 5 Conclusion
• Suggest a method to interpret and visualize the decision-making process of the Transformer model
• Provide a specific solution for each challenge
: Use a relevance propagation rule that can be used for positive and negative attribution
A normalization term is used for non-parametric layers, such as at skip connections
Integrate attention and relevancy scores, and combine the integrated results across multiple attention blocks
• State-of-the-art results when compared with other Transformer interpretation methods
Transformer Interpretability Beyond Attention Visualization
• Compared with the existing literature on the interpretability of the Transformer, the method addresses:
the use of activation functions that produce non-positive values
the frequent use of skip connections
the challenge of modeling the matrix multiplication that is used in self-attention