Fundamental Team
김동현, 김채현, 박종익, 송헌, 양현모, 오대환, 이근배, 이재윤, 조남경
Transformer Interpretability Beyond Attention Visualization
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
CONTENTS
1 Introduction
2 Related Work
2.1 Explainability in computer vision
2.2 Explainability for Transformers
3 Method
3.1 Relevance and gradients
3.2 Non-parametric relevance propagation
3.3 Relevance and gradient diffusion
3.4 Obtaining the image relevance map
4 Performance Evaluation
4.1 Setting
4.2 Results
5 Conclusion
1 Introduction Transformer Interpretability Beyond Attention Visualization
Motivation
Transformers are currently the SOTA methods on almost all NLP benchmarks
The power of these methods has led to their adoption in the fields of CV and RS
The importance of Transformer networks necessitates tools for visualizing their decision process:
1. Aid in debugging the models
2. Help verify that the models are fair and unbiased
3. Enable downstream tasks
Main building block of the Transformer network : self-attention layers, which
assign a pairwise attention value between every two tokens
NLP : a token is typically a word
CV : each token can be associated with a patch
1 Introduction Attention is All You Need
Transformer
Self-Attention / Multi-Head Attention :

$SA(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

with $f(K, Q) = QK^T$ computed on the projected inputs ($K = KW^K$, $Q = QW^Q$, $V = VW^V$)

(scaled dot-product attention block diagram : MatMul ($QK^T$) → Scale ($/\sqrt{d_k}$) → Mask (opt.) → Softmax → MatMul with $V$)
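As a concrete reference for the formula above, here is a minimal PyTorch sketch of single-head scaled dot-product attention (the function name and the mask convention are our own assumptions):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k), shape (s, s)
    if mask is not None:                            # the optional mask from the diagram
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)                # each row sums to 1: attention map A
    return attn @ v, attn                           # weighted values, plus A for visualization
```

Multi-head attention runs $h$ copies of this on $h$ projected subspaces of dimension $d_h$ and concatenates the results.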
1 Introduction Neural Machine Translation by Jointly Learning to Align and Translate
Attention
Visualization
- The attention weights show which information each output referred to
- For each output word, we can see which of the input words were weighted more when performing attention
1 Introduction Transformer Interpretability Beyond Attention Visualization
Rollout Method
Attention = relevancy score
A common practice when trying to visualize a Transformer model is to treat the attentions as relevancy scores:
1. follow the line of work that assigns relevancy
2. propagate it such that the sum of relevancy is maintained through the layers
Usually done for a single attention layer:
Attention is All You Need (2017)
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
End-to-End Object Detection with Transformers (2020)
Rollout — Quantifying Attention Flow in Transformers (2020):
reassigns all attention scores by considering the pairwise attentions and assuming that attentions are combined linearly into subsequent contexts
Improves results over using a single attention layer, but because it relies on simplistic assumptions, irrelevant tokens often become highlighted
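A minimal sketch of the rollout recurrence described above; the head-averaging, added identity, and row renormalization follow the rollout paper, while names and shapes are our assumptions:

```python
import torch

def attention_rollout(attn_maps):
    """Attention rollout (Quantifying Attention Flow in Transformers), a sketch.

    attn_maps: per-layer attention maps, each of shape (h, s, s) and already
    softmax-normalized, ordered from the first layer to the last.
    """
    s = attn_maps[0].size(-1)
    rollout = torch.eye(s)
    for a in attn_maps:
        a = a.mean(dim=0)                     # average the heads
        a = a + torch.eye(s)                  # identity models the skip connection
        a = a / a.sum(dim=-1, keepdim=True)   # renormalize rows to sum to 1
        rollout = a @ rollout                 # combine attentions linearly across layers
    return rollout                            # (s, s) token-to-token relevancy
```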
1 Introduction Transformer Interpretability Beyond Attention Visualization
Proposed Method
Challenges
Transformer networks rely heavily on skip connections and the attention operator
(both involve mixing two activation maps, and each leads to unique challenges)
Transformers apply non-linearities other than ReLU → the features contain both positive and negative values
(because of the non-positive values, skip connections, if not carefully handled, lead to numerical instabilities)
Self-attention layers pose a challenge, since a naïve propagation through them would not maintain the total amount of relevancy
↓
Contribution
1. Introduce a relevancy propagation rule that is applicable to both positive and negative attributions
2. Present a normalization term for non-parametric layers, such as "add" (e.g. skip connections) and matrix multiplication
3. Integrate the attention and relevancy scores, and combine the integrated results across multiple attention blocks
2 Related Work
Explainability in computer vision
Learning Deep Features for Discriminative Localization
Gradient based method
GradCAM
As an extension of CAM (Class Activation Map), GradCAM uses the gradient (from the backpropagation process) to solve CAM's problem that it cannot be applied to a model without GAP (Global Average Pooling)
CAM (Class Activation Map)

$L^{CAM}_c(i, j) = \sum_k w^c_k f_k(i, j)$

$L^{CAM}_c(i, j)$ : CAM for class $c$
$f_k(i, j)$ : $k$-th feature map
$w^c_k$ : weight from the $k$-th feature map $f_k(i, j)$ to class $c$

In the general CNN structure, a fully-connected layer is attached after the last convolutional layer
For the CAM calculation, a GAP layer is passed before the fully-connected layer
After the GAP layer, a fully-connected layer connected to each class is attached and fine-tuned
GAP : averages all values of each depth (channel) of the input feature map (extreme dimensionality reduction)
2 Related Work
Explainability in computer vision
Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization
Gradient based method
GradCAM (Gradient-weighted CAM)
In GradCAM, the weight of each feature map is given by the gradient
To build GradCAM, the gradient flowing into the convolutional layer is used:
the gradient-derived weights are multiplied with the feature maps of the convolutional layer, and the result, passed through the ReLU activation function, is expressed as a heat-map on the input image

$L^{Grad\text{-}CAM}_c(i, j) = \mathrm{ReLU}\!\left( \sum_k a^c_k f_k(i, j) \right), \qquad a^c_k = \frac{1}{Z} \sum_i \sum_j \frac{\partial S_c}{\partial f_k(i, j)}$

$a^c_k$ : the average influence of each element $(i, j)$ of the $k$-th feature map $f_k(i, j)$ on the score $S_c$ for output class $c$
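A sketch of the two equations above in PyTorch; the function and argument names are our own, and the activations are assumed to be retained from a forward pass:

```python
import torch
import torch.nn.functional as F

def grad_cam(feature_maps, class_score):
    """Grad-CAM heat-map, a sketch of the equations above.

    feature_maps: (k, H, W) conv activations f_k kept in the autograd graph;
    class_score: the scalar class score S_c.
    """
    grads, = torch.autograd.grad(class_score, feature_maps)  # dS_c / df_k(i, j)
    a = grads.mean(dim=(1, 2))                               # a_k^c: average over (i, j)
    cam = F.relu((a[:, None, None] * feature_maps).sum(0))   # ReLU(sum_k a_k^c f_k)
    return cam.detach()                                      # (H, W) heat-map
```

In practice the map is then upsampled to the input resolution and overlaid on the image.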
2 Related Work
Explainability in computer vision
Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization
Gradient based method
GradCAM (Gradient-weighted CAM)
Different heat-maps are created depending on which class GradCAM is computed for
There are both dogs and cats in the example, but the heat-maps differ, as shown in (c), (i), depending on which class the heat-map is drawn for
Red indicates strong influence, and blue indicates less influence
In (e), (k), blue corresponds to the evidence of the class
2 Related Work
Explainability in computer vision
Explaining Decisions of Neural Networks by LRP
Attribution propagation method
LRP (Layer-wise Relevance Propagation)
Understanding why the neural network model made a certain decision, through decomposition
If the input x consists of d dimensions, it is assumed that each of the d features has a different influence in driving the final output; this relevance score is calculated and analyzed

$x$ : sample image
$f(x)$ : prediction for image $x$
$R_i$ : the degree to which each pixel of image $x$ contributes to obtaining the prediction $f(x)$;
the relevance score of each dimension

LRP result heatmap
Displays the contribution (relevance score) of each pixel of image x
By looking at the top right (the rooster's beak and head), the model outputs the prediction "Rooster" for x
2 Related Work
Explainability in computer vision
Explaining Decisions of Neural Networks by LRP
Attribution propagation method
LRP (Layer-wise Relevance Propagation)
Redistributes the contribution (relevance score) from the output to the input in a top-down manner
Basic assumptions of LRP
- Each neuron has a certain relevance
- Contribution is redistributed from the output of each neuron to its input in a top-down manner
- The contribution is preserved when distributed
It is not clear how to apply LRP directly to Transformers, and it is not good in terms of computational complexity and performance
2 Related Work Transformer Interpretability Beyond Attention Visualization
Explainability in Transformers
There are not many contributions that explore the field of visualization for Transformers:
many contributions employ the attention scores themselves
↓
This practice ignores most of the attention components, as well as the parts of the network that perform other types of computation
A self-attention head involves the computation of queries, keys, and values;
reducing it only to the obtained attention scores is myopic
Other layers are not even considered
Proposed method : propagates through all layers, from the decision back to the input
Q & A
3 Method Transformer Interpretability Beyond Attention Visualization
Proposed Method
LRP-based relevance is used to calculate the score of each attention head in every layer of the Transformer
→ These scores are integrated through the attention graph
(negative contributions are repeatedly erased; both relevancy and gradient information are considered)
→ Result : class-specific visualization for self-attention models
1. Relevance and gradients
2. Non-parametric relevance propagation
3. Relevance and gradient diffusion
4. Obtaining the image relevance map
3 Method Transformer Interpretability Beyond Attention Visualization
Relevance and Gradients
Gradients
Recalling the chain rule, we propagate gradients with respect to the classifier's output $y$ at class $t$, namely $y_t$:

$\nabla x_j^{(n)} := \frac{\partial y_t}{\partial x_j^{(n)}} = \sum_i \frac{\partial y_t}{\partial x_i^{(n-1)}} \frac{\partial x_i^{(n-1)}}{\partial x_j^{(n)}}$

$C$ : number of classification heads (classes)
$t$ : the class to visualize, $t \in \{1 \dots |C|\}$
$x^{(n)}$ : input of layer $L^{(n)}$
$n$ : index of the network's $N$ layers, $n \in [1 \dots N]$
$x^{(N)}$ : input of the network
$x^{(1)}$ : output of the network
$y_t$ : classification result for class $t$
$j$ : index of input $x^{(n)}$
$i$ : index of input $x^{(n-1)}$

Relevance
Relevance propagation follows the generic Deep Taylor Decomposition:

$R_j^{(n)} = G(X, Y, R^{(n-1)}) = \sum_i X_j \frac{\partial L_i^{(n)}(X, Y)}{\partial X_j} \frac{R_i^{(n-1)}}{L_i^{(n)}(X, Y)}$

$L^{(n)}(X, Y)$ : layer operation on two tensors X and Y
(typically, these two tensors X and Y are the input feature map and the weights of layer $n$)
$j$ : index of $R^{(n)}$
$i$ : index of $R^{(n-1)}$

which satisfies the conservation rule

$\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$
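The rule $G$ above can be evaluated for any differentiable layer with autograd; a minimal sketch (the stabilizer `eps` and the function name are our assumptions):

```python
import torch

def propagate_relevance(layer, X, R_prev, eps=1e-9):
    """One generic Deep Taylor Decomposition step, a sketch:
    R_j^(n) = sum_i X_j * (dL_i^(n)(X, Y)/dX_j) * R_i^(n-1) / L_i^(n)(X, Y).
    `layer` is any differentiable module; its weights play the role of Y.
    """
    X = X.clone().requires_grad_(True)
    Z = layer(X)                                          # L^(n)(X, Y)
    stab = eps * torch.where(Z >= 0, torch.ones_like(Z), -torch.ones_like(Z))
    S = (R_prev / (Z + stab)).detach()                    # R_i^(n-1) / L_i^(n)(X, Y)
    (Z * S).sum().backward()                              # X.grad = sum_i (dL_i/dX) * S_i
    return (X * X.grad).detach()                          # element-wise X_j * grad_j
```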
3 Method Transformer Interpretability Beyond Attention Visualization
Relevance and Gradients
LRP assumes ReLU non-linearity activations, resulting in non-negative feature maps, where the relevance propagation rule can be defined as follows:

$R_j^{(n)} = G(x^+, w^+, R^{(n-1)}) = \sum_i \frac{x_j^+ w_{ji}^+}{\sum_{j'} x_{j'}^+ w_{j'i}^+} R_i^{(n-1)}$

$X = x$ : input of the layer
$Y = w$ : weights of the layer
$\max(0, v) = v^+$
Non-positive values can be omitted because they are set to 0 by ReLU
The notation of the previous relevance equation is broken down into elements

Non-linearities other than ReLU, such as GELU (Gaussian Error Linear Units), output both positive and negative values
↓
To address this, the LRP propagation above can be modified by constructing the subset of indices $q = \{(i, j) \mid x_j w_{ji} \ge 0\}$, so that only the elements that have a positive weighted relevance are considered:

$R_j^{(n)} = G^q(x, w, q, R^{(n-1)}) = \sum_{\{i \mid (i,j) \in q\}} \frac{x_j w_{ji}}{\sum_{\{j' \mid (j',i) \in q\}} x_{j'} w_{j'i}} R_i^{(n-1)}$
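A dense-notation sketch of the modified rule $G^q$ for a single linear layer; the function name and shapes are our assumptions:

```python
import torch

def lrp_mixed_sign_linear(x, w, R_prev, eps=1e-9):
    """Relevance propagation with mixed-sign inputs, per the rule above.
    x: (j,) inputs, w: (j, i) weights, R_prev: (i,) relevance from above.
    Only index pairs with x_j * w_ji >= 0 contribute.
    """
    z = x[:, None] * w                        # contributions x_j * w_ji
    z = z.clamp(min=0)                        # keep the subset q = {(i, j) | x_j w_ji >= 0}
    denom = z.sum(dim=0, keepdim=True) + eps  # sum over j' in the same subset
    return (z / denom * R_prev[None, :]).sum(dim=1)   # R_j^(n)
```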
3 Method Transformer Interpretability Beyond Attention Visualization
Non-parametric relevance propagation
There are two operators in the Transformer model that involve mixing two feature-map tensors:
1. skip connections
2. matrix multiplications (e.g. in the attention modules)
Both operators require the propagation of relevance through both input tensors
Given two tensors u and v, we compute the relevance propagation of these binary operators as follows:

$R_j^{u(n)} = G(u, v, R^{(n-1)}), \qquad R_k^{v(n)} = G(v, u, R^{(n-1)})$

$R_j^{u(n)}, R_k^{v(n)}$ : relevance for $u$ and $v$

Since the sum of relevance scores is constant, in the case of addition the conservation rule is preserved:

$\sum_j R_j^{u(n)} + \sum_k R_k^{v(n)} = \sum_i R_i^{(n-1)}$

However, when propagating the relevance of skip connections, we encounter numerical instabilities
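Both cases can be handled with the same autograd trick as the earlier `propagate_relevance` sketch, now differentiating with respect to each input tensor in turn (names and the stabilizer are our assumptions):

```python
import torch

def binary_op_relevance(op, u, v, R_prev, eps=1e-9):
    """Relevance through a two-tensor operator, a sketch:
    R^u = G(u, v, R^(n-1)) and R^v = G(v, u, R^(n-1)).
    `op` is e.g. torch.add (skip connection) or torch.matmul (attention).
    """
    u = u.clone().requires_grad_(True)
    v = v.clone().requires_grad_(True)
    Z = op(u, v)
    stab = eps * torch.where(Z >= 0, torch.ones_like(Z), -torch.ones_like(Z))
    S = (R_prev / (Z + stab)).detach()
    (Z * S).sum().backward()
    return (u * u.grad).detach(), (v * v.grad).detach()
```

Usage: `Ru, Rv = binary_op_relevance(torch.matmul, u, v, R_prev)`.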
3 Method Transformer Interpretability Beyond Attention Visualization
Non-parametric relevance propagation
To address the lack of conservation in the attention mechanism due to matrix multiplication, and the numerical issues of skip connections, the proposed method applies a normalization to $R_j^{u(n)}$ and $R_k^{v(n)}$ that
1. maintains the conservation rule : $\sum_j R_j^{u(n)} + \sum_k R_k^{v(n)} = \sum_i R_i^{(n-1)}$
2. bounds the relevance sum of each tensor such that : $0 \le \sum_j R_j^{u(n)}, \sum_k R_k^{v(n)} \le \sum_i R_i^{(n-1)}$
Following the conservation rule ($\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$) and the initial relevance, we obtain $\sum_i R_i^{(n)} = 1$ for each layer $n$
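A sketch of one normalization satisfying both properties: the incoming relevance sum is split between the two tensors in proportion to $|\sum_j R_j^{u(n)}|$ and $|\sum_k R_k^{v(n)}|$, and each tensor is rescaled to its share (this exact weighting is our reading of the paper, and the names are ours):

```python
def normalize_binary_relevance(Ru, Rv, R_prev_sum, eps=1e-9):
    """Normalize the relevances of two input tensors so that their sums are
    non-negative and together equal R_prev_sum (restoring conservation)."""
    su, sv = Ru.sum(), Rv.sum()
    share_u = su.abs() / (su.abs() + sv.abs() + eps)  # u's fraction of the relevance
    Ru_bar = Ru * share_u * R_prev_sum / (su + eps)   # now sums to share_u * R_prev_sum
    Rv_bar = Rv * (1 - share_u) * R_prev_sum / (sv + eps)
    return Ru_bar, Rv_bar
```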
3 Method Transformer Interpretability Beyond Attention Visualization
Relevance and Gradient Diffusion
Transformer model $M$
input : sequence of $s$ tokens, each of dimension $d$ (with a special classification token [CLS])
output : classification probability vector $y$ of length $C$ (computed using the classification token)
the self-attention modules operate on a small subspace $d_h$ of the embedding dimension $d$, such that $h d_h = d$

$M$ : Transformer model
$B$ : number of blocks in the Transformer model
$b \in \{1, \dots, B\}$ : each block $b$ consists of self-attention, skip connections, additional linear layers, and normalization layers
$s$ : length of the token sequence fed to the Transformer model
$d$ : dimension of the tokens fed to the Transformer model
$h$ : number of "heads"
$O^{(b)}$ : output of the attention module of block $b$ (dimension $h \times s \times d_h$)
$Q^{(b)}, K^{(b)}, V^{(b)}$ : query, key, and value inputs of block $b$, i.e. the projections of the input $x^{(n)}$ used by self-attention (dimension $h \times s \times d_h$)
$A^{(b)}$ : attention map of block $b$ (each row $i$ holds the attention coefficients of all tokens with respect to token $i$);
because softmax is applied, each row of each attention head of $A^{(b)}$ sums to 1 (dimension $h \times s \times s$)
In $A^{(b)}$, each head plays a different role, and row $i$ of a head's attention map can be read as the pairwise probability distribution over the other tokens with respect to the $i$-th token (pixel)
Based on the process above, the relevance $R^{(n_b)}$ and the gradients $\nabla A^{(b)}$ of each attention map $A^{(b)}$ can be obtained
The final output $C \in \mathbb{R}^{s \times s}$ is then defined by the weighted attention relevance:

$\bar{A}^{(b)} = I + \mathbb{E}_h\!\left[\left(\nabla A^{(b)} \odot R^{(n_b)}\right)^+\right], \qquad C = \bar{A}^{(1)} \cdot \bar{A}^{(2)} \cdots \bar{A}^{(B)}$

3 Method Transformer Interpretability Beyond Attention Visualization
Relevance and Gradient Diffusion
$\nabla A^{(b)}$ : gradients of attention map $A^{(b)}$
$R^{(n_b)}$ : relevance of attention map $A^{(b)}$ = relevance of layer $n_b$
$n_b$ : the layer corresponding to the softmax operation of block $b$
$\odot$ : Hadamard product
$\mathbb{E}_h$ : mean across the "heads" dimension
To compute the weighted attention relevance, only the positive values of the gradient-relevance multiplication are considered
To account for the skip connections in the Transformer block, an identity matrix is added to avoid self-inhibition for each token
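A sketch of the final diffusion step defined above; the block ordering convention and names are our assumptions:

```python
import torch

def weighted_attention_relevance(attn_grads, attn_relevances):
    """C = A_bar^(1) @ A_bar^(2) ... @ A_bar^(B), a sketch, with
    A_bar^(b) = I + E_h[(grad(A^(b)) ⊙ R^(n_b))^+].

    attn_grads, attn_relevances: per-block (h, s, s) tensors, ordered from
    the first block to the last.
    """
    s = attn_grads[0].size(-1)
    C = torch.eye(s)
    for grad, rel in zip(attn_grads, attn_relevances):
        a_bar = (grad * rel).clamp(min=0).mean(dim=0)  # E_h[(∇A ⊙ R)^+]
        a_bar = a_bar + torch.eye(s)                   # identity for the skip connection
        C = C @ a_bar                                  # accumulate across blocks
    return C                                           # (s, s)
```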
3 Method Transformer Interpretability Beyond Attention Visualization
Obtaining the image relevance map
The result matrix $C$ of the proposed method has size $s \times s$ ($s$ : the sequence length of the input fed to the Transformer)
Each row corresponds to a relevance map for one token given the other tokens
Since this study concerns classification, only the [CLS] token, which carries the classification decision, is considered
So we extract the relevance map from the row of the [CLS] token, $C_{[CLS]} \in \mathbb{R}^s$
This row holds a score for the influence of each token on the classification token
* CLS : special classification token
After going through the Transformer, the CLS token usually carries a combined representation of the token sequence, which can easily be classified by attaching a classifier
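A sketch of turning $C_{[CLS]}$ into a per-pixel map; the 14×14 patch grid (ViT-B/16 at 224×224 input) and the bilinear upsampling are our assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def image_relevance_map(C, cls_index=0, patch_grid=(14, 14), image_size=(224, 224)):
    """Extract the [CLS] row of C (s, s), drop the [CLS] entry itself,
    reshape the remaining token scores to the patch grid, and upsample."""
    row = C[cls_index]                                            # influence of each token on [CLS]
    patches = torch.cat([row[:cls_index], row[cls_index + 1:]])   # keep only image tokens
    m = patches.reshape(1, 1, *patch_grid)
    m = F.interpolate(m, size=image_size, mode="bilinear", align_corners=False)
    m = (m - m.min()) / (m.max() - m.min() + 1e-9)                # normalize for display
    return m[0, 0]                                                # (H, W) relevance map
```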
Q & A
4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Setting
Datasets :
ImageNet (ILSVRC) 2012 (50K validation images from 1000 classes)
ImageNet-Segmentation (an annotated subset of ImageNet)
Metrics :
positive and negative perturbation → AUC results (percents)
segmentation performance → pixel accuracy, mAP (mean Average Precision), mIoU (mean Intersection-over-Union)
Baselines :
attention-map based → rollout, raw attention
class-specific → GradCAM (Gradient-weighted Class Activation Map)
relevance propagation → LRP (Layer-wise Relevance Propagation)
4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Result : Qualitative Evaluation
Sample results: as can be seen, the proposed method produces more accurate visualizations
Visual comparison of the various baselines and the proposed method
The baselines produce inconsistent visualizations, while the proposed method provides clearer and more consistent visualizations
Class-specific visualization
For each image, results are presented for two different classes
Image with 2 classes
All methods except GradCAM create similar visualizations for each class, while the proposed method provides two different, accurate visualizations
4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization
Result : Perturbation Tests, Segmentation
Segmentation
Results of the experiment on the ImageNet-Segmentation dataset, using the segmentation metrics (pixel accuracy, mAP, and mIoU):
the proposed method surpasses all baselines by a significant margin
Segmentation performance on the ImageNet-Segmentation dataset (percent); higher is better
Perturbation Tests
AUC results obtained through negative and positive perturbation tests, for both the predicted and the target class
The proposed method achieves better results by a large margin in both experiments
(rollout and raw attention generate a fixed visualization for a given input image, so they are excluded from the target-class test)
Positive and negative perturbation AUC results (percents) for the predicted and target classes, on the ImageNet validation set. For positive perturbation lower is better, and for negative perturbation higher is better
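A sketch of the positive perturbation protocol; the model handle, zeroing pixels as "removal", and the step count are our assumptions:

```python
import torch

def positive_perturbation_curve(model, image, relevance, steps=10):
    """Zero out pixels from most to least relevant in growing fractions and
    track the probability of the originally predicted class; the area under
    this curve is the reported AUC (lower is better). The negative test
    removes the least relevant pixels first (higher is better).
    """
    with torch.no_grad():
        pred = model(image.unsqueeze(0)).argmax(-1)               # predicted class
        order = relevance.flatten().argsort(descending=True)      # most relevant first
        probs = []
        for k in range(steps + 1):
            masked = image.clone().flatten(1)                     # (channels, H*W)
            masked[:, order[: order.numel() * k // steps]] = 0    # remove top fraction
            logits = model(masked.view_as(image).unsqueeze(0))
            probs.append(logits.softmax(-1)[0, pred].item())
    return probs                                                  # AUC ~ mean of the curve
```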
5 Conclusion Transformer Interpretability Beyond Attention Visualization
• Suggest a method to interpret and visualize the decision-making process of the Transformer model
• Challenges in the literature on the interpretability of Transformers:
use of non-positive activation functions
frequent use of skip connections
modeling the matrix multiplication used in self-attention
• Provide a specific solution for each challenge:
use a relevancy propagation rule that can handle both positive and negative attributions
use a normalization term for non-parametric layers, such as skip connections
integrate attention and relevancy scores, and combine the integrated results across multiple attention blocks
• State-of-the-art results when compared with other Transformer interpretation methods
Thank you ツ
Weitere ähnliche Inhalte

Was ist angesagt?

PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsJinwon Lee
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Krishnaram Kenthapadi
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...Po-Chuan Chen
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSangwoo Mo
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language ProcessingYunyao Li
 
強化学習の実適用に向けた課題と工夫
強化学習の実適用に向けた課題と工夫強化学習の実適用に向けた課題と工夫
強化学習の実適用に向けた課題と工夫Masahiro Yasumoto
 
[DL輪読会]GLIDE: Guided Language to Image Diffusion for Generation and Editing
[DL輪読会]GLIDE: Guided Language to Image Diffusion  for Generation and Editing[DL輪読会]GLIDE: Guided Language to Image Diffusion  for Generation and Editing
[DL輪読会]GLIDE: Guided Language to Image Diffusion for Generation and EditingDeep Learning JP
 
Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)Krishnaram Kenthapadi
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22GiacomoBalloccu
 
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language SupervisionDeep Learning JP
 

Was ist angesagt? (20)

PR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual RepresentationsPR-231: A Simple Framework for Contrastive Learning of Visual Representations
PR-231: A Simple Framework for Contrastive Learning of Visual Representations
 
Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)Explainable AI in Industry (KDD 2019 Tutorial)
Explainable AI in Industry (KDD 2019 Tutorial)
 
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attent...
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
CNN Quantization
CNN QuantizationCNN Quantization
CNN Quantization
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Self-supervised Learning Lecture Note
Self-supervised Learning Lecture NoteSelf-supervised Learning Lecture Note
Self-supervised Learning Lecture Note
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
Explainability for Natural Language Processing
Explainability for Natural Language ProcessingExplainability for Natural Language Processing
Explainability for Natural Language Processing
 
強化学習の実適用に向けた課題と工夫
強化学習の実適用に向けた課題と工夫強化学習の実適用に向けた課題と工夫
強化学習の実適用に向けた課題と工夫
 
Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
[DL輪読会]GLIDE: Guided Language to Image Diffusion for Generation and Editing
[DL輪読会]GLIDE: Guided Language to Image Diffusion  for Generation and Editing[DL輪読会]GLIDE: Guided Language to Image Diffusion  for Generation and Editing
[DL輪読会]GLIDE: Guided Language to Image Diffusion for Generation and Editing
 
The Origin of Grad-CAM
The Origin of Grad-CAMThe Origin of Grad-CAM
The Origin of Grad-CAM
 
Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)Explainable AI in Industry (WWW 2020 Tutorial)
Explainable AI in Industry (WWW 2020 Tutorial)
 
Randomized smoothing
Randomized smoothingRandomized smoothing
Randomized smoothing
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22
Hands on Explainable Recommender Systems with Knowledge Graphs @ RecSys22
 
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
[DL輪読会]Learning Transferable Visual Models From Natural Language Supervision
 

Ähnlich wie 220206 transformer interpretability beyond attention visualization

Introduction to Grad-CAM (short version)
Introduction to Grad-CAM (short version)Introduction to Grad-CAM (short version)
Introduction to Grad-CAM (short version)Hsing-chuan Hsieh
 
Reading_0413_var_Transformers.pptx
Reading_0413_var_Transformers.pptxReading_0413_var_Transformers.pptx
Reading_0413_var_Transformers.pptxcongtran88
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learningJunaid Bhat
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Florent Renucci
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical EquationsIRJET Journal
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Hsing-chuan Hsieh
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...CSCJournals
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...Waqas Tariq
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcscpconf
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructioncsandit
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONcsandit
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
4213ijaia05
4213ijaia054213ijaia05
4213ijaia05ijaia
 
240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptxthanhdowork
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptxPrabhuSelvaraj15
 
2021 03-02-transformer interpretability
2021 03-02-transformer interpretability2021 03-02-transformer interpretability
2021 03-02-transformer interpretabilityJAEMINJEONG5
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)RCCSRENKEI
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
 
Static Analysis of Computer programs
Static Analysis of Computer programs Static Analysis of Computer programs
Static Analysis of Computer programs Arvind Devaraj
 

Ähnlich wie 220206 transformer interpretability beyond attention visualization (20)

Introduction to Grad-CAM (short version)
Introduction to Grad-CAM (short version)Introduction to Grad-CAM (short version)
Introduction to Grad-CAM (short version)
 
Reading_0413_var_Transformers.pptx
Reading_0413_var_Transformers.pptxReading_0413_var_Transformers.pptx
Reading_0413_var_Transformers.pptx
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
Scene understanding
Scene understandingScene understanding
Scene understanding
 
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
Manifold Blurring Mean Shift algorithms for manifold denoising, report, 2012
 
Recognition of Handwritten Mathematical Equations
Recognition of  Handwritten Mathematical EquationsRecognition of  Handwritten Mathematical Equations
Recognition of Handwritten Mathematical Equations
 
Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)Introduction to Grad-CAM (complete version)
Introduction to Grad-CAM (complete version)
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
 
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
The Positive Effects of Fuzzy C-Means Clustering on Supervised Learning Class...
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Median based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstructionMedian based parallel steering kernel regression for image reconstruction
Median based parallel steering kernel regression for image reconstruction
 
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTIONMEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
MEDIAN BASED PARALLEL STEERING KERNEL REGRESSION FOR IMAGE RECONSTRUCTION
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
4213ijaia05
4213ijaia054213ijaia05
4213ijaia05
 
240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx240318_JW_labseminar[Attention Is All You Need].pptx
240318_JW_labseminar[Attention Is All You Need].pptx
 
Training DNN Models - II.pptx
Training DNN Models - II.pptxTraining DNN Models - II.pptx
Training DNN Models - II.pptx
 
2021 03-02-transformer interpretability
2021 03-02-transformer interpretability2021 03-02-transformer interpretability
2021 03-02-transformer interpretability
 
第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)第13回 配信講義 計算科学技術特論A(2021)
第13回 配信講義 計算科学技術特論A(2021)
 
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
 
Static Analysis of Computer programs
Static Analysis of Computer programs Static Analysis of Computer programs
Static Analysis of Computer programs
 

Mehr von taeseon ryu

OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splattingtaeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptxtaeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdftaeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extractiontaeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learningtaeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuningtaeseon ryu
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdftaeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdftaeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithmtaeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networkstaeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarizationtaeseon ryu
 
ProximalPolicyOptimization
ProximalPolicyOptimizationProximalPolicyOptimization
ProximalPolicyOptimizationtaeseon ryu
 

Mehr von taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 
ProximalPolicyOptimization
ProximalPolicyOptimizationProximalPolicyOptimization
ProximalPolicyOptimization
 

Kürzlich hochgeladen

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一F sss
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 

Kürzlich hochgeladen (20)

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 

220206 transformer interpretability beyond attention visualization

  • 1. Fundamental Team 김동현, 김채현, 박종익, 송헌, 양현모, 오대환, 이근배, 이재윤, 조남경 Transformer Interpretability Beyond Attention Visualization Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
  • 2. C O N T E N T S 1 Introduction 3 Method 4 Performance Evaluation 5 Conclusion 3.1 Relevance and gradients 3.2 Non parametric relevance propagation 3.3 Relevance and gradient diffusion 3.4 Obtaining the image relevance map 2 Related Work 2.1 Explainability in computer vision 2.2 Explainability for Transformers 4.1 Setting 4.2 Results
  • 3. 1 Introduction Transformer Interpretability Beyond Attention Visualization Motivation 3 Transformers are currently the SOTA methods in almost all NLP benchmarks The power of these methods has led to their adaptation in the filed of CV, RS Importance of Transformer networks neccesitates tools for the visualization of their decision process 1. Aid in debugging the models 2. Help verity that the models are fair and unbiased 3. Enable downstream tasks Main building block of Transformer network : self-attention layers assign pairwise attention value between every two tokens NLP : token is typically a word CV : each token can be associated with a patch
  • 4. 1 Introduction Attention is All You Need Transformer 4 Self Attention : Multi-Head Attention 𝑆𝐴 𝑞, 𝐾, 𝑉 = 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑄𝐾𝑇 𝑑𝑘 𝑉 MatMul 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑄𝐾𝑇 𝑑𝑘 𝑉 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 𝑄𝐾𝑇 𝑑𝑘 𝑄𝐾𝑇 𝑑𝑘 𝑓 𝐾, 𝑄 = 𝑄𝐾𝑇 (𝐾 = 𝐾𝑊𝐾 , 𝑄 = 𝑄𝑊𝑄 , 𝑉 = 𝑉𝑊𝑉 ) Scale MatMul Mask (opt.) Softmax
  • 5. 1 Introduction Neural Machine Translation by Jointly Learning to Align and Translate Attention 5 Visualization - We can know what information each output referred to by using Attention weight - For each output, we can know which word the output word weighted more among the input words to perform attention
  • 6. 1 Introduction Transformer Interpretability Beyond Attention Visualization Rollout Method Attention = relevancy score 6 1. follow the line of work that assigns relevancy 2. propagates it such that the sum of relevancy is maintained through the layers ↓ A common practice when trying to visualize Transformer Model : consider these attentions as a relevancy score Usually done for a single attention layer Attention is all you need (2017) Show, attend and tell: Neural image caption generation with visual attention (2015) End-to-End Object Detection with Transformers (2020) Quantifying Attention Flow in Transformers (2020) Reassigns all attention scores by considering the pairwise attentions and assuming that attentions are combined linearly into subsequent contexts Improve results over the utilization of a single attention layer relying on simplistic assumptions, irrelevant tokens often become highlighted
  • 7. Challenges Transformer Network : heavily rely on skip connection and attention operator (both involving the matrix of two activations maps, and each leading to unique challenges) Transformers apply non-linearities other than ReLU → result in both positive features and negative features (because of the non-positive values, skip connections lead, if not carefully handled, to numeric stabilities) Self-attention layer form a challenge since a naï ve propagation through these would not maintain the total amount of relevancy Contribution 1. Introduce a relevancy propagation rule that is applicable to both positive and negative attributions 2. Present a normalization term for non-parametric layers, such as “add”(e.g. skip-connection) and mat mul 3. Integrate the attention and relevancy score, and combine the integrated result for multiple attention blocks 1 Introduction Proposed Method 7 Transformer Interpretability Beyond Attention Visualization ↓
  • 8. 2 Related Work Explainability in computer vision Learning Deep Features for Discriminative Localization Gradient based method GradCAM As an extension of CAM(Class Activation Map), GradCAM uses gradient(gradient in the backpropagation process) to solve the problem of CAM that it can not be used in the case of a model without GAP(Global Average Pooling) CAM (Class Activation Map) 𝐿𝑐 𝐶𝐴𝑀 𝑖, 𝑗 : Class c에 대한 CAM 𝑓𝑘(𝑖, 𝑗) : k번째 feature image 𝑤𝑐 𝑘 : k번째 feature image 𝑓𝑘(𝑖, 𝑗)에서 class c로 가는 weight 8 𝐿𝑐 𝐶𝐴𝑀 𝑖, 𝑗 = ෍ 𝑘 𝑤𝑐 𝑘𝑓𝑘(𝑖, 𝑗) After the last convolutional layer, a fully-connected layer is attached to the general CNN structure For CAM calculation, the GAP layer is passed before the fully- connected layer After the GAP layer, a fully-connected layer connected to each class is attached and fine-tuned GAP : Average output of all values for each depth of the input image feature map (extreme dimensional reduction)
  • 9. 2 Related Work Explainability in computer vision Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization Gradient based method GradCAM (Gradient-weighted CAM) In GradCAM, the weight of the each feature map is given by Gradient To make GradCAM, the gradient from the Convolutional layer is used. The gradient is multiplied with the feature map of the convolutional layer, and the value passed through the ReLU Activation Function is expressed as a heat-map on the input image 𝑎𝑐 𝑘 : k번째 feature map 𝑓𝑘(𝑖, 𝑗)의 각 원소 i, j가 Output class c의 matmul 값 𝑆𝑐 에 주는 영향력의 평균 9 𝐿𝑐 𝐺𝑟𝑎𝑑−𝐶𝐴𝑀 𝑖, 𝑗 = ReLU ෍ 𝑘 𝑎𝑐 𝑘𝑓𝑘 𝑖, 𝑗 𝑎𝑐 𝑘 = 1 𝑍 ෍ 𝑖 ෍ 𝑗 𝜕𝑆𝑐 𝜕𝑓𝑘 𝑖, 𝑗
  • 10. 2 Related Work Explainability in computer vision Gradient based method GradCAM (Gradient-weighted CAM) Different heat-maps are created depending on which class GradCAM is shown There are dogs and cats in the example, but there are differences as shown in (c), (i) depending on which class the heat-map is to be drawn Red indicates strong influence, and blue indicates less influence In (e), (k), blue corresponds to the evidence of the class 10 Grad-CAM : Visual Explanations from Deep Networks via Gradient-based Localization
  • 11. 2 Related Work Explainability in computer vision Explaining Decisions of Neural Networks by LRP LRP (Layer-wise Relevance Propagation) Understanding why the Neural Network model made such a decision through decomposition If the input x consists of d dimension, it is assumed that each feature of d dimension has different influence in driving the final output, and this relevance score is calculated and analyzed Attribution propagation method 𝑥 : sample image 𝑓(𝑥) : image x에 대한 prediction 𝑅𝑖 : prediction 𝑓(𝑥)를 얻기 위해 image x의 각 pixel들이 기여하는 정도 각 차원의 relevance score LRP result heatmap Display the contribution(Relevance Score) of each pixel of image x By looking at the top right(rooster’s beak or head) outputs the prediction for x as “Rooster” 11
  • 12. 2 Related Work Explainability in computer vision Explaining Decisions of Neural Networks by LRP LRP (Layer-wise Relevance Propagation) Redistribute the contribution of Relevance Score from output to input in a top-down manner Basic assumptions of LRP - Each neuron has a certain relevance - Contribution is redistributed from the output of each neural to the input in a top-down manner - Preservation of contribution when distributed Not clear how to apply it directly to Transformer, and it is not good in terms of computational complexity and performance Attribution propagation method 12
  • 13. 2 Related Work Explainability in Transformers Transformer Interpretability Beyond Attention Visualization Not many contributions that explore the field of visualization for Transformer : many contributions employ the attention scores themselves This practice ignores most of the attention components, as well as the parts of the networks that perform other types of computation Self-attention head : involves the computation of queries, keys and values Reducing it only to the obtained attention scores is myopic Other layers are not even considered Proposed Method : propagates through all layers from the decision back to input Attention Score ↓ 13
  • 14. Q & A
  • 15. 3 Method Proposed Method Transformer Interpretability Beyond Attention Visualization LRP-based relevance is used to calculate the score of the attention head in each layer in the Transformer → Integrates these scores through an attention graph (Repeatedly erases the negative contribution, both relevancy and gradient information are considered) → Result : class-specific visualization for self-attention model 1. Relevance and gradients 2. Non parametric relevance propagation 3. Relevance and gradient diffusion 4. Obtaining the image relevance map 14
  • 16. 3 Method Relevance and Gradients Transformer Interpretability Beyond Attention Visualization Recalling the chain-rule, we propagate gradients with respect to the classifier’s output y at class t, namely 𝑦𝑡 𝐶 : classification head의 수 𝑡 : 𝑡 ∈ 1 … |𝐶|로 표현되는 시각화 할 class 𝑥(𝑛) : layer 𝐿(𝑛) 의 input 𝑛 : 𝑡 ∈ [1 … 𝑁] 으로 표현되는 𝑁 layers로 구성된 network의 index 𝑥(𝑁) : network의 input 𝑥(1) : network의 output 𝑦𝑡 : class 𝑡 에 대한 분류 결과 𝑗 : input 𝑥(𝑛) 의 index 𝑖 : input 𝑥(𝑛−1) 의 index Relevance propagation follows the genetic Deep Taylor Decomposition 𝐿 𝑛 (𝑋, 𝑌) : tensor X와 Y의 layer 연산 (전형적으로 이 두 tensor X와 Y는 layer 𝑛에 대한 input feature map과 weight) 𝑗 : 𝑅(𝑛) 의 index 𝑖 : 𝑅(𝑛−1) 의 index satisfies the conservation rule Gradients Relevance 15 ∇𝑥𝑗 (𝑛) ≔ 𝜕𝑦𝑡 𝜕𝑥𝑗 𝑛 = ෍ 𝑖 𝜕𝑦𝑡 𝜕𝑥𝑗 𝑛−1 𝜕𝑥𝑗 𝑛−1 𝜕𝑥𝑗 𝑛 𝑅𝑗 (𝑛) = 𝐺 𝑋, 𝑌, 𝑅 𝑛−1 = σ𝑖 𝑋𝑗 𝜕𝐿𝑖 𝑛 (𝑋, 𝑌) 𝜕𝑥𝑗 𝑅𝑖 𝑛−1 𝐿𝑖 𝑛 (𝑋, 𝑌) ෍ 𝑗 𝑅𝑗 (𝑛) = ෍ 𝑖 𝑅𝑖 (𝑛−1)
  • 17. LRP assumes ReLU non-linearity activations, resulting in non-negative feature maps, where the relevance propagation rule can be defined as follows: 3 Method Relevance and Gradients Transformer Interpretability Beyond Attention Visualization Non-linearities other than ReLU, such as GELU (Gaussian Error Linear Units), output both positive and negative values To address this, the LRP propagation on the left is modified by constructing the subset of indices 𝑞 = {(𝑖, 𝑗) | 𝑥𝑗𝑤𝑗𝑖 ≥ 0} ↓ Only the elements that have a positive weighed relevance are considered 𝑋 = 𝑥 : the layer's input 𝑌 = 𝑤 : the layer's weights max(0, 𝑣) = 𝑣⁺ Non-positive values can be omitted because they are set to 0 by ReLU The notation of the previous relevance equation is broken down into elements 16

$$R_j^{(n)} = \mathcal{G}\left(x^+, w^+, R^{(n-1)}\right) = \sum_i \frac{x_j^+ w_{ji}^+}{\sum_{j'} x_{j'}^+ w_{j'i}^+} \, R_i^{(n-1)}$$

$$R_j^{(n)} = \mathcal{G}_q\left(x, w, q, R^{(n-1)}\right) = \sum_{\{i \mid (i,j) \in q\}} \frac{x_j w_{ji}}{\sum_{\{j' \mid (j',i) \in q\}} x_{j'} w_{j'i}} \, R_i^{(n-1)}$$
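A toy sketch of this modified rule for a single linear layer (illustrative only; the contribution matrix x_j * w_ji is masked to its non-negative entries before normalizing):

```python
import torch

def lrp_positive_subset(x, w, r_prev, eps=1e-9):
    """Relevance rule restricted to index pairs q = {(i, j) | x_j * w_ji >= 0}.

    x: (d_in,), w: (d_in, d_out), r_prev: (d_out,).
    Works even when activations are negative (e.g. after GELU),
    unlike the classic ReLU-based rule on the left.
    """
    contrib = x[:, None] * w                  # (d_in, d_out): x_j * w_ji
    contrib = contrib.clamp(min=0)            # keep only the pairs in q
    denom = contrib.sum(dim=0, keepdim=True)  # normalize per output index i
    return (contrib / (denom + eps)) @ r_prev # (d_in,) input relevance

x = torch.randn(8)                 # GELU-like activations with mixed signs
w = torch.randn(8, 4)
r = torch.softmax(torch.randn(4), dim=0)
print(lrp_positive_subset(x, w, r).sum())   # ~1.0: relevance is conserved
```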
  • 18. 3 Method Transformer Interpretability Beyond Attention Visualization Non parametric relevance propagation There are two operators in the Transformer model that involve mixing two feature-map tensors 1. skip connections 2. matrix multiplications (e.g. in attention modules) Both operators require the propagation of relevance through their two input tensors Given two tensors u and v, we compute the relevance propagation of these binary operators as follows: Since the total relevance score is constant, in the case of addition the conservation rule is preserved 17 $R_j^{u^{(n)}}, R_k^{v^{(n)}}$ : relevance for 𝑢 and 𝑣 When propagating relevance through skip connections, however, we encounter numerical instabilities

$$R_j^{u^{(n)}} = \mathcal{G}\left(u, v, R^{(n-1)}\right), \qquad R_k^{v^{(n)}} = \mathcal{G}\left(v, u, R^{(n-1)}\right)$$

$$\sum_j R_j^{u^{(n)}} + \sum_k R_k^{v^{(n)}} = \sum_i R_i^{(n-1)}$$
  • 19. 3 Method Transformer Interpretability Beyond Attention Visualization Non parametric relevance propagation To address the lack of conservation in the attention mechanism due to matrix multiplication, and the numerical issues of skip connections, the proposed method applies a normalization to $R_j^{u^{(n)}}$ and $R_k^{v^{(n)}}$ Following the conservation rule ($\sum_j R_j^{(n)} = \sum_i R_i^{(n-1)}$) and the initial relevance (which sums to 1), we obtain $\sum_i R_i^{(n)} = 1$ for each layer n 18 1. maintains the conservation rule : $\sum_j R_j^{u^{(n)}} + \sum_k R_k^{v^{(n)}} = \sum_i R_i^{(n-1)}$ 2. bounds the relevance sum of each tensor such that : $0 \le \sum_j R_j^{u^{(n)}},\ \sum_k R_k^{v^{(n)}} \le \sum_i R_i^{(n-1)}$
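The slide states the two properties but not the formula itself; to the best of our reading of the paper, the normalization (written here with a bar for the normalized relevance) takes the form

$$\bar{R}_j^{u^{(n)}} = R_j^{u^{(n)}} \cdot \frac{\bigl|\sum_{j'} R_{j'}^{u^{(n)}}\bigr|}{\bigl|\sum_{j'} R_{j'}^{u^{(n)}}\bigr| + \bigl|\sum_{k} R_{k}^{v^{(n)}}\bigr|} \cdot \frac{\sum_i R_i^{(n-1)}}{\sum_{j'} R_{j'}^{u^{(n)}}}$$

and symmetrically for $\bar{R}_k^{v^{(n)}}$. The first fraction splits the budget $\sum_i R_i^{(n-1)}$ between the two tensors in proportion to the magnitudes of their relevance sums, which yields both properties above.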
  • 20. Transformer Model 𝑀 input : a sequence of 𝑠 tokens, each of dimension 𝑑 (with a special classification token [CLS]) output : a classification probability vector 𝑦 of length 𝐶 (computed from the classification token) each self-attention module operates on a smaller slice 𝑑ℎ of the embedding dimension 𝑑, such that ℎ𝑑ℎ = 𝑑 3 Method Relevance and Gradient Diffusion Transformer Interpretability Beyond Attention Visualization 𝑀 : Transformer model 𝐵 : the number of blocks in the Transformer model; for 𝑏 ∈ {1, …, 𝐵}, each block b consists of self-attention, skip connections, an additional linear layer, and normalization layers 𝑠 : the length of the token sequence fed to the Transformer model 𝑑 : the dimension of the tokens fed to the Transformer model ℎ : the number of "heads" 𝑂(𝑏) : the output of block 𝑏's attention module (dimension ℎ × 𝑠 × 𝑑ℎ) 𝑄(𝑏), 𝐾(𝑏), 𝑉(𝑏) : the query, key, and value inputs of block 𝑏, i.e. projections of the input 𝑥(𝑛) used by self-attention (dimension ℎ × 𝑠 × 𝑑ℎ) 𝐴(𝑏) : the attention map of block 𝑏 (each row 𝑖 holds the attention coefficients of all tokens with respect to token 𝑖) because softmax is applied, each row of every attention head of 𝐴(𝑏) sums to 1 (dimension ℎ × 𝑠 × 𝑠) 19
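For concreteness, the shapes involved can be checked with a small sketch (the ViT-Base-like sizes below are our assumption, not from the slides):

```python
import torch

h, s, d_h = 12, 197, 64      # heads, tokens (196 patches + [CLS]), head dim
d = h * d_h                   # embedding dimension: 768

Q = torch.randn(h, s, d_h)    # projected queries  Q^(b)
K = torch.randn(h, s, d_h)    # projected keys     K^(b)
V = torch.randn(h, s, d_h)    # projected values   V^(b)

A = torch.softmax(Q @ K.transpose(-2, -1) / d_h**0.5, dim=-1)  # A^(b): (h, s, s)
O = A @ V                     # O^(b): (h, s, d_h), the attention module output
print(A.shape, A[0, 0].sum()) # each row sums to 1 because of the softmax
```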
  • 21. In 𝐴(𝑏), each head plays a different role; row 𝑖 of a head's attention map can be read as a pairwise probability distribution over the other tokens with respect to the 𝑖-th token (pixel) Based on this process, the relevance 𝑅(𝑛𝑏) and the gradients 𝛻𝐴(𝑏) of each attention map 𝐴(𝑏) can be obtained The final output 𝐶 ∈ ℝ𝑠×𝑠 is then defined by the weighted attention relevance 3 Method Relevance and Gradient Diffusion Transformer Interpretability Beyond Attention Visualization 20 𝛻𝐴(𝑏) : the gradients of the attention map 𝐴(𝑏) 𝑅(𝑛𝑏) : the relevance of the attention map 𝐴(𝑏), i.e. the relevance of layer 𝑛𝑏 𝑛𝑏 : the layer corresponding to the softmax operation of block 𝑏 ⨀ : Hadamard product 𝔼ℎ : mean over the "heads" dimension To compute the weighted attention relevance, only the positive values of the gradient-relevance multiplication are considered To account for the skip connections in the Transformer block, an identity matrix is added, which also avoids self-inhibition for each token
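Written out (per our reading of the paper), the weighted attention relevance and its rollout across blocks are

$$\bar{A}^{(b)} = I + \mathbb{E}_h\left[\left(\nabla A^{(b)} \odot R^{(n_b)}\right)^{+}\right], \qquad C = \bar{A}^{(1)} \cdot \bar{A}^{(2)} \cdots \bar{A}^{(B)}$$

where $(\cdot)^{+}$ keeps only the positive values of the gradient-relevance product and $I$ is the identity matrix added for the skip connections.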
  • 22. 3 Method Transformer Interpretability Beyond Attention Visualization Obtaining the image relevance map The result matrix of the proposed method 𝐶 has size 𝑠 × 𝑠 (𝑠 : the sequence length of the input fed to the Transformer) Each row corresponds to a relevance map for one token given the other tokens Since this study is concerned with classification, only the [CLS] token, which aggregates the information used for classification, is considered So the relevance map is extracted from the row of 𝐶 corresponding to the [CLS] token, 𝐶[𝐶𝐿𝑆] ∈ ℝ𝑠 This row contains a score quantifying the influence of each token on the classification token * CLS : special classification token Usually, after passing through the Transformer, the CLS token carries an aggregate representation of the token sequence, which can easily be classified by attaching a classifier head 21
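A sketch of this final step for a ViT-style model (the 14×14 patch grid, token 0 as [CLS], and bilinear upsampling are our assumptions, matching ViT-Base/16 at 224px):

```python
import torch
import torch.nn.functional as F

def cls_relevance_to_heatmap(C, grid=14, image_size=224):
    """Extract the [CLS] row of C (s x s) and reshape it into an image map.

    Assumes token 0 is [CLS] and the remaining s-1 = grid*grid tokens
    are image patches in row-major order.
    """
    cls_row = C[0, 1:]                         # relevance of each patch for [CLS]
    heat = cls_row.reshape(1, 1, grid, grid)
    heat = F.interpolate(heat, size=(image_size, image_size), mode="bilinear")
    heat = heat.squeeze()
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-9)  # to [0, 1]

C = torch.rand(197, 197)                       # toy relevance matrix from the rollout
print(cls_relevance_to_heatmap(C).shape)       # torch.Size([224, 224])
```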
  • 23. Q & A
  • 24. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization Setting 22 Datasets : ImageNet (ILSVRC) 2012 (50K validation images of 1000 classes) ImageNet-Segmentation (an annotated subset of ImageNet) Evaluation metrics : positive and negative perturbation → AUC results (percents) segmentation performance → pixel accuracy, mAP (mean Average Precision), mIoU (mean Intersection-over-Union) Baselines : attention map → rollout, raw attention class-specific → GradCAM (Gradient-weighted Class Activation Map) relevance propagation → LRP (Layer-wise Relevance Propagation)
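A sketch of the positive-perturbation protocol used here (the step schedule and masking with zeros are our assumptions; the paper's exact protocol may differ):

```python
import torch

@torch.no_grad()
def positive_perturbation_auc(model, image, relevance, label, steps=10):
    """Remove pixels from most- to least-relevant and track accuracy.

    For positive perturbation a good explanation makes accuracy drop
    fast, so a *lower* area-under-the-curve is better; negative
    perturbation removes least-relevant pixels first, higher is better.
    """
    order = relevance.flatten().argsort(descending=True)  # most relevant first
    accs = []
    for step in range(steps + 1):
        x = image.clone().flatten(-2)                     # (3, H*W)
        k = int(order.numel() * step / steps)
        x[:, order[:k]] = 0                               # mask top-k relevant pixels
        pred = model(x.reshape_as(image)[None]).argmax(-1)
        accs.append((pred == label).float().item())
    return sum(accs) / len(accs)                          # mean accuracy ~ AUC
```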
  • 25. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization Qualitative Evaluation Result : Qualitative Evaluation 23 Sample results: as can be seen, the proposed method produces more accurate visualizations Class-specific visualization For each image, results are presented for two different classes Visual comparison of the various baselines and the proposed method The baselines produce inconsistent visualizations, while the proposed method provides clearer and more consistent ones Image with 2 classes All methods except GradCAM create similar visualizations for both classes, while the proposed method provides two distinct, accurate visualizations
  • 26. 4 Performance Evaluation Transformer Interpretability Beyond Attention Visualization Result : Perturbation Test, Segmentation 24 Segmentation Results of the experiment on the ImageNet-Segmentation dataset, using segmentation metrics (pixel accuracy, mAP, and mIoU) The proposed method surpasses all baselines by a significant margin Segmentation performance on the ImageNet-Segmentation dataset (percent) Higher is better Perturbation Tests AUC results obtained from negative and positive perturbation tests, for both the predicted and the target class The proposed method achieves better results by a large margin in both experiments (rollout and raw attention produce a fixed visualization for a given input image, so they are excluded from the target-class test) Positive and negative perturbation AUC results (percents) for the predicted and target classes, on the ImageNet validation set For positive perturbation lower is better, and for negative perturbation higher is better
  • 27. 5 Conclusion • Suggests a method to interpret and visualize the decision-making process of the Transformer model • Provides a specific solution for each challenge : a relevance propagation rule that handles both positive and negative attributions a normalization term for non-parametric layers such as skip connections integrating attention and relevancy scores, and combining the integrated results across multiple attention blocks • Achieves state-of-the-art results when compared with other Transformer interpretation methods Transformer Interpretability Beyond Attention Visualization 25 • Literature on the interpretability of Transformers is scarce, owing to : the use of non-positive activation functions the frequent use of skip connections the challenge of modeling the matrix multiplication used in self-attention