Learning visual explanations for DCNN-based image classifiers using an attention mechanism

Ioanna Gkartzonika, Nikolaos Gkalelis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
European Conference on Computer Vision (ECCV),
Vision with Biased or Scarce data (VBSD),
October 2022

2
• Deep learning (DL) models for image classification have become very successful
• However, they are complex, and their decisions are difficult to interpret
• E.g.: VGG-16 trained to categorize images into one of the ImageNet categories
Introduction
[Figure: two input images with ground truth “rugby ball” passed through the VGG-16 classifier; one is predicted as “rugby ball”, the other as “football helmet”]
• Why is the first image correctly classified (although the rugby ball is barely seen)
while the second one is misclassified as “football helmet”?

3
• Goal: produce a saliency map (SM) depicting the image regions that explain the
decision of the classifier
Introduction
[Figure: input image and its SM]
• Post-hoc approaches: uncover the inference mechanism of a trained model
• Different from methods that jointly train the classifier and an explanation
mechanism!

4
• AI explanation should not be confused with weakly-supervised localization tasks!
• E.g.: the SMs below fail to accurately localize the objects of interest
• However, these are good SMs explaining the decision strategy of the classifier
Introduction
[Figure: input image (ground truth: padlock) with the SM superimposed; the classifier recognizes class “padlock” by looking at both the padlock and the padlock’s chains]
[Figure: input image (ground truth: snowmobile) with the SM superimposed; the human along with the snowmobile helps the classifier to make its decision]

5
• Measure pixel-wise contribution to the classification confidence score
• Average Drop (AD) - average model’s confidence score drop when masked test images are used:
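\[ \mathrm{AD}(\nu\%) = \sum_{i=1}^{\Upsilon} \frac{\max\left(0,\; f(\mathbf{X}_i) - f(\mathbf{X}_i \odot \phi_\nu(\mathbf{V}_i))\right)}{\Upsilon\, f(\mathbf{X}_i)} \cdot 100 \]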
• Increase in Confidence (IC) - portion of test images for which the model’s confidence score
increased when the masked images are used:
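\[ \mathrm{IC}(\nu\%) = \sum_{i=1}^{\Upsilon} \frac{\delta\left(f(\mathbf{X}_i \odot \phi_\nu(\mathbf{V}_i)) > f(\mathbf{X}_i)\right)}{\Upsilon} \cdot 100 \]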
φ_ν: threshold function that selects the ν% highest-valued pixels of its input
X_i, V_i: i-th input image and corresponding SM; Υ: number of test images
f(·): the model’s confidence score; δ(·): 1 if its condition holds, 0 otherwise
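The two measures above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the confidence scores are made-up toy values, and `phi` assumes that the threshold function keeps the values of the ν% highest-valued SM pixels and zeroes the rest.

```python
import numpy as np

def phi(sm, nu):
    """Assumed threshold function: keep the nu% highest-valued pixels, zero the rest."""
    thresh = np.percentile(sm, 100.0 - nu)
    return np.where(sm >= thresh, sm, 0.0)

def average_drop(orig, masked):
    """AD: mean relative confidence drop (clipped at 0), in percent."""
    orig, masked = np.asarray(orig, float), np.asarray(masked, float)
    return 100.0 * np.mean(np.maximum(0.0, orig - masked) / orig)

def increase_in_confidence(orig, masked):
    """IC: percentage of images whose confidence rises when masked."""
    orig, masked = np.asarray(orig, float), np.asarray(masked, float)
    return 100.0 * np.mean(masked > orig)

# Toy confidence scores f(X_i) and f(X_i * phi(V_i)) for four images (illustrative only)
orig_scores = [0.8, 0.5, 0.9, 0.4]
masked_scores = [0.4, 0.6, 0.9, 0.2]
ad = average_drop(orig_scores, masked_scores)
ic = increase_in_confidence(orig_scores, masked_scores)

# Toy saliency map to exercise phi
toy_sm = np.arange(16, dtype=float).reshape(4, 4) / 15.0
top_half = phi(toy_sm, 50)
```

In a real evaluation the masked scores would come from a forward pass of the classifier on each masked image; lower AD and higher IC indicate a better SM.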
Evaluation measures

6
• Given K feature maps (FMs) and class label r derived by the classifier for the
specified input image, utilize class-specific weights: w_1^{(r)}, w_2^{(r)}, …, w_K^{(r)}
• Compute the weighted sum of FMs to derive the class activation map (CAM)
• Normalize (e.g. min-max) and upscale CAM to get SM
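The three steps above (weighted sum, normalization, upscaling) can be sketched as follows. This is an illustrative sketch under stated assumptions: the feature maps and weights are random placeholders, the shapes mimic a 512 x 14 x 14 feature tensor and a 224 x 224 input, and nearest-neighbour upscaling via `np.kron` stands in for whatever interpolation a given method uses.

```python
import numpy as np

def class_activation_map(fms, weights):
    """Weighted sum of K feature maps; fms: (K, P, Q), weights: (K,) -> (P, Q)."""
    return np.tensordot(weights, fms, axes=1)

def min_max_normalize(cam, eps=1e-8):
    """Rescale the CAM to [0, 1); eps guards against a constant map."""
    return (cam - cam.min()) / (cam.max() - cam.min() + eps)

def upscale_nearest(cam, factor):
    """Nearest-neighbour upscaling (assumption; real pipelines often interpolate)."""
    return np.kron(cam, np.ones((factor, factor)))

rng = np.random.default_rng(0)
fms = rng.random((512, 14, 14))   # placeholder feature maps
w_r = rng.normal(size=512)        # placeholder class-specific weights
cam = class_activation_map(fms, w_r)
sm = upscale_nearest(min_max_normalize(cam), 16)
```

The methods discussed in this deck differ mainly in how the weights `w_r` are obtained, not in this final combination step.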
Related work: the general approach
[Figure: the feature maps of the input image are weighted by w_1^{(r)}, w_2^{(r)}, …, w_K^{(r)} and summed into the CAM, which is normalized and upscaled to give the SM]

7
• Gradient-based methods: gradients backpropagated from the output to compute
the weights and produce the SM – gradients are noisy
• Perturbation-based methods: forward pass the input image perturbed by the k-th
FM; the derived score is used as weight for the corresponding FM – needs K
forward passes to produce the SM
• The above methods compute the weights and produce the SM at the inference stage
• Can we use a training dataset to learn to produce class-specific SMs?
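To make the cost of the perturbation-based category concrete, here is a rough sketch of the weighting scheme described above: each FM is normalized, upscaled and used to mask the input, and the resulting score becomes that FM's weight. Everything here is an assumption for illustration: `score_fn` is a hypothetical stand-in for a classifier forward pass, the image is square and single-channel, and details that specific methods add (such as post-processing of the scores) are omitted.

```python
import numpy as np

def perturbation_weights(image, fms, score_fn):
    """One forward pass per feature map: K calls to score_fn in total."""
    K, P, _ = fms.shape
    factor = image.shape[0] // P          # assumes image side is a multiple of P
    weights = np.empty(K)
    for k in range(K):
        fm = fms[k]
        mask = (fm - fm.min()) / (fm.max() - fm.min() + 1e-8)  # min-max normalize
        mask = np.kron(mask, np.ones((factor, factor)))        # nearest-neighbour upscale
        weights[k] = score_fn(image * mask)                    # score of perturbed input
    return weights

calls = []

def toy_score(x):
    """Hypothetical score function: records each call, returns mean intensity."""
    calls.append(1)
    return float(x.mean())

rng = np.random.default_rng(0)
toy_image = rng.random((8, 8))
toy_fms = rng.random((3, 4, 4))
w = perturbation_weights(toy_image, toy_fms, toy_score)
```

With K = 512 (VGG-16) or K = 2048 (ResNet-50) feature maps, the loop above is what makes such methods expensive at inference time.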
Related work: main categories

8
• An attention layer of K × R weights and R biases
is introduced; CAM is computed using
\[ L^{(r)} = \sigma\left( \sum_{k=1}^{K} w_k^{(r)} \mathbf{A}_{:,:,k} + b^{(r)} \mathbf{J} \right) \]
σ(·): element-wise sigmoid function
A_{:,:,k}: k-th FM of last conv. layer
J: all-ones matrix, same size as A_{:,:,k}
• Training set of R classes is used to train the
attention layer; original backbone is frozen
• L-CAM-Fm: L^{(r)} multiplies each FM element-wise
• L-CAM-Img: L^{(r)} is upscaled and multiplied element-wise
with the input image
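The attention-layer formula and the two masking variants can be illustrated with NumPy. This is a sketch under assumptions, not the released implementation: parameters are random placeholders rather than trained values, shapes follow the VGG-16 case (14 x 14 x 512 FMs, 1000 classes, 224 x 224 input), nearest-neighbour upscaling is used, and the normalization step that the L-CAM-Img diagram shows before upscaling is left out for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_map(A, W, b, r):
    """L^(r) = sigmoid(sum_k w_k^(r) A[:, :, k] + b^(r) J); A: (P, Q, K), W: (K, R), b: (R,)."""
    return sigmoid(A @ W[:, r] + b[r])   # scalar b[r] broadcasts like b^(r) * J

def mask_feature_maps(A, L):
    """L-CAM-Fm style masking: L^(r) multiplies every feature map element-wise."""
    return A * L[:, :, None]

def mask_image(X, L):
    """L-CAM-Img style masking: upscale L^(r) and multiply the image (normalization omitted)."""
    factor = X.shape[0] // L.shape[0]
    L_up = np.kron(L, np.ones((factor, factor)))
    return X * L_up[:, :, None]

rng = np.random.default_rng(0)
A = rng.random((14, 14, 512))             # placeholder feature maps
W = 0.01 * rng.normal(size=(512, 1000))   # placeholder attention weights
b = np.zeros(1000)                        # placeholder biases
X = rng.random((224, 224, 3))             # placeholder input image
L = attention_map(A, W, b, r=7)           # r=7 is an arbitrary class index
A_masked = mask_feature_maps(A, L)
X_masked = mask_image(X, L)
```

During training only `W` and `b` would be updated, since the backbone is frozen.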
Learning-based CAM: Training
[Figure: training architectures of the two L-CAM variants. Components shown: train. image; convolution & pooling layers, etc.; feature maps of last conv. layer; attention layer driven by the target-class label; explanation; classifier output; and, in one of the two diagrams, normalization & upscaling producing a masked image]

9
• Loss function:
\[ \lambda_1 \mathrm{TV}(L^{(r)}) + \lambda_2 \mathrm{AV}(L^{(r)}) + \lambda_3 \mathrm{CE}(r, u) \]
Cross-entropy loss: CE(r, u)
Energy loss: \( \mathrm{AV}(S) = (PQ)^{-1} \sum_{p,q} (s_{p,q})^{\lambda_4} \)
Variation loss: \( \mathrm{TV}(S) = \sum_{p,q} \left[ (s_{p,q} - s_{p,q+1})^2 + (s_{p,q} - s_{p+1,q})^2 \right] \)
u: confidence score derived from the L-CAM network for model-truth class r
P, Q: width, height of FMs
Regularization parameters: λ1, λ2, λ3, λ4
• Overall loss effect: remove spurious/noise areas in the SM and retain the most
salient parts for the classification decision
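The loss terms above can be written out directly. This sketch makes two assumptions that the slide does not spell out: differences in the variation term are summed only over neighbouring pairs that exist inside the map (no wrap-around at the border), and the cross-entropy term reduces to −log u for the confidence score u of class r. The λ values in the example are arbitrary, not the ones used in the experiments.

```python
import numpy as np

def total_variation(S):
    """TV(S): squared differences between horizontal and vertical neighbours."""
    dq = S[:, :-1] - S[:, 1:]
    dp = S[:-1, :] - S[1:, :]
    return np.sum(dq ** 2) + np.sum(dp ** 2)

def average_energy(S, lam4):
    """AV(S): mean over the P x Q map of s_{p,q} raised to lam4."""
    return np.mean(S ** lam4)

def lcam_loss(S, u, lam1, lam2, lam3, lam4):
    """lam1 * TV + lam2 * AV + lam3 * CE, with CE assumed to be -log(u)."""
    ce = -np.log(u)
    return lam1 * total_variation(S) + lam2 * average_energy(S, lam4) + lam3 * ce

flat = np.full((14, 14), 0.5)                 # constant map: no variation
step = np.array([[0.0, 1.0], [0.0, 1.0]])     # one vertical edge
loss_flat = lcam_loss(flat, u=1.0, lam1=1.0, lam2=1.0, lam3=1.0, lam4=1.0)
```

The variation term penalizes noisy, fragmented maps, while the energy term penalizes maps that are active everywhere; the cross-entropy term keeps the masked input recognizable as class r.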
Learning-based CAM: Training

10
• At the inference stage, both L-CAM variants operate identically
• The input image is fed to the network to derive the FMs and inferred label
• The inferred label is used to select the class-specific weights and bias of the
trained attention layer and compute the explanation
Learning-based CAM: Inference
[Figure: inference. Input image → convolution & pooling layers, etc. → feature maps of last conv. layer → classifier output; the output class (to be explained) selects the attention-layer parameters that produce the explanation]

11
• Backbones: VGG-16 (512 x 14 x 14), ResNet-50 (2048 x 7 x 7)
• L-CAM variations: i) L-CAM-Fm/Img: proposed, ii) L-CAM-Fm*/Img*: only CE loss component is
used, iii) L-CAM-Fm† with VGG-16: 7 x 7 FMs after avg pool layer (for compatibility with RISE)
• Comparisons: Grad-CAM, Grad-CAM++, Score-CAM, RISE
• Measures: AD, IC for different thresholds, i.e., ν = 100%, 50%, 15%; Number of FW passes
• Training L-CAM: ImageNet training images (1.3 mil., 1000 classes)
• Testing: 2000 randomly selected images (as in other works), due to the high computational cost
of perturbation-based approaches
Experiments

12
VGG-16      AD(100%)  IC(100%)  AD(50%)  IC(50%)  AD(15%)  IC(15%)  #FW
Grad-CAM    32.12     22.1      58.65    9.5      84.15    2.2      1
Grad-CAM++  30.75     22.05     54.11    11.15    82.72    3.15     1
Score-CAM   27.75     22.8      45.6     14.1     75.7     4.3      512
RISE        8.74      51.3      42.42    17.55    78.7     4.45     4000
L-CAM-Fm*   20.63     31.05     51.34    13.45    82.4     3.05     1
L-CAM-Fm    16.47     35.4      47       14.45    79.39    3.65     1
L-CAM-Img*  18.01     37.2      50.88    12.05    82.1     3        1
L-CAM-Img   12.96     41.25     45.56    14.9     78.14    4.2      1
L-CAM-Fm†   12.15     40.95     37.37    20.25    74.23    4.45     1
Experimental results
• L-CAM outperforms gradient-based methods and is comparable to perturbation-based
ones, but faster: it requires just one forward pass instead of hundreds or thousands
• L-CAM-Img is better than L-CAM-Fm, probably because the former directly perturbs
the input image
• L-CAM-Img* and L-CAM-Fm* perform worse than the proposed variants, showing that
the energy and variation loss components are important
• With 7 x 7 FMs we got the best performance, probably due to the curse of
dimensionality (learning-based methods can more easily learn the FMs’ combination
in a lower-dimensional space)
ResNet-50   AD(100%)  IC(100%)  AD(50%)  IC(50%)  AD(15%)  IC(15%)  #FW
Grad-CAM    13.61     38.01     29.28    23.05    78.61    3.4      1
Grad-CAM++  13.63     37.95     30.37    23.45    79.58    3.4      1
Score-CAM   11.01     39.55     26.8     24.75    78.72    3.6      2048
RISE        11.12     46.15     36.31    21.55    82.05    3.2      8000
L-CAM-Fm*   14.44     35.45     32.18    20.5     80.66    2.9      1
L-CAM-Fm    12.16     40.2      29.44    23.4     78.64    4.1      1
L-CAM-Img*  15.93     32.8      39.9     14.85    84.67    2.25     1
L-CAM-Img   11.09     43.75     29.12    24.1     79.41    3.9      1

13
Explanation examples
• Can produce class specific SMs
• Can correctly identify the contribution of two instances of the same class
[Figure: an input image with class-specific SMs for “Maltese” and “soccer ball”; an input image with ground truth “porcupine”]

14
• Misclassification due to correlated classes
[Figure: input image; ground truth: coach, predicted: minibus]
• Misclassification due to the presence of two different ImageNet classes in the image
[Figure: input image; ground truth: tennis ball, predicted: Tibetan terrier]

15
• Why did the classifier not classify the second image correctly?
[Figure: two input images; one with predicted = ground truth: rugby ball, the other with ground truth: rugby ball, predicted: football helmet]

16
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/L-CAM
This work was supported by the EU's Horizon 2020 research and innovation programme under grant
agreements 101021866 CRiTERIA and 951911 AI4Media
