I. Gkartzonika, N. Gkalelis, V. Mezaris, "Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism", Proc. ECCV 2022 Workshop on Vision with Biased or Scarce Data (VBSD), Oct. 2022.
This paper proposes two new learning-based eXplainable AI (XAI) methods for deep convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and L-CAM-Img. Both methods use an attention mechanism that is inserted in the original (frozen) DCNN and is trained to derive class activation maps (CAMs) from the last convolutional layer’s feature maps. During training, CAMs are applied to the feature maps (L-CAM-Fm) or the input image (L-CAM-Img), forcing the attention mechanism to learn the image regions explaining the DCNN’s outcome. Experimental evaluation on ImageNet shows that the proposed methods achieve competitive results while requiring a single forward pass at the inference stage. Moreover, based on the derived explanations, a comprehensive qualitative analysis is performed, providing valuable insight into the reasons behind classification errors, including possible dataset biases affecting the trained classifier.
Learning visual explanations for DCNN-based image classifiers using an attention mechanism
1.
Learning visual explanations for DCNN-based image classifiers
using an attention mechanism
Ioanna Gkartzonika, Nikolaos Gkalelis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
European Conference on Computer Vision (ECCV),
Vision with Biased or Scarce data (VBSD),
October 2022
2. 2
• Deep learning (DL) models for image classification have become very successful
• However, they are complex and their decisions are difficult to interpret
• E.g.: VGG-16 trained to categorize images into one of the ImageNet categories
Introduction
[Figure: two input images with ground truth “rugby ball” fed to the VGG-16 classifier; predictions: “rugby ball” (first image) and “football helmet” (second image)]
• Why is the first image correctly classified (although the rugby ball is barely seen)
while the second one is misclassified as “football helmet”?
3. 3
• Goal: produce a saliency map (SM) depicting the image regions that explain the
decision of the classifier
Introduction
[Figure: input image and its SM]
• Post-hoc approaches: uncover the inference mechanism of a trained model
• Different from methods that jointly train the classifier and an explanation
mechanism!
4. 4
• AI explanation should not be confused with weakly-supervised localization tasks!
• E.g.: the SMs below fail to accurately localize the objects of interest
• However, these are good SMs explaining the decision strategy of the classifier
Introduction
[Figure: input image (ground truth: padlock) with the SM superimposed: the classifier recognizes class “padlock” by looking at both the padlock and the padlock’s chains]
[Figure: input image (ground truth: snowmobile) with the SM superimposed: the human along with the snowmobile help the classifier to make its decision]
5. 5
• Measure pixel-wise contribution to the classification confidence score
• Average Drop (AD) - average drop of the model’s confidence score when masked test images are used:
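$$\mathrm{AD}(\nu\%) = \sum_{i=1}^{\Upsilon} \frac{\max\left(0,\; f(\mathbf{X}_i) - f(\mathbf{X}_i \odot \varphi_\nu(\mathbf{V}_i))\right)}{\Upsilon\, f(\mathbf{X}_i)} \cdot 100$$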
• Increase in Confidence (IC) - portion of test images for which the model’s confidence score
increased when the masked images are used:
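$$\mathrm{IC}(\nu\%) = \sum_{i=1}^{\Upsilon} \frac{\delta\left(f(\mathbf{X}_i \odot \varphi_\nu(\mathbf{V}_i)) > f(\mathbf{X}_i)\right)}{\Upsilon} \cdot 100$$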
φν: threshold function that selects the ν% highest-valued pixels of its input (here, the SM)
δ(): equals 1 when the condition in its argument holds, 0 otherwise
Xi, Vi: i-th input image and corresponding SM; Υ: number of test images
Evaluation measures
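The two measures can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: `f` is a hypothetical scoring function returning the confidence for the class being explained, images are assumed to be H x W x C arrays with H x W SMs, and φν is assumed to keep the SM values of the selected pixels and zero out the rest.

```python
import numpy as np

def keep_top_percent(sm, nu):
    """Assumed form of phi_nu: keep the nu% highest SM values, zero the rest."""
    n_keep = int(round(sm.size * nu / 100.0))
    out = np.zeros(sm.size, dtype=float)
    if n_keep > 0:
        top = np.argsort(sm.ravel())[-n_keep:]   # indices of the largest values
        out[top] = sm.ravel()[top]
    return out.reshape(sm.shape)

def average_drop_and_increase(images, sms, f, nu):
    """Return (AD, IC) in percent for score function f (hypothetical)."""
    drops, increases = [], []
    for x, v in zip(images, sms):
        base = f(x)                                      # f(X_i)
        masked = f(x * keep_top_percent(v, nu)[..., None])  # f(X_i ⊙ phi_nu(V_i))
        drops.append(max(0.0, base - masked) / base)
        increases.append(1.0 if masked > base else 0.0)
    return 100.0 * np.mean(drops), 100.0 * np.mean(increases)
```

Lower AD and higher IC indicate that the retained pixels preserve (or even raise) the classifier's confidence.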
6. 6
• Given K feature maps (FMs) and class label r derived by the classifier for the
specified input image, utilize class-specific weights: $w_1^{(r)}, w_2^{(r)}, \dots, w_K^{(r)}$
• Compute the weighted sum of FMs to derive the class activation map (CAM)
• Normalize (e.g. min-max) and upscale CAM to get SM
Related work: the general approach
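[Figure: the feature maps of the input image are weighted by $w_1^{(r)}, w_2^{(r)}, \dots, w_K^{(r)}$ and summed into the CAM, which is normalized and upscaled to give the SM]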
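The general CAM recipe on this slide (weighted sum, normalization, upscaling) can be sketched as follows. This is only an illustration under stated assumptions: the function name `cam_to_saliency` is made up, min-max normalization is used as the slide suggests, and nearest-neighbour upscaling is an assumption chosen for simplicity (the output size must be an integer multiple of the FM size).

```python
import numpy as np

def cam_to_saliency(fms, weights, out_size):
    """fms: (P, Q, K) feature maps; weights: (K,) class-specific weights;
    out_size: (H, W), assumed to be integer multiples of (P, Q)."""
    # Weighted sum of the K feature maps -> class activation map of shape (P, Q)
    cam = np.tensordot(fms, weights, axes=([2], [0]))
    # Min-max normalization to [0, 1]; a constant CAM maps to all zeros
    lo, hi = cam.min(), cam.max()
    cam = (cam - lo) / (hi - lo) if hi > lo else np.zeros_like(cam)
    # Nearest-neighbour upscaling (assumption) to the input image size
    ry, rx = out_size[0] // cam.shape[0], out_size[1] // cam.shape[1]
    return np.kron(cam, np.ones((ry, rx)))
```

The methods compared in this deck differ mainly in how `weights` is obtained, not in this final step.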
7. 7
• Gradient-based methods: gradients backpropagated from the output to compute
the weights and produce the SM – gradients are noisy
• Perturbation-based methods: forward pass the input image perturbed by the k-th
FM; the derived score is used as weight for the corresponding FM – needs K
forward passes to produce the SM
• The above methods compute the weights and produce the SM at the inference stage
• Can we use a training dataset to learn to produce class-specific SMs?
Related work: main categories
8. 8
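• An attention layer of K × R weights and R biases
is introduced; the CAM is computed using
$$\mathbf{L}^{(r)} = \sigma\left( \sum_{k=1}^{K} w_k^{(r)} \mathbf{A}_{:,:,k} + b^{(r)} \mathbf{J} \right)$$
σ(): element-wise sigmoid function
$\mathbf{A}_{:,:,k}$: k-th FM of the last conv. layer
$\mathbf{J}$: all-ones matrix of the same size as $\mathbf{A}_{:,:,k}$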
• Training set of R classes is used to train the
attention layer; original backbone is frozen
• L-CAM-Fm: L(r) is multiplied element-wise with each FM
• L-CAM-Img: L(r) is upscaled and multiplied element-wise
with the input image
Learning-based CAM: Training
[Figure: L-CAM-Img training diagram; labels: training image, convolution & pooling layers etc., feature maps of last conv. layer, attention layer, target-class label, explanation, normalization & upscaling, masked image, classifier output]
[Figure: L-CAM-Fm training diagram; labels: training image, convolution & pooling layers etc., feature maps of last conv. layer, attention layer, target-class label, explanation, classifier output]
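A minimal NumPy sketch of the attention-layer CAM and of the two masking variants is given below. It is not the released implementation: the array layouts (`A` as P x Q x K, `W` as K x R, `b` of length R) and the function names are assumptions, upscaling is nearest-neighbour for simplicity, and the normalization step named in the training diagram is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_cam(A, W, b, r):
    """L^(r) = sigmoid(sum_k w_k^(r) A[:, :, k] + b^(r) J), shape (P, Q).
    A: (P, Q, K) feature maps; W: (K, R) weights; b: (R,) biases."""
    # Broadcasting the scalar b[r] plays the role of b^(r) times the all-ones J
    return sigmoid(A @ W[:, r] + b[r])

def mask_feature_maps(A, L):
    """L-CAM-Fm: multiply every feature map element-wise with L^(r)."""
    return A * L[:, :, None]

def mask_image(X, L):
    """L-CAM-Img: upscale L^(r) to the image size and mask the image.
    X: (H, W, C) with H, W assumed to be integer multiples of P, Q."""
    ry, rx = X.shape[0] // L.shape[0], X.shape[1] // L.shape[1]
    L_up = np.kron(L, np.ones((ry, rx)))     # nearest-neighbour (assumption)
    return X * L_up[:, :, None]
```

During training only `W` and `b` would be updated; the backbone that produces `A` stays frozen, as stated on the slide.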
9. 9
• Loss function:
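$$\lambda_1 \mathrm{TV}(\mathbf{L}^{(r)}) + \lambda_2 \mathrm{AV}(\mathbf{L}^{(r)}) + \lambda_3 \mathrm{CE}(r, u)$$
Cross entropy loss: $\mathrm{CE}(r, u)$
Energy loss: $\mathrm{AV}(\mathbf{S}) = \frac{1}{PQ} \sum_{p,q} (s_{p,q})^{\lambda_4}$
Variation loss: $\mathrm{TV}(\mathbf{S}) = \sum_{p,q} \left[ (s_{p,q} - s_{p,q+1})^2 + (s_{p,q} - s_{p+1,q})^2 \right]$
u: confidence score derived from the L-CAM network for model-truth class r
P, Q: width, height of FMs
Regularization parameters: $\lambda_1, \lambda_2, \lambda_3, \lambda_4$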
• Overall loss effect: remove spurious/noise areas in the SM and retain the most
salient parts for the classification decision
Learning-based CAM: Training
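The three loss terms can be written out as below. This is a sketch under assumptions rather than the authors' training code: the cross-entropy term is taken as -log(u) for the single class r, the variation term sums only over neighbouring pairs that exist inside the map, and the λ values passed in are placeholders, not the values used in the paper.

```python
import numpy as np

def total_variation(S):
    """TV(S): squared differences between horizontal and vertical neighbours."""
    dq = S[:, :-1] - S[:, 1:]      # s[p, q] - s[p, q+1]
    dp = S[:-1, :] - S[1:, :]      # s[p, q] - s[p+1, q]
    return np.sum(dq ** 2) + np.sum(dp ** 2)

def average_energy(S, lam4):
    """AV(S): mean over all P*Q elements of s[p, q] ** lam4."""
    return np.mean(S ** lam4)

def lcam_loss(L, u, lam1, lam2, lam3, lam4):
    """lam1 * TV(L) + lam2 * AV(L) + lam3 * CE(r, u), with CE assumed to be -log(u)."""
    ce = -np.log(u)
    return lam1 * total_variation(L) + lam2 * average_energy(L, lam4) + lam3 * ce
```

The variation term penalizes fragmented maps and the energy term penalizes large activated areas, while the cross-entropy term keeps the masked input classified as r.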
10. 10
• At the inference stage both L-CAM variants operate identically
• The input image is fed to the network to derive the FMs and the inferred label
• The inferred label is used to select the class-specific weights and bias of the
trained attention layer and compute the explanation
Learning-based CAM: Inference
[Figure: L-CAM inference diagram; labels: input image, convolution & pooling layers etc., feature maps of last conv. layer, classifier output, output class (to be explained), attention layer, explanation]
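The inference steps above amount to one label lookup and one attention-layer evaluation. In the sketch below, `scores` stands in for the classifier output and `A` for the last-layer FMs, both of which would come from the same single forward pass; the function name and array layouts are assumptions.

```python
import numpy as np

def explain(scores, A, W, b):
    """scores: (R,) classifier output; A: (P, Q, K) feature maps;
    W: (K, R) and b: (R,) are the trained attention-layer parameters."""
    r = int(np.argmax(scores))            # inferred label (class to be explained)
    z = A @ W[:, r] + b[r]                # class-specific weighted sum plus bias
    return r, 1.0 / (1.0 + np.exp(-z))    # explanation L^(r), shape (P, Q)
```

Because the FMs are already available from the classification pass, no additional forward or backward pass through the backbone is needed.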
11. 11
• Backbones: VGG-16 (512 x 14 x 14), ResNet-50 (2048 x 7 x 7)
• L-CAM variations: i) L-CAM-Fm/Img: proposed, ii) L-CAM-Fm*/Img*: only CE loss component is
used, iii) L-CAM-Fm† with VGG-16: 7 x 7 FMs after avg pool layer (for compatibility with RISE)
• Comparisons: Grad-CAM, Grad-CAM++, Score-CAM, RISE
• Measures: AD, IC for different thresholds, i.e., ν = 100%, 50%, 15%; Number of FW passes
• Training L-CAM: ImageNet training images (1.3 mil., 1000 classes)
• Testing: 2000 randomly selected images (as in other works), due to the high computational cost
of perturbation-based approaches
Experiments
12. 12
Method (VGG-16)  AD(100%)  IC(100%)  AD(50%)  IC(50%)  AD(15%)  IC(15%)  #FW
Grad-CAM         32.12     22.1      58.65    9.5      84.15    2.2      1
Grad-CAM++       30.75     22.05     54.11    11.15    82.72    3.15     1
Score-CAM        27.75     22.8      45.6     14.1     75.7     4.3      512
RISE             8.74      51.3      42.42    17.55    78.7     4.45     4000
L-CAM-Fm*        20.63     31.05     51.34    13.45    82.4     3.05     1
L-CAM-Fm         16.47     35.4      47       14.45    79.39    3.65     1
L-CAM-Img*       18.01     37.2      50.88    12.05    82.1     3        1
L-CAM-Img        12.96     41.25     45.56    14.9     78.14    4.2      1
L-CAM-Fm†        12.15     40.95     37.37    20.25    74.23    4.45     1
Experimental results
• L-CAM outperforms gradient-based methods and is comparable to perturbation-based ones (but faster: it requires just one forward pass, instead of hundreds or thousands)
• L-CAM-Img is better than L-CAM-Fm, probably because the former directly perturbs the input image
• L-CAM-Img*/Fm* perform worse than the proposed variants, showing that the energy and variation loss components are important
• With 7 x 7 FMs we got the best performance, probably due to the curse of dimensionality (learning-based methods can more easily learn the FMs’ combination in a lower-dimensional space)
Method (ResNet-50)  AD(100%)  IC(100%)  AD(50%)  IC(50%)  AD(15%)  IC(15%)  #FW
Grad-CAM            13.61     38.01     29.28    23.05    78.61    3.4      1
Grad-CAM++          13.63     37.95     30.37    23.45    79.58    3.4      1
Score-CAM           11.01     39.55     26.8     24.75    78.72    3.6      2048
RISE                11.12     46.15     36.31    21.55    82.05    3.2      8000
L-CAM-Fm*           14.44     35.45     32.18    20.5     80.66    2.9      1
L-CAM-Fm            12.16     40.2      29.44    23.4     78.64    4.1      1
L-CAM-Img*          15.93     32.8      39.9     14.85    84.67    2.25     1
L-CAM-Img           11.09     43.75     29.12    24.1     79.41    3.9      1
13. 13
Explanation examples
• Can produce class-specific SMs
• Can correctly identify the contribution of two instances of the same class
[Figure: input image with class-specific SMs for “Maltese” and “soccer ball”]
[Figure: input image with its SM; ground truth: porcupine]
14. 14
Explanation examples
• Misclassification due to correlated classes
[Figure: input image; ground truth: coach, predicted: minibus]
• Misclassification due to the presence of two different ImageNet classes in the image
[Figure: input image; ground truth: tennis ball, predicted: Tibetan terrier]
15. 15
Explanation examples
• Why did the classifier not classify the second image correctly?
[Figure: first input image; predicted = ground truth: rugby ball]
[Figure: second input image; ground truth: rugby ball, predicted: football helmet]
16. 16
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/L-CAM
This work was supported by the EU’s Horizon 2020 research and innovation programme under grant
agreements 101021866 CRiTERIA and 951911 AI4Media