Learning visual explanations for DCNN-based image classifiers
using an attention mechanism
Ioanna Gkartzonika, Nikolaos Gkalelis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
European Conference on Computer Vision (ECCV),
Vision with Biased or Scarce data (VBSD),
October 2022
2
• Deep learning (DL) models for image classification have become very successful
• However, they are highly complex and their decisions are difficult to interpret
• E.g.: VGG-16 trained to classify images into one of the ImageNet categories
Introduction
[Figures: two images fed to the VGG-16 classifier — first image: prediction: rugby ball (= ground truth); second image: ground truth: rugby ball, prediction: football helmet]
• Why is the first image correctly classified (although the rugby ball is barely visible), while the second one is misclassified as “football helmet”?
3
• Goal: produce a saliency map (SM) depicting the image regions that explain the
decision of the classifier
Introduction
[Figure: input image and corresponding SM]
• Post-hoc approaches: uncover the inference mechanism of a trained model
• Different from methods that jointly train the classifier and an explanation
mechanism!
4
• AI explanation should not be confused with weakly-supervised localization tasks!
• E.g.: the SMs below fail to accurately localize the objects of interest
• However, these are good SMs explaining the decision strategy of the classifier
Introduction
[Figures: input image (ground truth: padlock) with the SM superimposed — the classifier recognizes class “padlock” by looking at both the padlock and the padlock’s chains; input image (ground truth: snowmobile) with the SM superimposed — the human together with the snowmobile helps the classifier make its decision]
5
• Measure pixel-wise contribution to the classification confidence score
• Average Drop (AD): the average drop of the model’s confidence score when the masked test images are used:

$$\mathrm{AD}(\nu\%) = \frac{1}{\Upsilon}\sum_{i=1}^{\Upsilon} \frac{\max\bigl(0,\; f(\mathbf{X}_i) - f(\mathbf{X}_i \odot \varphi_\nu(\mathbf{V}_i))\bigr)}{f(\mathbf{X}_i)} \cdot 100$$

• Increase in Confidence (IC): the fraction of test images for which the model’s confidence score increases when the masked images are used:

$$\mathrm{IC}(\nu\%) = \frac{1}{\Upsilon}\sum_{i=1}^{\Upsilon} \delta\bigl(f(\mathbf{X}_i \odot \varphi_\nu(\mathbf{V}_i)) > f(\mathbf{X}_i)\bigr) \cdot 100$$

φν: threshold function that keeps the ν% highest-valued pixels of its input
Xi, Vi: i-th input image and corresponding SM; Υ: number of test images
Evaluation measures
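To make the two measures concrete, here is a minimal PyTorch sketch of how AD and IC could be computed over a batch of test images. The function name `ad_ic` and all variable names are illustrative, and the exact form of φν (a binary mask over the ν% highest-valued SM pixels) is an assumption consistent with the definitions above.

```python
import torch

def ad_ic(model, images, sal_maps, labels, nu=0.15):
    """Sketch of Average Drop (AD) and Increase in Confidence (IC) at threshold nu.

    images:   (B, 3, H, W) test images X_i
    sal_maps: (B, 1, H, W) saliency maps V_i (values in [0, 1])
    labels:   (B,) class indices whose confidence f(.) is measured
    """
    with torch.no_grad():
        # f(X_i): confidence for the explained class on the original images
        f_x = model(images).softmax(1).gather(1, labels[:, None]).squeeze(1)

        # phi_nu(V_i): keep only the nu% highest-valued saliency pixels
        flat = sal_maps.flatten(1)
        k = max(1, int(nu * flat.size(1)))
        kth = flat.topk(k, dim=1).values[:, -1]          # per-image threshold
        mask = (flat >= kth[:, None]).float().view_as(sal_maps)

        # f(X_i ⊙ phi_nu(V_i)): confidence on the masked images
        f_xm = model(images * mask).softmax(1).gather(1, labels[:, None]).squeeze(1)

    ad = (torch.clamp(f_x - f_xm, min=0) / f_x).mean().item() * 100
    ic = (f_xm > f_x).float().mean().item() * 100
    return ad, ic
```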
6
• Given K feature maps (FMs) and the class label r derived by the classifier for the given input image, utilize class-specific weights $w_1^{(r)}, w_2^{(r)}, \dots, w_K^{(r)}$
• Compute the weighted sum of FMs to derive the class activation map (CAM)
• Normalize (e.g. min-max) and upscale CAM to get SM
Related work: the general approach
[Diagram: the input image is passed through the network to obtain the K feature maps; these are weighted by $w_1^{(r)}, w_2^{(r)}, \dots, w_K^{(r)}$ and summed into the CAM, which is normalized & upscaled to give the SM]
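As a sketch of this generic pipeline (weighted FM sum → min-max normalization → upscaling), assuming PyTorch tensors and bilinear upscaling; names and the output size are illustrative:

```python
import torch
import torch.nn.functional as F

def cam_to_sm(fms, weights, out_size=(224, 224)):
    """Generic CAM pipeline: weighted sum of FMs, min-max normalize, upscale.

    fms:     (K, P, Q) feature maps of the last conv. layer
    weights: (K,) class-specific weights w_1^(r), ..., w_K^(r)
    """
    cam = (weights[:, None, None] * fms).sum(0)                # (P, Q) CAM
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # min-max norm
    sm = F.interpolate(cam[None, None], size=out_size,         # upscale to
                       mode='bilinear', align_corners=False)   # image size
    return sm[0, 0]                                            # (H, W) SM
```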
7
• Gradient-based methods: gradients backpropagated from the output to compute
the weights and produce the SM – gradients are noisy
• Perturbation-based methods: forward pass the input image perturbed by the k-th
FM; the derived score is used as weight for the corresponding FM – needs K
forward passes to produce the SM
• The above methods compute the weights and produce the SM at the inference stage
• Can we use a training dataset to learn to produce class-specific SMs?
Related work: main categories
8
• An attention layer of K × R weights and R biases is introduced; the CAM is computed as

$$\mathbf{L}^{(r)} = \sigma\Bigl(\sum_{k=1}^{K} w_k^{(r)}\, \mathbf{A}_{:,:,k} + b^{(r)}\, \mathbf{J}\Bigr)$$

σ(): element-wise sigmoid function
A:,:,k: k-th FM of the last conv. layer
J: all-ones matrix of the same size as the A:,:,k’s
• A training set of R classes is used to train the attention layer; the original backbone is frozen
• L-CAM-Fm: L^(r) multiplies each FM
• L-CAM-Img: L^(r) is upscaled and multiplied with the input image
Learning-based CAM: Training
[Diagrams: in L-CAM-Fm, the training image passes through the frozen convolution & pooling layers; the attention layer, given the target-class label, produces L^(r) from the feature maps of the last conv. layer and multiplies those feature maps before the classifier output, with L^(r) serving as the explanation. In L-CAM-Img, L^(r) is instead normalized & upscaled and multiplied with the training image, and the masked image is fed back through the whole network.]
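A possible PyTorch rendering of the attention layer: a 1×1 convolution holds the K × R weights and R biases. All names are illustrative, and this is a sketch of the description above, not the authors' released code (see the GitHub link on the last slide for that).

```python
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    """L^(r) = sigma( sum_k w_k^(r) A_:,:,k + b^(r) J ), for a chosen class r."""

    def __init__(self, num_fms: int, num_classes: int):
        super().__init__()
        # A 1x1 conv is exactly a per-class weighted sum over the K FMs plus a bias
        self.weights = nn.Conv2d(num_fms, num_classes, kernel_size=1, bias=True)

    def forward(self, fms: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # fms: (B, K, P, Q) feature maps; r: (B,) target-class indices
        cams = self.weights(fms)                              # (B, R, P, Q)
        idx = r[:, None, None, None].expand(-1, 1, *cams.shape[2:])
        return torch.sigmoid(cams.gather(1, idx))             # (B, 1, P, Q)
```

During training, L-CAM-Fm would multiply `fms` element-wise by this output before the remaining layers, while L-CAM-Img would upscale it to the input resolution and multiply the training image; the backbone stays frozen and only this layer's parameters are updated.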
9
• Loss function:

$$\lambda_1\, \mathrm{TV}(\mathbf{L}^{(r)}) + \lambda_2\, \mathrm{AV}(\mathbf{L}^{(r)}) + \lambda_3\, \mathrm{CE}(r, u)$$

Cross-entropy loss: CE(r, u)
Energy loss: $\mathrm{AV}(\mathbf{S}) = (PQ)^{-1} \sum_{p,q} (s_{p,q})^{\lambda_4}$
Variation loss: $\mathrm{TV}(\mathbf{S}) = \sum_{p,q}\bigl[(s_{p,q} - s_{p,q+1})^2 + (s_{p,q} - s_{p+1,q})^2\bigr]$
u: confidence score derived from the L-CAM network for the ground-truth class r
P, Q: width and height of the FMs
Regularization parameters: λ1, λ2, λ3, λ4
• Overall loss effect: remove spurious/noise areas in the SM and retain the most
salient parts for the classification decision
Learning-based CAM: Training
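A sketch of this loss in PyTorch, directly transcribing the three terms; the λ values shown are placeholders, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def lcam_loss(l_r, logits, r, lambdas=(0.01, 1.0, 1.0, 0.3)):
    """lambda1*TV(L^(r)) + lambda2*AV(L^(r)) + lambda3*CE(r, u).

    l_r:    (B, 1, P, Q) maps L^(r), values in [0, 1] after the sigmoid
    logits: (B, R) classifier output of the masked forward pass
    r:      (B,) target-class indices
    """
    l1, l2, l3, l4 = lambdas
    # Variation loss TV: sum of squared differences between neighboring pixels,
    # penalizing abrupt spatial changes (noisy SM regions)
    tv = (((l_r[..., :, 1:] - l_r[..., :, :-1]) ** 2).sum(dim=(-2, -1))
          + ((l_r[..., 1:, :] - l_r[..., :-1, :]) ** 2).sum(dim=(-2, -1))).mean()
    # Energy loss AV: mean of exponentiated activations, shrinking the SM area
    av = (l_r ** l4).mean()
    # Cross-entropy CE keeps the masked forward pass classifiable as class r
    ce = F.cross_entropy(logits, r)
    return l1 * tv + l2 * av + l3 * ce
```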
10
• At the inference stage, both L-CAM variants operate identically
• The input image is fed to the network to derive the FMs and the inferred label
• The inferred label is used to select the class-specific weights and bias of the trained attention layer and to compute the explanation
Learning-based CAM: Inference
[Diagram: the input image passes through the convolution & pooling layers; the feature maps of the last conv. layer feed the classifier output, and the output class (to be explained) selects the attention-layer weights that produce the explanation]
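Putting the pieces together, inference could look like the sketch below. It reuses `AttentionLayer` from the earlier sketch; the split of the network into `features` and `classifier` parts, and all names, are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def explain(features, classifier, attn, image, out_size=(224, 224)):
    """One forward pass: infer the label, then build its explanation."""
    fms = features(image[None])                    # (1, K, P, Q) last-layer FMs
    pred = classifier(fms).argmax(dim=1)           # inferred label (to explain)
    l_r = attn(fms, pred)                          # (1, 1, P, Q) attention map
    sm = (l_r - l_r.min()) / (l_r.max() - l_r.min() + 1e-8)   # normalize
    return F.interpolate(sm, size=out_size, mode='bilinear',
                         align_corners=False)[0, 0]           # (H, W) SM
```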
11
• Backbones: VGG-16 (512 x 14 x 14), ResNet-50 (2048 x 7 x 7)
• L-CAM variations: i) L-CAM-Fm/Img: proposed, ii) L-CAM-Fm*/Img*: only CE loss component is
used, iii) L-CAM-Fm† with VGG-16: 7 x 7 FMs after avg pool layer (for compatibility with RISE)
• Comparisons: Grad-CAM, Grad-CAM++, Score-CAM, RISE
• Measures: AD and IC at different thresholds, i.e., ν = 100%, 50%, 15%; number of forward (FW) passes
• Training L-CAM: ImageNet training images (1.3 million, 1000 classes)
• Testing: 2000 randomly selected images (as in other works), due to the high computational cost of perturbation-based approaches
Experiments
12
VGG-16 AD(100%) IC(100%) AD(50%) IC(50%) AD(15%) IC(15%) #FW
Grad-CAM 32.12 22.1 58.65 9.5 84.15 2.2 1
Grad-CAM++ 30.75 22.05 54.11 11.15 82.72 3.15 1
Score-CAM 27.75 22.8 45.6 14.1 75.7 4.3 512
RISE 8.74 51.3 42.42 17.55 78.7 4.45 4000
L-CAM-Fm* 20.63 31.05 51.34 13.45 82.4 3.05 1
L-CAM-Fm 16.47 35.4 47 14.45 79.39 3.65 1
L-CAM-Img* 18.01 37.2 50.88 12.05 82.1 3 1
L-CAM-Img 12.96 41.25 45.56 14.9 78.14 4.2 1
L-CAM-Fm† 12.15 40.95 37.37 20.25 74.23 4.45 1
Experimental results
• L-CAM outperforms gradient-based methods and is comparable to perturbation-based ones, while being much faster: it requires just one forward pass instead of hundreds or thousands
• L-CAM-Img is better than L-CAM-Fm, probably because the former perturbs the input image directly
• L-CAM-Fm*/Img* perform worse than the proposed variants, showing that the energy and variation loss components are important
• The best performance is obtained with 7 x 7 FMs, probably due to the curse of dimensionality (learning-based methods can learn the FMs’ combination more easily in a lower-dimensional space)
ResNet-50 AD(100%) IC(100%) AD(50%) IC(50%) AD(15%) IC(15%) #FW
Grad-CAM 13.61 38.01 29.28 23.05 78.61 3.4 1
Grad-CAM++ 13.63 37.95 30.37 23.45 79.58 3.4 1
Score-CAM 11.01 39.55 26.8 24.75 78.72 3.6 2048
RISE 11.12 46.15 36.31 21.55 82.05 3.2 8000
L-CAM-Fm* 14.44 35.45 32.18 20.5 80.66 2.9 1
L-CAM-Fm 12.16 40.2 29.44 23.4 78.64 4.1 1
L-CAM-Img* 15.93 32.8 39.9 14.85 84.67 2.25 1
L-CAM-Img 11.09 43.75 29.12 24.1 79.41 3.9 1
13
Explanation examples
• Can produce class-specific SMs
• Can correctly identify the contribution of two instances of the same class
[Figures: input image (ground truth: porcupine); input image with class-specific SMs for “Maltese” and “soccer ball”]
14
Explanation examples
• Misclassification due to correlated classes
[Figure: input image — ground truth: coach, predicted: minibus]
• Misclassification due to the presence of two different ImageNet classes in the image
[Figure: input image — ground truth: tennis ball, predicted: Tibetan terrier]
15
Explanation examples
• Why did the classifier not classify the second image correctly?
[Figures: input image — predicted = ground truth: rugby ball; input image — ground truth: rugby ball, predicted: football helmet]
16
Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at:
https://github.com/bmezaris/L-CAM
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreements 101021866 (CRiTERIA) and 951911 (AI4Media).