This document summarizes image segmentation techniques using deep learning. It begins with an overview of semantic segmentation and instance segmentation. It then discusses several techniques for semantic segmentation, including deconvolution/transposed convolution for learnable upsampling, skip connections to combine predictions from different CNN depths, and dilated convolutions to increase the receptive field without losing resolution. For instance segmentation, it covers proposal-based methods like Mask R-CNN, and single-shot and recurrent approaches as alternatives to proposal-based models.
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonDL 2020
1. Image Segmentation with Deep Learning
Xavier Giro-i-Nieto
UPC & BSC Barcelona
Carles Ventura
UOC Barcelona
2. Xavier Giro-i-Nieto
Associate Professor at Universitat Politecnica
de Catalunya (UPC) in Barcelona, Catalonia.
IDEAI Center for Intelligent Data Science & Artificial Intelligence
@DocXavi
xavier.giro@upc.edu
6. Acknowledgements
Amaia Salvador
amaia.salvador@upc.edu
PhD Candidate
Universitat Politècnica de Catalunya
[DLCV 2016]
Verónica Vilaplana
veronica.vilaplana@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
[DLCV 2017]
Míriam Bellver
miriam.bellver@bsc.edu
PhD Candidate
Barcelona Supercomputing Center
[DLCV 2018]
7. From image to pixel classification (segmentation)
Slide inspired by cs231n lecture from Stanford University.
Image Segmentation: “chair”, “bin”
Object Detection: “chair”, “bin”
Image Classification: “chair”, “bin”
15. From Image to Pixel Classification (Segmentation)
16. From Image to Pixel Classification (Segmentation)
Naive approach: Train a sliding window classifier.
Extract patch → Run through a CNN → Classify center pixel (“COW”) → Repeat for every pixel
Slide: CS231n (Stanford University)
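The naive sliding window approach can be sketched in a few lines. This is only an illustration: `classify_patch` is a hypothetical stand-in for a trained CNN (it simply thresholds the mean brightness), and the 3 x 3 patch size and 6 x 6 toy image are assumptions made for the example.

```python
import numpy as np

def classify_patch(patch):
    """Hypothetical stand-in for a CNN: label 1 if the patch is bright on average."""
    return int(patch.mean() > 0.5)

def sliding_window_segmentation(image, patch_size=3):
    """Label every pixel by classifying the patch centred on it."""
    half = patch_size // 2
    # Zero-pad so that border pixels also get a full patch.
    padded = np.pad(image, half, mode="constant")
    labels = np.zeros(image.shape, dtype=int)
    for row in range(image.shape[0]):
        for col in range(image.shape[1]):
            patch = padded[row:row + patch_size, col:col + patch_size]
            labels[row, col] = classify_patch(patch)  # one classifier call per pixel
    return labels

image = np.zeros((6, 6))
image[:, 3:] = 1.0  # the right half is the bright "object"
labels = sliding_window_segmentation(image)
print(labels)
```

The classifier runs once per pixel (36 times for this toy image), and neighbouring patches overlap almost entirely, which is why the approach is expensive on real images.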
18. From Global to Local-scale Image Classification
Convolutionize: Run “fully convolutional” network to get all pixels at once.
Slide: CS231n (Stanford University)
20. From Global to Local-scale Image Classification
Convolutionize: Formulate each neuron in a fully connected (FC) layer as a convolutional filter (kernel) of a convolutional layer:
Input: 3x2x2 tensor (RGB image of 2x2)
FC layer: 2 fully connected neurons, 3x2x2 * 2 weights
Conv layer: 2 convolutional filters of 3 x 2 x 2 (same size as input tensor), 3x2x2 * 2 weights
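A small numerical check of this reformulation, using the sizes from the slide (a 3x2x2 input and 2 neurons). The random weights and variable names are assumptions made for the example, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 2, 2))         # 3x2x2 input tensor (RGB image of 2x2)
fc_weights = rng.normal(size=(2, 12))  # 2 FC neurons, 3*2*2 = 12 weights each

# Fully connected layer: flatten the input, one dot product per neuron.
fc_out = fc_weights @ x.reshape(-1)

# The same weights viewed as 2 convolutional filters of size 3x2x2.
conv_filters = fc_weights.reshape(2, 3, 2, 2)
# Each filter covers the whole input, so there is a single valid position.
conv_out = np.array([(f * x).sum() for f in conv_filters])

print(np.allclose(fc_out, conv_out))  # True
```

Both layers hold the same 3x2x2 * 2 = 24 weights; only the way they are indexed changes.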
21. From Global to Local-scale Image Classification
A model trained for image classification on low-definition images can provide local responses when fed with high-definition images.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified)
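To illustrate the idea, the sketch below applies the same filters to a small and to a larger input. The helper `conv_valid`, the random filters, and the 2x2 and 5x5 input sizes are assumptions chosen for the example.

```python
import numpy as np

def conv_valid(x, filters):
    """Valid cross-correlation, stride 1. x: (C, H, W); filters: (K, C, kh, kw)."""
    num_filters, _, kh, kw = filters.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((num_filters, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = (filters * window).sum(axis=(1, 2, 3))
    return out

rng = np.random.default_rng(0)
filters = rng.normal(size=(2, 3, 2, 2))  # 2 class filters sized for 2x2 RGB inputs
small = rng.normal(size=(3, 2, 2))       # input at the "training" resolution
large = rng.normal(size=(3, 5, 5))       # higher-definition input

small_scores = conv_valid(small, filters)  # shape (2, 1, 1): one global prediction
large_scores = conv_valid(large, filters)  # shape (2, 4, 4): a map of local predictions
print(small_scores.shape, large_scores.shape)
```

Each position of the larger output equals the score the classifier would give to the corresponding 2x2 crop, so the output can be read as a coarse heatmap per class.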
22. From Global to Local-scale Image Classification
Convolutionize: Run “fully convolutional” network to get all pixels at once...
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR 2015. (original figure has been modified)
23. From Global to Local-scale Image Classification
The FC to Conv redefinition allows generating heatmaps of the class prediction over the input images.
Campos, V., Jou, B., & Giro-i-Nieto, X. From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction. Image and Vision Computing. (2017)
24. From Global to Local-scale Image Classification
Limitation: Pooling layers in the CNN will decrease the spatial definition of the output.
Figure: Alicja Kwasniewska (ISSonDL 2020)
25. From Global to Local-scale Image Classification
Limitation: Pooling layers in the CNN will decrease the spatial definition of the output.
Slide concept: CS231n (Stanford University)
31. Semantic Segmentation
Limitation of convolutionizing CNNs for image classification: Pooling layers in the CNN will decrease the spatial definition of the output.
Slide concept: CS231n (Stanford University)
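A quick arithmetic sketch of this limitation. The 224-pixel input and the five stride-2 pooling layers are assumed values (typical of VGG-style classifiers), not figures taken from the slides.

```python
def output_size_after_pooling(input_size, num_pool_layers, pool_stride=2):
    """Spatial size left after a stack of pooling layers (integer division per layer)."""
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool_stride
    return size

input_size = 224  # assumed input resolution
coarse = output_size_after_pooling(input_size, num_pool_layers=5)
print(coarse)                # 7
print(input_size // coarse)  # 32: each output cell summarizes a 32 x 32 pixel area
```

Under these assumptions the prediction map is 32 times coarser than the image along each axis, which motivates the upsampling techniques that follow.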
38. Reminder: Convolutional Layer
Typical 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4 Output: 2 x 2
Dot product between filter and input
Slide credit: CS231n (Stanford University)
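The configuration on this slide can be reproduced with a minimal NumPy sketch. The function `conv2d`, the all-ones kernel, and the toy input values are assumptions for illustration; the function handles square, single-channel inputs only.

```python
import numpy as np

def conv2d(x, kernel, stride=1, pad=0):
    """Strided 2D cross-correlation on a square, single-channel input."""
    k = kernel.shape[0]
    padded = np.pad(x, pad, mode="constant")
    out_size = (x.shape[0] + 2 * pad - k) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = padded[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = (window * kernel).sum()  # dot product between filter and input
    return out

x = np.arange(16, dtype=float).reshape(4, 4)  # 4 x 4 input
kernel = np.ones((3, 3))                      # 3 x 3 filter
y = conv2d(x, kernel, stride=2, pad=1)
print(y.shape)  # (2, 2)
```

The output size follows (4 + 2*1 - 3) // 2 + 1 = 2, matching the 4 x 4 to 2 x 2 reduction on the slide.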
40. Learnable upsampling with Transposed Convolutions
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Slide credit: CS231n (Stanford University)
42. Learnable Upsample: Transposed Convolution
3 x 3 “deconvolution”, stride 2 pad 1
Input: 2 x 2 Output: 4 x 4
Input gives weight for filter values
Sum where output overlaps
Slide Credit: CS231n
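The two steps on the slide (each input value weights a copy of the filter; overlapping outputs are summed) can be sketched as follows. This is an illustrative implementation, not the lecture's code. The `output_padding` argument is an assumption borrowed from the convention of common deep learning libraries: with a 3 x 3 filter, stride 2 and pad 1, one extra row and column are needed to reach the 4 x 4 output shown.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2, pad=1, output_padding=1):
    """Transposed convolution on a square, single-channel input."""
    n, k = x.shape[0], kernel.shape[0]
    full = (n - 1) * stride + k + output_padding
    canvas = np.zeros((full, full))
    for i in range(n):
        for j in range(n):
            # The input value gives the weight for a copy of the filter;
            # copies are summed where they overlap on the canvas.
            canvas[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    # Undo the padding of the corresponding forward convolution.
    return canvas[pad:full - pad, pad:full - pad]

x = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 x 2 input
kernel = np.ones((3, 3))                # in practice these weights are learned
y = transposed_conv2d(x, kernel)
print(y.shape)  # (4, 4)
print(y[1, 1])  # 10.0: all four filter copies overlap here (1 + 2 + 3 + 4)
```

The output size is (2 - 1) * 2 - 2 * 1 + 3 + 1 = 4 along each axis.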
44. Learnable Upsample
Limitation of upsampling from deep CNN layers: Deeper layers are specialized for higher-level semantic tasks, not for capturing fine-grained details required for segmentation.
Figure: Highest activations along CNN depth
46. Skip connections to intermediate layers
#U-Net Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI 2015
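A minimal sketch of a U-Net-style skip connection: upsampled deep features are concatenated channel-wise with features from an earlier layer of matching resolution. The tensor shapes are assumed for the example, and nearest-neighbour upsampling is swapped in for the learned up-convolution used in the paper to keep the code short.

```python
import numpy as np

def upsample_nearest(features, factor=2):
    """(C, H, W) -> (C, H*factor, W*factor) by repeating values."""
    return features.repeat(factor, axis=1).repeat(factor, axis=2)

rng = np.random.default_rng(0)
shallow = rng.normal(size=(8, 16, 16))  # early layer: fine spatial detail
deep = rng.normal(size=(32, 8, 8))      # deep layer: semantic, low resolution

upsampled = upsample_nearest(deep)                     # (32, 16, 16)
merged = np.concatenate([shallow, upsampled], axis=0)  # (40, 16, 16)
print(merged.shape)
```

Subsequent convolutions on `merged` can then combine the semantic content of the deep layer with the fine detail of the shallow one.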
49. Dilated Convolutions
- By adding more layers (with the dilation doubled at each layer):
  - The receptive field grows exponentially.
  - The number of learnable parameters (filter weights) grows linearly.
Yu, F., & Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. ICLR 2016.
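The growth rates can be checked with a short calculation. The sketch assumes 3 x 3 filters, a single channel, no bias, and dilations of 1, 2, 4, 8 (doubling per layer); the function names are made up for the example.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

def num_weights(num_layers, kernel_size=3):
    """Filter weights for a single-channel stack without biases."""
    return num_layers * kernel_size * kernel_size

for n in range(1, 5):
    dilated = receptive_field([2 ** i for i in range(n)])
    plain = receptive_field([1] * n)
    print(n, num_weights(n), dilated, plain)
# layers, weights, dilated RF, undilated RF
# 1  9   3  3
# 2 18   7  5
# 3 27  15  7
# 4 36  31  9
```

With doubling dilations the receptive field is 2^(n+1) - 1 after n layers, while the weight count is the same as for the undilated stack.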
57. Proposal-based
Similar to R-CNN, but with segment proposals
Figure: External segment proposals; mask out background with mean image
Hariharan et al. Simultaneous Detection and Segmentation. ECCV 2014
Slide Credit: CS231n
58. Proposal based: Detection - Faster R-CNN
Learn proposals end-to-end sharing parameters with the classification network
Figure: Conv layers (Conv5_3) → Region Proposal Network → RPN Proposals → RoI Pooling → FC6 → FC7 → FC8 → Class probabilities
Ren et al. Faster R-CNN: Towards real-time object detection with region proposal networks. NIPS 2015
59. Proposal-based Instance Segmentation: Mask R-CNN
Faster R-CNN for Pixel Level Segmentation as a parallel prediction of masks and class labels
He et al. Mask R-CNN. ICCV 2017
60. Mask R-CNN
He et al. Mask R-CNN. ICCV 2017
Figure panels: Object Detection; Object Detection and Segmentation
61. Mask R-CNN: RoI Align
RoI Pool from Fast R-CNN
Hi-res input image: 3 x 800 x 600 with region proposal
→ Convolution and Pooling
→ Hi-res conv features: C x H x W with region proposal
→ Max-pool within each grid cell
→ RoI conv features: C x h x w for region proposal
→ Fully-connected layers (expect low-res conv features: C x h x w)
x/16 & rounding → misalignment! + not differentiable
He et al. Mask R-CNN. ICCV 2017
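The misalignment can be made concrete with a toy example. The sketch assumes a feature stride of 16, a hypothetical proposal corner at pixel (x=40, y=24), and a synthetic 5 x 5 feature map; it contrasts the quantized lookup of RoI Pool with bilinear sampling at the fractional location, which is the idea behind RoI Align.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a fractional location (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

stride = 16
feature_map = np.arange(25, dtype=float).reshape(5, 5)  # synthetic features
x_image, y_image = 40.0, 24.0                           # hypothetical proposal corner (pixels)

# RoI Pool: divide by the stride and quantize to an integer cell.
x_q, y_q = int(x_image // stride), int(y_image // stride)  # (2, 1)
pooled = feature_map[y_q, x_q]

# RoI Align: keep the fractional coordinate and interpolate.
x_f, y_f = x_image / stride, y_image / stride              # (2.5, 1.5)
aligned = bilinear_sample(feature_map, y_f, x_f)

misalignment_px = (x_f - x_q) * stride
print(pooled, aligned, misalignment_px)  # 7.0 10.0 8.0
```

In this example the quantized lookup is off by 8 image pixels along x, which matters little for classification but is significant for pixel-accurate masks.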
63. Limitations of Proposal-based models
1. Two objects might share the same bounding box: Only one will be kept after the NMS step.
2. Choice of NMS threshold is application dependent
3. Same pixel can be assigned to multiple instances
4. Number of predictions is limited by the number of proposals.
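The first limitation can be reproduced with a minimal greedy non-maximum suppression (NMS) sketch. The boxes, scores, and the 0.5 threshold are made-up values for illustration, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: keep a box only if it does not overlap a higher-scoring kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep

# Two different objects that happen to share the same bounding box, plus a third one.
boxes = [(10, 10, 50, 50), (10, 10, 50, 50), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the second object is suppressed
```

Raising the threshold keeps more overlapping detections but also more duplicates, which is why the choice depends on the application.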
64. Single-shot Instance Segmentation
- Improving RetinaNet (single-shot object detector) in three ways:
  - Integrating instance mask prediction
  - Making the loss function adaptive and more stable
  - Including hard examples in training
#RetinaMask Fu et al. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. ArXiv 2019
65. CNN → Cat
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
78. Segmentation Datasets
COCO Common Objects in Context
- Real indoor & outdoor scenes
- 80 categories
- +300,000 images
- 2M instances
- Partial annotations
- Semantic segmentation GT
- Instance segmentation GT
- Objects, but no stuff
ADE20K
- Real general scenes
- +150 categories
- +22,000 images
- Semantic segmentation GT
- Instance + parts segmentation GT
- Objects and stuff
79. Segmentation Datasets
Open Images V6
- Real general scenes
- 350 categories
- +950,000 images
- 2,700,000 instance segmentations
- Instance segmentation GT
- Objects
80. Segmentation Datasets
LVIS
- Real general scenes
- 1,000 categories
- 164,000 images
- 2,200,000 instance segmentations
- 11.2 object instances from 3.4 categories on average per image (more complex images than Open Images and MS COCO)
- Instance segmentation GT
- Objects
81. Segmentation Datasets
CityScapes
- Real driving scenes
- 30 categories
- +25,000 images
- 20,000 partial annotations
- 5,000 dense annotations
- Semantic segmentation GT
- Instance segmentation GT
- Depth, GPS and other metadata
- Objects and stuff
Mapillary Vistas Dataset
- Real driving scenes covering 6 continents with variety of weather/season/time of day/camera/viewpoint
- 152 categories
- 25,000 images
- Semantic segmentation GT
- Instance + parts segmentation GT
- Objects and stuff