Modern Convolutional Neural Network
techniques for image segmentation
Deep Learning Journal Club
Gioele Ciaparrone
Michele Curci
November 30, 2016
University of Salerno
Index
1. Introduction
2. The Inception architecture
3. Fully convolutional networks
4. Hypercolumns
5. Conclusion
Introduction
CNN recap
• Sequence of convolutional and pooling layers
• Rectifier activation function
• Fully connected layers at the end
• Softmax function for classification (a minimal sketch follows)
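To make the recap concrete, below is a minimal sketch of such a network in PyTorch; the library choice and all layer sizes are illustrative assumptions, not taken from the works discussed here.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: conv/pool stack, rectifier activations, FC classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(),  # convolution + rectifier
            nn.MaxPool2d(2),                             # pooling
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 4 * 4, num_classes)  # fully connected layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

logits = TinyCNN()(torch.randn(1, 1, 28, 28))  # e.g. an MNIST-sized input
probs = torch.softmax(logits, dim=1)           # softmax for classification
```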
Convolution I
Convolution II
Valid padding (left) and same padding (right) convolutions
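The size effect of the two padding modes is easy to check; a small sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 6, 6)                        # a 6x6 single-channel input
valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid padding: output shrinks
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same padding: size preserved
print(valid(x).shape)  # torch.Size([1, 1, 4, 4])
print(same(x).shape)   # torch.Size([1, 1, 6, 6])
```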
LeNet-5 (1989-1998)
• First CNN (1989) proven to work well, used for handwritten ZIP code recognition [1]
• Refined through the years until the LeNet-5 version (1998) [2]
LeNet-5 interactive visualization [3]
It’s possible to interact with the network in 3D: draw a digit to be classified, click on the neurons to inspect their parameters and connected units, or rotate and zoom the network:
http://scs.ryerson.ca/~aharley/vis/conv/
AlexNet (2012) [5]
• After a long hiatus in which deep learning was ignored [4], neural networks received attention once again when Alex Krizhevsky overwhelmingly won the ILSVRC in 2012 with AlexNet
• Structure very similar to LeNet-5, but with some new key insights:
very efficient GPU implementation, ReLU neurons and dropout
The Inception architecture
Motivations
• Increasing model size tends to improve quality
• More computational resources are needed
• Computational efficiency and low parameter count are still important
• Mobile vision and embedded systems
• Big Data
Going Deeper with Convolutions [6]
• The Inception module solves this problem by making better use of the computing resources
• Proposed in 2014 by Christian Szegedy and other Google researchers
• Used in the GoogLeNet architecture that won both the ILSVRC 2014 classification and detection challenges
Inception module I
• Visual information is processed at various scales and then aggregated
• Since pooling operations are beneficial in CNNs, a parallel pooling
path has been added
• Problems:
• 3x3 and 5x5 convolutions can be very expensive on top of a layer
with lots of filters
• The number of filters substantially increases for each Inception layer added, leading to a computational blow-up
Inception module II
• Adding the 1x1 convolutions before the bigger convolutions reduces
dimensionality
• The same is done after the pooling layer (see the sketch below)
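A minimal sketch of such a module, assuming PyTorch; the per-branch filter counts are illustrative, not the values used in GoogLeNet.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 / pooling branches with 1x1 reductions."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)        # plain 1x1 path
        self.branch3 = nn.Sequential(                 # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 32, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1))
        self.branch5 = nn.Sequential(                 # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(             # pooling, then 1x1 reduction
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Process at several scales in parallel, then concatenate the channels
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

y = InceptionModule(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 192, 28, 28]) -- 64 + 64 + 32 + 32 channels
```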
GoogLeNet I
• GoogLeNet is a particular incarnation of the Inception architecture
• 22 convolutional layers (27 including pooling)
• 9 Inception modules
• 2 auxiliary classifiers, to counter the vanishing gradient problem and for regularization
• Designed with computational efficiency in mind
• Inference can be run on devices with limited computational
resources, especially memory
• 7 of these networks used in an ensemble for the ILSVRC 2014
classification task
GoogLeNet II
GoogLeNet III
GoogLeNet - Training
• Trained with the DistBelief distributed machine learning system
• Asynchronous stochastic gradient descent with 0.9 momentum
• Image sampling methods changed many times in the months leading up to the competition
• Already-converged models were further trained with other options
• Models were trained on crops of different sizes
• There is no definitive guidance on the single most effective way to train these networks
GoogLeNet - ILSVRC 2014 Results
Classification (above) and object detection (below) results.
DeepDream
Google’s DeepDream uses a GoogLeNet to produce “machine dreams”
Inception-v2 and Inception-v3
• The Inception module authors later presented new optimized
versions of the architecture, called Inception-v2 and Inception-v3 [7]
• They managed to significantly improve on the GoogLeNet ILSVRC 2014 results
• The improvements were based on various key principles:
• Avoid representational bottlenecks
• Spatial aggregation over lower-dimensional embeddings doesn’t usually cause significant losses in representational power
• Balance the width and depth of the network
Convolution factorization I
• Factorizing convolutions reduces the number of parameters without losing much expressiveness
• For example, 5x5 convolutions can be factorized into a pair of 3x3 convolutions
• It is also possible to factorize an NxN convolution into a 1xN convolution followed by an Nx1 convolution (worked parameter counts below)
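The savings can be checked with per-filter weight counts (biases and the channel multiplier, which affect both sides equally, are omitted); a small Python sketch:

```python
k = 5
print(k * k, "weights for one 5x5 convolution")      # 25
print(2 * 3 * 3, "weights for two stacked 3x3s")     # 18
n = 7
print(n * n, "weights for one 7x7 convolution")      # 49
print(2 * n, "weights for a 1x7 followed by a 7x1")  # 14
```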
Convolution factorization II
The original Inception module (left) and the new factorized module
(right).
Efficient grid size reduction - problem
• Suppose we want to pass from a d × d grid with k filters to a d/2 × d/2 grid with 2k filters
• We need to compute a stride-1 convolution and then a pooling
• Computational cost dominated by the convolution: 2d²k² operations
• Inverting the order reduces the number of operations to 2(d/2)²k², but violates the bottleneck principle
Efficient grid size reduction - solution
• The solution is an Inception module with parallel convolution and pooling blocks of stride 2 (sketched below)
• Computationally efficient, with no representational bottleneck introduced
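A minimal sketch of the idea, assuming PyTorch; here the convolution branch keeps k filters so that concatenating it with the pooling branch yields 2k filters on the halved grid:

```python
import torch
import torch.nn as nn

class ReductionModule(nn.Module):
    """Parallel stride-2 convolution and pooling branches, concatenated."""
    def __init__(self, k):
        super().__init__()
        self.conv = nn.Conv2d(k, k, 3, stride=2, padding=1)  # d x d -> d/2 x d/2
        self.pool = nn.MaxPool2d(3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.conv(x), self.pool(x)], dim=1)  # 2k filters

y = ReductionModule(64)(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 128, 16, 16])
```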
The new architecture
• Combining the various modified Inception modules yields the new Inception-v2 architecture
Inception-v2: modules used
The factorized NxN modules are used with n = 7.
Inception-v2: training and observations
• The network was trained on the ILSVRC 2012 images using
stochastic gradient descent and the TensorFlow library
• Experiments showed that the two auxiliary classifiers have less impact on training convergence than expected
• In the early training phases, the model performance was not affected
by the presence of the auxiliary classifiers: they only improved the
performance near the end of training
• Removing the lower auxiliary classifier didn’t have any effect
• The main classifier performs better if batch normalization or dropout
are added to the auxiliary ones
• The model was also trained and tested on smaller receptive fields, with only a small loss of top-1 accuracy (76.6% for a 299x299 receptive field vs. 75.2% for 79x79). This is important for the post-classification of detection results
Inception-v2 to Inception-v3 results (single model)
• Each row’s Inception-v2 model adds a feature with respect to the
previous row’s model
• The last row’s model is referred to as the Inception-v3 model
Inception-v3 vs other models (single and ensemble)
Single-model results (left) and ensemble results (right).
• On the ILSVRC 2012 dataset, there is a significant improvement
versus state-of-the-art models, both with a single model and with an
ensemble of models
• Note that the ensemble errors here are validation errors (except for the one marked with ’*’, which is a test error)
Fully convolutional networks
Semantic segmentation
• Image segmentation is the process of partitioning an image into multiple segments (sets of pixels or super-pixels)
• Semantic segmentation partitions an image into semantically meaningful parts and classifies each part into one of a set of predetermined classes
• It’s possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel
Fully convolutional networks
• Shelhamer et al. [8] showed that fully convolutional networks trained pixels-to-pixels exceed the state of the art in semantic segmentation
• The fully convolutional networks they proposed take input of
arbitrary size and produce same-sized output to make dense
predictions
Convolutionalization of a classic net I
• Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-size inputs and produce non-spatial outputs
• The fully connected layers have fixed dimensions and drop the spatial coordinates
• However, we can view these fully connected layers as convolutions that cover their entire input regions (see the sketch below)
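The equivalence can be verified directly; a sketch assuming PyTorch, with the 512x7x7 feature-map size of VGG16 used as an illustrative example:

```python
import torch
import torch.nn as nn

# A fully connected layer over a 512x7x7 feature map...
fc = nn.Linear(512 * 7 * 7, 4096)
# ...is equivalent to a 7x7 convolution with 4096 output channels
conv = nn.Conv2d(512, 4096, kernel_size=7)
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)  # reuse the same weights
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-5))  # True

# On a larger input the convolution slides, producing a spatial output map
print(conv(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 4096, 8, 8])
```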
Convolutionalization of a classic net II
• These fully convolutional networks take input of any size and output classification maps
• The resulting maps are equivalent to the evaluation of the original
network on particular input patches
• The new network is more than 5 times faster than the original
network both at learning time and at inference time (considering a
10x10 output grid)
• Note that the output dimensions are typically reduced by
subsampling
• So output interpolation is needed to obtain dense predictions
• The interpolation is obtained through backwards convolutions
Backwards strided convolution
Upsampling from a 3x3 grid to a 5x5 grid (see the sketch below)
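A short sketch reproducing this example, assuming PyTorch; the kernel size and padding are one possible choice that maps 3x3 to 5x5 with stride 2:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 3, 3)
# A stride-2 backwards (transposed) convolution upsamples the grid
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=2, padding=1)
print(up(x).shape)  # torch.Size([1, 1, 5, 5]) -- the 3x3 grid upsampled to 5x5
```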
Architecture I
• Coarse semantic and fine local information are fused by combining lower and higher layers
• Three network variants with different layers fused were tested
Architecture II
• Three proven classification architectures were transformed to fully convolutional networks: AlexNet, VGG16 and GoogLeNet
• Each net’s final classifier layer is discarded and all the fully connected layers are converted to convolutions
• A 1x1 convolution with 21 channels (the 20 classes of the PASCAL VOC 2011 dataset plus background) is added at the end, followed by a backwards convolution layer (sketched below)
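A sketch of such a head, assuming PyTorch; the 4096 input channels and the 64/32 transposed-convolution parameters follow the common FCN-32s configuration but are assumptions here, and the padding/cropping details of the original nets are omitted:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # 20 PASCAL VOC classes plus background

head = nn.Sequential(
    nn.Conv2d(4096, NUM_CLASSES, kernel_size=1),  # per-location class scores
    nn.ConvTranspose2d(NUM_CLASSES, NUM_CLASSES,  # backwards convolution:
                       kernel_size=64, stride=32, padding=16),  # 32x upsampling
)
coarse = torch.randn(1, 4096, 10, 10)  # coarse feature map from the backbone
print(head(coarse).shape)  # torch.Size([1, 21, 320, 320])
```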
Architecture III
• The original nets were first pre-trained using image classification
• Then they were transformed to fully convolutional networks and fine-tuned on whole images (using SGD with momentum)
• The best results were obtained with FCN-VGG16
• Training on whole images proved to be as effective as sampling
patches
Architecture comparison
• The first model (FCN-32s) didn’t fuse different layers, and the resulting output is very coarse
• They then fused lower layers with the last one (as shown earlier) to
obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for
FCN-32s)
Results comparison I
• The model reaches state-of-the-art performance on semantic
segmentation
• The model is also much faster at inference time than previous architectures
Results comparison II
Hypercolumns
Hypercolumns I
• The last layer of a CNN captures general features of the image, but
is too coarse spatially to allow precise localization
• Earlier layers instead may be precise in localization but will not
capture semantics
• Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on three fine-grained localization tasks:
• Simultaneous detection and segmentation
• Keypoint localization
• Part labeling
Hypercolumns II
• The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the CNN, stacked into one vector (see the sketch below)
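A minimal sketch of hypercolumn extraction, assuming PyTorch and bilinear upsampling; the three feature-map shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, size):
    """Upsample each layer's feature map to a common size and stack along
    channels, so every location gets the responses of all layers above it."""
    up = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
          for f in feature_maps]
    return torch.cat(up, dim=1)

maps = [torch.randn(1, 64, 50, 50),    # early layer: fine, low-level
        torch.randn(1, 128, 25, 25),   # middle layer
        torch.randn(1, 256, 13, 13)]   # late layer: coarse, semantic
hc = hypercolumns(maps, size=(50, 50))
print(hc.shape)                # torch.Size([1, 448, 50, 50])
print(hc[0, :, 10, 20].shape)  # torch.Size([448]) -- one hypercolumn vector
```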
Problem setting I
• Input: a set of detections (after non-maximum suppression), each with a bounding box, a category label and a score
• Depending on the task, for each detection we want to:
• segment out the object
• segment its parts
• predict its keypoints
• Whatever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them
Problem setting II
• The information encoded in each heatmap and the number of
heatmaps depend on the chosen task:
• For segmentation, the heatmap encodes the probability that a
particular location is inside the object
• For part labeling, a separate heatmap is predicted for each part, encoding the probability that a location belongs to that part
• For keypoint localization a separate heatmap is predicted for each
keypoint, with each heatmap encoding the probability that the
keypoint is at a particular location
• The heatmaps are finally resized to the size of the expanded
bounding boxes
• So all the tasks are solved by assigning a probability to each of the 50x50 locations
Problem setting III
• In principle, a classifier should be trained for each of the 50x50 locations and for each category
• But doing so has three problems:
• The amount of data that each classifier sees during training is heavily reduced
• Training so many classifiers is computationally expensive
• While the classifier should vary with location, adjacent pixels should still be classified similarly
• The solution is to train a coarse K × K (usually K = 5 or K = 10) grid of classifiers and interpolate between them (see the sketch below)
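A sketch of the blending idea, assuming PyTorch; deriving the per-location weights by bilinearly upsampling one-hot grid indicators is one way to realize the interpolation, not necessarily the paper's exact scheme:

```python
import torch
import torch.nn.functional as F

K, S = 5, 50  # K x K grid of classifiers, S x S heatmap

# scores[g] = the score map produced by grid classifier g over the whole box
scores = torch.randn(K * K, S, S)

# Per-location blending weights: upsampling a one-hot K x K indicator for
# each classifier yields bilinear weights that sum to 1 at every location
eye = torch.eye(K * K).view(K * K, 1, K, K)
weights = F.interpolate(eye, size=(S, S), mode="bilinear",
                        align_corners=True).squeeze(1)  # shape (K*K, S, S)

final = (weights * scores).sum(dim=0)  # interpolated per-pixel scores
print(final.shape)  # torch.Size([50, 50])
```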
Network architecture
(Figure: each of the 3 combined layers passes through a convolution and an upsampling stage, feeding a sigmoid classifier with interpolation between the grid classifiers.)
Note: inverting the order of the upsampling and the convolutions (which compute the K × K grids), and computing them separately for each of the 3 combined layers, reduces the computational cost
Bounding box refining
• A dedicated technique, called rescoring, is used to improve the box selection
SDS results
Keypoint prediction results
Part labeling results
Conclusion
• We have seen how the Inception modules make it possible to train deeper and better networks in a computationally efficient manner
• We have then observed how to transform a classification CNN into a
fully convolutional network for pixel-wise classification
• We have learned the hypercolumn technique to combine high and
low level information to improve the accuracy on various fine-grained
localization tasks
Thank you for your patience! :)
References I
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to
handwritten zip code recognition,” Neural Computation, vol. 1(4),
pp. 541–551, 1989.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE, vol. 86,
pp. 2278–2324, 1998.
[3] A. W. Harley, “An interactive node-link visualization of convolutional
neural networks,” in ISVC, pp. 867–877, 2015.
[4] A. Kurenkov, “A ’brief’ history of neural nets and deep learning, part 4.” http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/.
References II
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1106–1114, 2012.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” CoRR, vol. abs/1409.4842, 2014.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the inception architecture for computer vision,” CoRR,
vol. abs/1512.00567, 2015.
[8] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” CoRR, vol. abs/1605.06211, 2016.
References III
[9] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” CoRR, vol. abs/1411.5752, 2014.