Modern Convolutional Neural Network
techniques for image segmentation
Deep Learning Journal Club
Gioele Ciaparrone
Michele Curci
November 30, 2016
University of Salerno
1. Introduction
2. The Inception architecture
3. Fully convolutional networks
4. Hypercolumns
5. Conclusion
CNN recap
• Sequence of convolutional and pooling layers
• Rectifier activation function
• Fully connected layers at the end
• Softmax function for classification
Convolution I
Convolution II
Valid padding (left) and same padding (right) convolutions
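The difference between the two padding modes can be sketched as a small output-size calculation. This is an illustrative helper, not something from the slides: the function name and the "same" rule used here (output size is the input size divided by the stride, rounded up) follow a common convention and are assumptions.

```python
import math

def conv_output_size(n, k, stride=1, padding="valid"):
    """Spatial output size for an input of width n and a kernel of width k."""
    if padding == "valid":
        # The kernel must fit entirely inside the input, so the output shrinks.
        return (n - k) // stride + 1
    if padding == "same":
        # The input is zero-padded so that, at stride 1, the size is preserved.
        return math.ceil(n / stride)
    raise ValueError(f"unknown padding mode: {padding}")

print(conv_output_size(5, 3, padding="valid"))
print(conv_output_size(5, 3, padding="same"))
```

With a 3x3 kernel on a 5x5 input, valid padding shrinks the grid while same padding keeps its size.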
LeNet-5 (1989-1998)
• First CNN (1989) proven to work well, used for handwritten Zip
code recognition [1]
• Refined through the years until the LeNet-5 version (1998) [2]
LeNet-5 interactive visualization [3]
It’s possible to interact with the network in 3D, manually drawing a digit
to be classified, clicking on the neurons to get info about the parameters
and the connected units, or rotating and zooming the network:
AlexNet (2012) [5]
• After a long hiatus in which deep learning was ignored [4], CNNs
received attention once again after Alex Krizhevsky overwhelmingly
won the ILSVRC in 2012 with AlexNet
• Structure very similar to LeNet-5, but with some new key insights:
very efficient GPU implementation, ReLU neurons and dropout
The Inception architecture
• Increasing model size tends to improve quality
• More computational resources are needed
• Computational efficiency and low parameter count are still important
• Mobile vision and embedded systems
• Big Data
Going Deeper with Convolutions [6]
• The Inception module addresses this problem by making better use of
the computing resources
• Proposed in 2014 by Christian Szegedy and other Google researchers
• Used in the GoogLeNet architecture that won both the ILSVRC
2014 classification and detection challenges
Inception module I
• Visual information is processed at various scales and then aggregated
• Since pooling operations are beneficial in CNNs, a parallel pooling
path has been added
• Problems:
• 3x3 and 5x5 convolutions can be very expensive on top of a layer
with lots of filters
• The number of filters substantially increases for each Inception layer
added, leading to a computational blow up
Inception module II
• Adding 1x1 convolutions before the bigger convolutions reduces
the number of channels they operate on, and so their computational cost
• The same is done after the pooling layer
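The saving from placing a 1x1 convolution before a larger one can be estimated by counting multiplications. The sketch below is only a rough cost model; the grid size and channel counts are hypothetical values chosen for illustration and do not come from the slides.

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplications for a k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

# Hypothetical sizes: 28x28 grid, 192 input channels, 32 filters on the
# 5x5 path, and 16 channels left after the 1x1 reduction.
h = w = 28
direct = conv_mults(h, w, 192, 32, 5)
reduced = conv_mults(h, w, 192, 16, 1) + conv_mults(h, w, 16, 32, 5)

print("5x5 directly:       ", direct)
print("1x1 then 5x5:       ", reduced)
print("approximate saving: ", round(direct / reduced, 1), "x")
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiplications, which is the effect the module relies on.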
GoogLeNet I
• GoogLeNet is a particular incarnation of the Inception architecture
• 22 layers with parameters (27 including pooling layers)
• 9 Inception modules
• 2 auxiliary classifiers to counter the vanishing gradient problem and to provide regularization
• Designed with computational efficiency in mind
• Inference can be run on devices with limited computational
resources, especially memory
• 7 of these networks used in an ensemble for the ILSVRC 2014
classification task
GoogLeNet II
GoogLeNet III
GoogLeNet - Training
• Trained with the DistBelief distributed machine learning system
• Asynchronous stochastic gradient descent with 0.9 momentum
• Image sampling methods changed many times before the competition
• Already converged models were trained further with other options
• Models were trained on crops of different sizes
• There is no definitive guidance on the single most effective way to
train these networks
GoogLeNet - ILSVRC 2014 Results
Classification (above) and object detection (below) results.
Google’s DeepDream uses a GoogLeNet to produce “machine dreams”
Inception-v2 and Inception-v3
• The Inception module authors later presented new optimized
versions of the architecture, called Inception-v2 and Inception-v3 [7]
• They managed to significantly improve on the GoogLeNet ILSVRC 2014 results
• The improvements were based on various key principles:
• Avoid representational bottlenecks
• Spatial aggregation on lower dimensional embeddings doesn’t usually
induce relevant losses in representational power
• Balance the width and depth of the network
Convolution factorization I
• Factorizing convolutions reduces the number of parameters
without losing much expressiveness
• For example, a 5x5 convolution can be factorized into a pair of 3x3
convolutions
• It is also possible to factorize an NxN convolution into a 1xN and an
Nx1 convolution
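The parameter savings of both factorizations follow from simple counting. The sketch below assumes the same number of input and output channels at every step and ignores biases; the channel count is a hypothetical value, and n = 7 is used only as an example.

```python
def conv_weights(kh, kw, channels):
    """Weights of a kh x kw convolution with `channels` inputs and outputs (no bias)."""
    return kh * kw * channels * channels

c = 64  # hypothetical channel count

# One 5x5 convolution vs. two stacked 3x3 convolutions (same 5x5 receptive field).
five = conv_weights(5, 5, c)
two_threes = 2 * conv_weights(3, 3, c)

# One n x n convolution vs. a 1 x n followed by an n x 1 convolution.
n = 7
full = conv_weights(n, n, c)
asym = conv_weights(1, n, c) + conv_weights(n, 1, c)

print("3x3 pair / 5x5:", two_threes / five)   # ratio 18/25
print("1xn+nx1 / nxn: ", asym / full)         # ratio 2/n
```

The ratios do not depend on the channel count: the 3x3 pair uses 18/25 of the weights, and the asymmetric pair uses 2/n of them.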
Convolution factorization II
The original Inception module (left) and the new factorized module
Efficient grid size reduction - problem
• Suppose we want to pass from a d × d grid with k filters to a d/2 × d/2
grid with 2k filters
• We need to compute a stride-1 convolution and then a pooling
• Computational cost dominated by the convolution on the larger grid: 2d²k² operations
• Inverting the order, the number of operations is reduced to 2(d/2)²k²,
but we violate the bottleneck principle
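As a worked check of the two costs, the sketch below takes the operation counts as 2d²k² (convolution on the full grid, then pooling) and 2(d/2)²k² (pooling first), following the cost model in [7]. The values of d and k are hypothetical.

```python
def ops_conv_then_pool(d, k):
    # Stride-1 convolution to 2k filters on the full d x d grid, then pooling.
    return 2 * d**2 * k**2

def ops_pool_then_conv(d, k):
    # Pooling first, so the convolution runs on the (d/2) x (d/2) grid.
    return 2 * (d // 2) ** 2 * k**2

d, k = 16, 64  # hypothetical grid size and filter count
print(ops_conv_then_pool(d, k) / ops_pool_then_conv(d, k))
```

For even d the ratio is 4: pooling first costs a quarter as much, at the price of squeezing the representation through a smaller grid before the filters are expanded.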
Efficient grid size reduction - solution
• The solution is an Inception module with convolution and pooling
blocks with stride 2
• Computationally efficient and no representational bottleneck
The new architecture
• Using various modified Inception modules, here is the new
Inception-v2 architecture
Inception-v2: modules used
n = 7
Inception-v2: training and observations
• The network was trained on the ILSVRC 2012 images using
stochastic gradient descent and the TensorFlow library
• Experiments showed that the two auxiliary classifiers have less
impact on training convergence than expected
• In the early training phases, the model performance was not affected
by the presence of the auxiliary classifiers: they only improved the
performance near the end of training
• Removing the lower auxiliary classifier didn’t have any effect
• The main classifier performs better if batch normalization or dropout
is added to the auxiliary ones
• The model was also trained and tested on smaller receptive fields
with only a small loss of top-1 accuracy (76.6% for 299x299 RF vs.
75.2% on 79x79 RF). Important for post-classification of detection
Inception-v2 to Inception-v3 results (single model)
• Each row’s Inception-v2 model adds a feature with respect to the
previous row’s model
• The last line’s model is referred to as the Inception-v3 model
Inception-v3 vs other models (single and ensemble)
Single model results Ensemble results
• On the ILSVRC 2012 dataset, there is a significant improvement
versus state-of-the-art models, both with a single model and with an
ensemble of models
• Note that the ensemble errors here are validation errors (except for
the one marked with ’*’, which is a test error)
Fully convolutional networks
Semantic segmentation
• Image segmentation is the process of partitioning an image into
multiple segments (sets of pixels or super-pixels)
• Semantic segmentation partitions an image into semantically
meaningful parts and classifies each part into one of
the pre-determined classes
• It’s possible to achieve the same result with pixel-wise
classification, i.e. assigning a class to each pixel
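Pixel-wise classification can be sketched as a softmax followed by an argmax at every pixel. This is a minimal illustration that assumes a score map is already available; the function names and the example scores are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def segment(score_map):
    """score_map[y][x] holds one score per class; return the most probable class per pixel."""
    labels = []
    for row in score_map:
        out_row = []
        for scores in row:
            probs = softmax(scores)
            out_row.append(probs.index(max(probs)))
        labels.append(out_row)
    return labels

# Hypothetical 2x2 image with 3 classes.
scores = [[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1]],
          [[0.0, 0.2, 3.0], [1.0, 0.9, 0.8]]]
print(segment(scores))
```

The result is a label map of the same spatial size as the score map, which is exactly a segmentation into the pre-determined classes.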
Fully convolutional networks
• Shelhamer et al. [8] showed that fully convolutional networks trained
pixels-to-pixels exceed the previous state of the art in semantic segmentation
• The fully convolutional networks they proposed take input of
arbitrary size and produce same-sized output to make dense predictions
Convolutionalization of a classic net I
• Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-sized
inputs and produce non-spatial outputs
• The fully connected layers have fixed dimensions and drop the
spatial coordinates
• However we can view these fully connected layers as convolutions
that cover their entire input regions
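The equivalence between a fully connected layer and a convolution covering its whole input can be shown on a one-dimensional toy case. This sketch is an illustration only: the single-output layer, its weights and the input signal are all hypothetical.

```python
def fully_connected(weights, bias, patch):
    """A single-output fully connected layer on a fixed-size input patch."""
    return sum(w * x for w, x in zip(weights, patch)) + bias

def as_convolution(weights, bias, signal):
    """The same weights used as a kernel slid over a longer input (valid, stride 1)."""
    k = len(weights)
    out = []
    for i in range(len(signal) - k + 1):
        acc = bias
        for j in range(k):
            acc += weights[j] * signal[i + j]
        out.append(acc)
    return out

weights, bias = [1, -2, 3], 1       # hypothetical layer parameters
signal = [4, 0, 2, 5, 1]            # input longer than the layer's fixed size

print(fully_connected(weights, bias, signal[:3]))
print(as_convolution(weights, bias, signal))
```

On an input of exactly the original size the convolution returns the single fully connected output; on a longer input it returns one output per patch, sharing the computation between overlapping patches.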
Convolutionalization of a classic net II
• These fully convolutional networks take input of any size and output
classification maps
• The resulting maps are equivalent to the evaluation of the original
network on particular input patches
• The new network is more than 5 times faster than the original
network both at learning time and at inference time (considering a
10x10 output grid)
• Note that the output dimensions are typically reduced by subsampling
• So output interpolation is needed to obtain dense predictions
• The interpolation is obtained through backwards convolutions
Backwards strided convolution
Upsampling from 3x3 grid to 5x5
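A backwards (transposed) convolution that upsamples a 3x3 grid to 5x5 can be sketched as follows. The stride of 2, the 3x3 kernel, the cropping of one cell per border and the fixed bilinear weights are assumptions made for this illustration; in a network the kernel weights would be learned. The helper assumes square inputs and kernels.

```python
def backwards_conv2d(x, kernel, stride=2, padding=1):
    """Scatter each input value, weighted by the kernel, into a larger output grid."""
    n, k = len(x), len(kernel)
    full = (n - 1) * stride + k
    buf = [[0.0] * full for _ in range(full)]
    for i in range(n):
        for j in range(n):
            for a in range(k):
                for b in range(k):
                    buf[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    # Crop `padding` cells from every border of the full result.
    return [row[padding:full - padding] for row in buf[padding:full - padding]]

# Kernel weights that reproduce bilinear interpolation for 2x upsampling.
bilinear = [[0.25, 0.5, 0.25],
            [0.5, 1.0, 0.5],
            [0.25, 0.5, 0.25]]
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

for row in backwards_conv2d(x, bilinear):
    print(row)
```

With these weights the original values land on every second output cell and the cells in between are averages of their neighbours.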
Architecture I
• Coarse and local information is fused by combining lower and higher layers
• 3 network types with different layers fused were tested
Architecture II
• 3 proven classification architectures were transformed to fully
convolutional: AlexNet, VGG16 and GoogLeNet
• Each net’s final classifier layer is discarded and all the fully
connected layers are converted to convolutions
• A 1x1 convolution with 21 channels (the number of classes in the
PASCAL VOC 2011 dataset) is added to the end, followed by a
backwards convolution layer
Architecture III
• The original nets were first pre-trained using image classification
• Then they were transformed to fully convolutional for fine tuning
using whole images (using SGD with momentum)
• The best results were obtained with FCN-VGG16
• Training on whole images proved to be as effective as sampling patches
Architecture comparison
• The first models (FCN-32s) didn’t fuse different layers, so the
resulting output is very coarse
• They then fused lower layers with the last one (as shown earlier) to
obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)
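The mean IU (mean intersection over union) metric can be computed from a pixel-level confusion matrix. The sketch below is a minimal illustration; skipping classes that never occur in either the ground truth or the prediction is an assumption of this sketch, and the example matrix is hypothetical.

```python
def mean_iu(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        true_c = sum(confusion[c])                        # pixels whose true class is c
        pred_c = sum(confusion[r][c] for r in range(n))   # pixels predicted as c
        union = true_c + pred_c - tp
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(tp / union)
    return sum(ious) / len(ious)

# Hypothetical 2-class confusion matrix over 10 pixels.
confusion = [[3, 1],
             [1, 5]]
print(mean_iu(confusion))
```

Each class contributes its intersection (correctly labeled pixels) divided by its union (pixels that are in the class or predicted as it), and the per-class values are averaged.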
Results comparison I
• The model reaches state-of-the-art performance on semantic segmentation
• Also the model is much faster at inference time than previous approaches
Results comparison II
Hypercolumns I
• The last layer of a CNN captures general features of the image, but
is too coarse spatially to allow precise localization
• Earlier layers instead may be precise in localization but will not
capture semantics
• Hariharan et al. [9] presented the hypercolumn concept, which puts
together the information from both higher and lower layers to obtain
better results on 3 fine-grained localization tasks:
• Simultaneous detection and segmentation
• Keypoint localization
• Part labeling
Hypercolumns II
• The hypercolumn corresponding to a given input location is defined
as the outputs of all units above that location at all layers of the
CNN, stacked into one vector
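The definition can be sketched by resizing every layer's feature map to a common grid and concatenating the channel vectors at each location. This is an illustrative sketch: nearest-neighbour resizing is used here only to keep the code short, which is an assumption rather than the method of [9], and the two feature maps are hypothetical.

```python
def upsample_nearest(fmap, size):
    """Resize a square feature map (rows of per-location channel lists) to size x size."""
    n = len(fmap)
    return [[fmap[y * n // size][x * n // size] for x in range(size)]
            for y in range(size)]

def hypercolumns(feature_maps, size):
    """For every location, concatenate the channel vectors of all resized layers."""
    resized = [upsample_nearest(f, size) for f in feature_maps]
    return [[sum((r[y][x] for r in resized), []) for x in range(size)]
            for y in range(size)]

# Hypothetical layers: a fine 4x4 map with 1 channel and a coarse 2x2 map with 2 channels.
fine = [[[y * 4 + x] for x in range(4)] for y in range(4)]
coarse = [[[10 * y + x, -1] for x in range(2)] for y in range(2)]

hc = hypercolumns([fine, coarse], 4)
print(hc[3][2])
```

Each location ends up with a vector whose length is the total number of channels across the combined layers, here 1 + 2 = 3.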
Problem setting I
• Input: a set of detections (subjected to non-maximum suppression),
each with a bounding box, a category label and a score
• Depending on the task, for each detection we want to:
• segment out the object
• segment its parts
• predict its keypoints
• Whichever the task, the bounding boxes are slightly expanded and a
50x50 heatmap is predicted on each of them
Problem setting II
• The information encoded in each heatmap and the number of
heatmaps depend on the chosen task:
• For segmentation, the heatmap encodes the probability that a
particular location is inside the object
• For part labeling a separate heatmap is predicted for each part,
where each heatmap is the probability a location belongs to that part
• For keypoint localization a separate heatmap is predicted for each
keypoint, with each heatmap encoding the probability that the
keypoint is at a particular location
• The heatmaps are finally resized to the size of the expanded
bounding boxes
• So all the tasks are solved assigning a probability to each of the
50x50 locations
Problem setting III
• For each of the 50x50 locations and for each category a classifier
should be trained
• But doing so has 3 problems:
• The amount of data that each classifier sees during training is
heavily reduced
• Training so many classifiers is computationally expensive
• While the classifier should vary according to the location, two adjacent
pixels should be classified similarly
• The solution is to train a coarse K × K (usually K = 5 or K = 10)
grid of classifiers and interpolate between them
Network architecture
[Figure: each of the 3 combined layers passes through a conv block followed by an upsample block]
Note: inverting the order of upsampling and convolutions (that calculate
the K × K grids) and computing them separately for each of the 3
combined layers reduces the computational cost
Bounding box refining
• A special technique is used to improve the box selection, called
SDS results
Keypoint prediction results
Part labeling results
Conclusion
• We have seen how the Inception modules make it possible to train deeper
and better networks in a computationally efficient manner
• We have then observed how to transform a classification CNN into a
fully convolutional network for pixel-wise classification
• We have learned the hypercolumn technique to combine high and
low level information to improve the accuracy on various fine-grained
localization tasks
Thank you for your patience! :)
References I
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation applied to
handwritten zip code recognition,” Neural Computation, vol. 1(4),
pp. 541–551, 1989.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE, vol. 86,
pp. 2278–2324, 1998.
[3] A. W. Harley, “An interactive node-link visualization of convolutional
neural networks,” in ISVC, pp. 867–877, 2015.
[4] A. Kurenkov, “A ’brief’ history of neural nets and deep learning, part
References II
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification
with deep convolutional neural networks,” Advances in Neural
Information Processing Systems, vol. 25, pp. 1106–1114, 2012.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” CoRR, vol. abs/1409.4842, 2014.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the inception architecture for computer vision,” CoRR,
vol. abs/1512.00567, 2015.
[8] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” CoRR, vol. abs/1605.06211, 2016.
References III
[9] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik,
“Hypercolumns for object segmentation and fine-grained
localization,” CoRR, vol. abs/1411.5752, 2014.

Modern Convolutional Neural Network techniques for image segmentation

  • 1. Modern Convolutional Neural Network techniques for image segmentation Deep Learning Journal Club Gioele Ciaparrone Michele Curci November 30, 2016 University of Salerno
  • 2. Index 1. Introduction 2. The Inception architecture 3. Fully convolutional networks 4. Hypercolumns 5. Conclusion
  • 4. CNN recap • Sequence of convolutional and pooling layers • Rectifier activation function • Fully connected layers at the end • Softmax function for classification
  • 6. Convolution II Valid padding (left) and same padding (right) convolutions
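
As a rough illustration of the two padding modes named on this slide, the sketch below computes the output size along one spatial dimension. The function name `conv_output_size` and its parameters are illustrative and do not come from the slides; "same" padding is assumed to follow the common convention of an output size of ceil(n / stride).

```python
def conv_output_size(n, k, stride=1, padding="valid"):
    """Output size along one dimension for input size n and kernel size k."""
    if padding == "valid":
        # No zero padding: the kernel must fit entirely inside the input.
        return (n - k) // stride + 1
    if padding == "same":
        # Zero padding is chosen so that stride 1 preserves the input size.
        return -(-n // stride)  # integer ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv_output_size(5, 3, padding="valid"))
print(conv_output_size(5, 3, padding="same"))
```

With a 3x3 kernel on a 5x5 input, valid padding shrinks each side to 3 while same padding keeps it at 5.
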
  • 7. LeNet-5 (1989-1998) • First CNN (1989) proven to work well, used for handwritten ZIP code recognition [1] • Refined through the years until the LeNet-5 version (1998) [2]
  • 8. LeNet-5 interactive visualization [3] It’s possible to interact with the network in 3D: manually drawing a digit to be classified, clicking on the neurons to get info about the parameters and the connected units, or rotating and zooming the network
  • 9. AlexNet (2012) [5] • After a long hiatus in which deep learning was largely ignored [4], CNNs received attention once again after Alex Krizhevsky won the ILSVRC in 2012 by a wide margin with AlexNet • Structure very similar to LeNet-5, but with some new key insights: a very efficient GPU implementation, ReLU neurons and dropout
  • 11. Motivations • Increasing model size tends to improve quality • More computational resources are needed • Computational efficiency and low parameter count are still important • Mobile vision and embedded systems • Big Data
  • 12. Going Deeper with Convolutions [6] • The Inception module addresses this problem by making better use of the computing resources • Proposed in 2014 by Christian Szegedy and other Google researchers • Used in the GoogLeNet architecture that won both the ILSVRC 2014 classification and detection challenges
  • 13. Inception module I • Visual information is processed at various scales and then aggregated • Since pooling operations are beneficial in CNNs, a parallel pooling path has been added • Problems: • 3x3 and 5x5 convolutions can be very expensive on top of a layer with lots of filters • The number of filters substantially increases for each Inception layer added, leading to a computational blow-up
  • 14. Inception module II • Adding 1x1 convolutions before the larger convolutions reduces dimensionality • The same is done after the pooling layer
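
To make the effect of the 1x1 reduction concrete, here is a minimal sketch that counts multiply-accumulate operations for a 5x5 convolution with and without a preceding 1x1 convolution. The layer sizes (28x28 grid, 192 input channels, 16 reduced channels, 32 output channels) are hypothetical values picked for illustration, and `conv_macs` is an illustrative helper, not something defined in the slides.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of a stride-1, same-padded convolution (biases ignored)."""
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical layer sizes, chosen only for illustration.
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# 1x1 convolution that reduces the channel count, followed by the 5x5 convolution.
reduced = conv_macs(H, W, 1, C_IN, C_REDUCED) + conv_macs(H, W, 5, C_REDUCED, C_OUT)

print(direct, reduced, round(direct / reduced, 1))
```

Under these assumed sizes the reduced path needs roughly a tenth of the operations of the direct 5x5 convolution.
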
  • 15. GoogLeNet I • GoogLeNet is a particular incarnation of the Inception architecture • 22 convolutional layers (27 including pooling) • 9 Inception modules • 2 auxiliary classifiers to mitigate the vanishing gradient problem and for regularization • Designed with computational efficiency in mind • Inference can be run on devices with limited computational resources, especially memory • 7 of these networks used in an ensemble for the ILSVRC 2014 classification task
  • 18. GoogLeNet - Training • Trained with the DistBelief distributed machine learning system • Asynchronous stochastic gradient descent with 0.9 momentum • Image sampling methods changed many times before the competition • Already converged models were trained further with other options • Models were trained on crops of different sizes • There is no definitive guidance on the single most effective way to train these networks
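
The momentum update mentioned on this slide can be sketched on a toy problem. This is only a single-process illustration of classical momentum with coefficient 0.9; it does not model the asynchronous, distributed aspect of DistBelief, and the learning rate, step count and quadratic objective are assumptions made for the example.

```python
def sgd_momentum(grad_fn, w, lr=0.1, momentum=0.9, steps=200):
    """Minimise a scalar function from its gradient using classical momentum."""
    velocity = 0.0
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradients.
        velocity = momentum * velocity - lr * grad_fn(w)
        w += velocity
    return w


# Toy objective f(w) = w^2, whose gradient is 2w and whose minimum is at w = 0.
w_final = sgd_momentum(lambda w: 2.0 * w, w=1.0)
print(w_final)
```
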
  • 19. GoogLeNet - ILSVRC 2014 Results Classification (above) and object detection (below) results.
  • 20. DeepDream Google’s DeepDream uses a GoogLeNet to produce “machine dreams”
  • 21. Inception-v2 and Inception-v3 • The Inception module authors later presented new optimized versions of the architecture, called Inception-v2 and Inception-v3 [7] • They managed to significantly improve on GoogLeNet's ILSVRC 2014 results • The improvements were based on various key principles: • Avoid representational bottlenecks • Spatial aggregation on lower dimensional embeddings doesn’t usually induce relevant losses in representational power • Balance the width and depth of the network
  • 22. Convolution factorization I • Factorizing convolutions reduces the number of parameters without losing much expressiveness • For example, a 5x5 convolution can be factorized into a pair of 3x3 convolutions • It is also possible to factorize an NxN convolution into a 1xN and an Nx1 convolution
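
The parameter savings from factorization can be checked with a small count. The sketch below counts weights per input/output channel pair and the receptive field of a stack of stride-1 kernels; it assumes every layer in the stack has the same number of channels and ignores biases, and the helper names are illustrative.

```python
def weights_per_channel_pair(kernel_shapes):
    """Weights used by a stack of kernels for one input/output channel pair."""
    return sum(h * w for h, w in kernel_shapes)


def receptive_field(kernel_shapes):
    """Receptive field (height, width) of stacked stride-1 convolutions."""
    height = 1 + sum(h - 1 for h, _ in kernel_shapes)
    width = 1 + sum(w - 1 for _, w in kernel_shapes)
    return height, width


single_5x5 = [(5, 5)]
two_3x3 = [(3, 3), (3, 3)]        # replaces one 5x5 convolution
asymmetric_3 = [(1, 3), (3, 1)]   # replaces one 3x3 convolution

for stack in (single_5x5, two_3x3, asymmetric_3):
    print(stack, weights_per_channel_pair(stack), receptive_field(stack))
```

Two stacked 3x3 kernels cover the same 5x5 receptive field with 18 weights instead of 25, and a 1x3 followed by a 3x1 kernel covers a 3x3 field with 6 weights instead of 9.
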
  • 23. Convolution factorization II The original Inception module (left) and the new factorized module (right).
  • 24. Efficient grid size reduction - problem • Suppose we want to pass from a $d \times d$ grid with $k$ filters to a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters • We need to compute a stride-1 convolution and then a pooling • Computational cost dominated by convolutions: $2d^2k^2$ operations • Inverting the order, the number of operations is reduced to $2\left(\frac{d}{2}\right)^2 k^2$, but we violate the bottleneck principle
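
The two operation counts on this slide differ by a factor of four, which a short calculation confirms. The sketch below follows the slide's simplified cost model, in which a convolution from k to 2k filters costs k * 2k operations per grid position and the kernel size is ignored; the function names and the example values d = 16, k = 64 are assumptions.

```python
def conv_then_pool_ops(d, k):
    """Convolve on the full d x d grid (k -> 2k filters), then pool."""
    return (d * d) * k * (2 * k)          # 2 d^2 k^2


def pool_then_conv_ops(d, k):
    """Pool first, then convolve on the (d/2) x (d/2) grid (k -> 2k filters)."""
    return (d // 2) ** 2 * k * (2 * k)    # 2 (d/2)^2 k^2


d, k = 16, 64  # hypothetical grid size and filter count
print(conv_then_pool_ops(d, k), pool_then_conv_ops(d, k))
```

Pooling first is four times cheaper, but it shrinks the representation before the filter count is expanded, which is the bottleneck the slide warns about.
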
  • 25. Efficient grid size reduction - solution • The solution is an Inception module with convolution and pooling blocks with stride 2 • Computationally efficient and no representational bottleneck introduced
  • 26. The new architecture • Using various modified Inception modules, here is the new Inception-v2 architecture
  • 28. Inception-v2: training and observations • The network was trained on the ILSVRC 2012 images using stochastic gradient descent and the TensorFlow library • Experiments showed the two auxiliary classifiers to have less impact on training convergence than expected • In the early training phases, model performance was not affected by the presence of the auxiliary classifiers: they only improved performance near the end of training • Removing the lower auxiliary classifier didn’t have any effect • The main classifier performs better if batch normalization or dropout is added to the auxiliary ones • The model was also trained and tested on smaller receptive fields with only a small loss of top-1 accuracy (76.6% for 299x299 RF vs. 75.2% for 79x79 RF). This is important for post-classification of detections
  • 29. Inception-v2 to Inception-v3 results (single model) • Each row’s Inception-v2 model adds a feature with respect to the previous row’s model • The last row’s model is referred to as the Inception-v3 model
  • 30. Inception-v3 vs other models (single and ensemble) (Tables: single model results; ensemble results) • On the ILSVRC 2012 dataset, there is a significant improvement versus state-of-the-art models, both with a single model and with an ensemble of models • Note that the ensemble errors here are validation errors (except for the one marked with ’*’, which is a test error)
  • 32. Semantic segmentation • Image segmentation is the process of partitioning an image into multiple segments (sets of pixels, or super-pixels) • Semantic segmentation is the partitioning of an image into semantically meaningful parts and the classification of each part into one of a set of predetermined classes • It’s possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel
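
Pixel-wise classification can be sketched as an argmax over per-class score maps. The layout `scores[class][row][column]`, the function name and the toy scores below are assumptions made for the example.

```python
def pixelwise_labels(scores):
    """Turn per-class score maps scores[c][i][j] into a label map labels[i][j]."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][i][j]) for j in range(width)]
        for i in range(height)
    ]


# Toy 2x2 image with two classes (0 = background, 1 = object).
scores = [
    [[0.1, 0.8], [0.3, 0.2]],  # class 0 scores
    [[0.9, 0.2], [0.4, 0.7]],  # class 1 scores
]
labels = pixelwise_labels(scores)
print(labels)
```
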
  • 33. Fully convolutional networks • Shelhamer et al. [8] showed that fully convolutional networks trained end-to-end, pixels-to-pixels, exceed the previous state of the art in semantic segmentation • The fully convolutional networks they proposed take input of arbitrary size and produce same-sized output to make dense predictions
  • 34. Convolutionalization of a classic net I • Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-size inputs and produce non-spatial outputs • The fully connected layers have fixed dimensions and drop the spatial coordinates • However, we can view these fully connected layers as convolutions whose kernels cover their entire input regions
  • 35. Convolutionalization of a classic net II • These fully convolutional networks take input of any size and output classification maps • The resulting maps are equivalent to the evaluation of the original network on particular input patches • The new network is more than 5 times faster than the original network both at training time and at inference time (considering a 10x10 output grid) • Note that the output dimensions are typically reduced by subsampling • So output interpolation is needed to obtain dense predictions • The interpolation is obtained through backwards convolutions
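
The equivalence between a fully connected layer and a convolution covering its whole input can be verified on a toy case. In the sketch below a dense layer with a single output unit is applied to a flattened k x k patch, and the same weights are then reshaped into a k x k kernel and slid over a larger image. All names and values are illustrative; a real network would have many channels and output units.

```python
def fully_connected(patch, weights, bias):
    """Dense layer with one output unit applied to a flattened k x k patch."""
    flat = [v for row in patch for v in row]
    return sum(w * v for w, v in zip(weights, flat)) + bias


def as_convolution(image, weights, bias, k):
    """Same weights viewed as a k x k kernel (valid padding, stride 1)."""
    kernel = [weights[r * k:(r + 1) * k] for r in range(k)]
    height, width = len(image), len(image[0])
    return [
        [
            sum(kernel[r][c] * image[i + r][j + c] for r in range(k) for c in range(k)) + bias
            for j in range(width - k + 1)
        ]
        for i in range(height - k + 1)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
weights, bias = [1, 0, -1, 2], 1

# Each entry of the output map equals the dense layer evaluated on one 2x2 patch.
print(as_convolution(image, weights, bias, k=2))
print(fully_connected([[1, 2], [4, 5]], weights, bias))
```

On an input larger than the patch size the convolutional view yields a whole map of outputs in one pass, with the work on overlapping patches shared.
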
  • 36. Backwards strided convolution Upsampling from 3x3 grid to 5x5
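
One way to obtain a 5x5 output from a 3x3 input with a backwards (transposed) convolution is sketched below. The stride of 2, the 3x3 kernel and the padding of 1 are assumptions, since the slide's figure is not available here, and the fixed bilinear kernel is used only so that the result is easy to check by hand; in a trained network the kernel weights would be learned.

```python
def transposed_conv2d(x, kernel, stride=2, padding=1):
    """Transposed convolution of a square input with a square kernel."""
    n, k = len(x), len(kernel)
    full = (n - 1) * stride + k
    out = [[0.0] * full for _ in range(full)]
    # Each input value scatters a scaled copy of the kernel into the output.
    for i in range(n):
        for j in range(n):
            for r in range(k):
                for c in range(k):
                    out[i * stride + r][j * stride + c] += x[i][j] * kernel[r][c]
    # Cropping `padding` cells per side mirrors padding in the forward convolution.
    end = full - padding
    return [row[padding:end] for row in out[padding:end]]


bilinear = [[0.25, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 0.25]]
x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = transposed_conv2d(x, bilinear)
for row in y:
    print(row)
```

With this kernel the original values reappear at the even positions of the output and the cells in between hold averages of their neighbours.
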
  • 37. Architecture I • Coarse and local information is fused by combining lower and higher layers • 3 network types with different layers fused were tested
  • 38. Architecture II • 3 proven classification architectures were converted to fully convolutional networks: AlexNet, VGG16 and GoogLeNet • Each net’s final classifier layer is discarded and all the fully connected layers are converted to convolutions • A 1x1 convolution with 21 channels (the number of classes in the PASCAL VOC 2011 dataset) is added to the end, followed by a backwards convolution layer
  • 39. Architecture III • The original nets were first pre-trained on image classification • Then they were converted to fully convolutional networks and fine-tuned on whole images (using SGD with momentum) • The best results were obtained with FCN-VGG16 • Training on whole images proved to be as effective as sampling patches
  • 40. Architecture comparison • The first models (FCN-32s) didn’t fuse different layers, and their output is very coarse • They then fused lower layers with the last one (as shown earlier) to obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)
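
The mean IU (intersection over union) metric quoted above can be computed from a predicted and a ground-truth label map as follows. This is a minimal sketch on a single toy image; skipping classes that appear in neither map is an assumption of this example, and benchmark implementations may accumulate counts over a whole dataset instead.

```python
def mean_iu(prediction, ground_truth, n_classes):
    """Mean intersection over union between two label maps of equal shape."""
    ious = []
    for c in range(n_classes):
        intersection = union = 0
        for pred_row, gt_row in zip(prediction, ground_truth):
            for p, g in zip(pred_row, gt_row):
                intersection += (p == c and g == c)
                union += (p == c or g == c)
        if union > 0:  # assumption: classes absent from both maps are skipped
            ious.append(intersection / union)
    return sum(ious) / len(ious)


ground_truth = [[0, 0], [1, 1]]
prediction = [[0, 1], [1, 1]]
print(mean_iu(prediction, ground_truth, n_classes=2))
```

Here class 0 has IU 1/2 and class 1 has IU 2/3, so the mean is 7/12.
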
  • 41. Results comparison I • The model reaches state-of-the-art performance on semantic segmentation • The model is also much faster at inference time than previous architectures
  • 44. Hypercolumns I • The last layer of a CNN captures general features of the image, but is too coarse spatially to allow precise localization • Earlier layers instead may be precise in localization but will not capture semantics • Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on 3 fine-grained localization tasks: • Simultaneous detection and segmentation • Keypoint localization • Part labeling
  • 45. Hypercolumns II • The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the CNN, stacked into one vector
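
The definition above can be illustrated by stacking, for one location, the activations of every channel of every layer. In this sketch the feature maps are plain nested lists, and coarser layers are read with a nearest-neighbour lookup; that lookup is a simplification assumed for the example and not necessarily the upsampling scheme used by the authors. All names and values are illustrative.

```python
def hypercolumn(feature_maps, i, j, size):
    """Stack the activations above location (i, j) of a size x size input.

    feature_maps is a list of layers, each layer a list of channels, and each
    channel a square 2D list whose resolution may be lower than `size`.
    """
    column = []
    for layer in feature_maps:
        for channel in layer:
            n = len(channel)
            # Nearest-neighbour lookup into the (possibly coarser) feature map.
            r = min(i * n // size, n - 1)
            c = min(j * n // size, n - 1)
            column.append(channel[r][c])
    return column


fine = [[[10 * r + c for c in range(4)] for r in range(4)]]   # 1 channel, 4x4
coarse = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]                 # 2 channels, 2x2
print(hypercolumn([fine, coarse], i=3, j=1, size=4))
```

The resulting vector has one entry per channel across all layers, so its length is the total channel count.
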
  • 46. Problem setting I • Input: a set of detections (subjected to non-maximum suppression), each with a bounding box, a category label and a score • Depending on the task, for each detection we want to: • segment out the object • segment its parts • predict its keypoints • Whatever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them
  • 47. Problem setting II • The information encoded in each heatmap and the number of heatmaps depend on the chosen task: • For segmentation, the heatmap encodes the probability that a particular location is inside the object • For part labeling a separate heatmap is predicted for each part, where each heatmap is the probability that a location belongs to that part • For keypoint localization a separate heatmap is predicted for each keypoint, with each heatmap encoding the probability that the keypoint is at a particular location • The heatmaps are finally resized to the size of the expanded bounding boxes • So all the tasks are solved by assigning a probability to each of the 50x50 locations
  • 48. Problem setting III • For each of the 50x50 locations and for each category a classifier should be trained • But doing so has 3 problems: • The amount of data that each classifier sees during training is heavily reduced • Training so many classifiers is computationally expensive • While the classifier should vary according to the location, two adjacent pixels should be classified similarly • The solution is to train a coarse K × K (usually K = 5 or K = 10) grid of classifiers and interpolate between them
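
Interpolating between a coarse grid of classifiers can be sketched with bilinear weights: each pixel's score is a weighted sum of the outputs of its nearest grid classifiers, so neighbouring pixels receive similar scores. The placement of the classifier centres, the clamping at the borders and the representation of classifiers as plain callables are assumptions of this example.

```python
def weights_1d(p, size, K):
    """Linear interpolation weights of pixel coordinate p over K classifier centres."""
    # Continuous grid coordinate; classifier a is centred on grid coordinate a.
    t = (p + 0.5) * K / size - 0.5
    t = min(max(t, 0.0), K - 1.0)  # clamp pixels that lie outside the centres
    lo = int(t)
    hi = min(lo + 1, K - 1)
    frac = t - lo
    w = [0.0] * K
    w[lo] += 1.0 - frac
    w[hi] += frac
    return w


def interpolated_score(i, j, features, classifiers, size=50):
    """Score of pixel (i, j) as a bilinear blend of a K x K grid of classifiers."""
    K = len(classifiers)
    wi, wj = weights_1d(i, size, K), weights_1d(j, size, K)
    return sum(
        wi[a] * wj[b] * classifiers[a][b](features)
        for a in range(K)
        for b in range(K)
        if wi[a] * wj[b] > 0.0
    )


# Toy 5 x 5 grid in which classifier (a, b) always returns its row index a.
grid = [[(lambda f, a=a: float(a)) for b in range(5)] for a in range(5)]
print(interpolated_score(9, 20, None, grid))
```

At most four classifiers contribute to any pixel, so only K × K classifiers are trained instead of one per location.
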
  • 49. Network architecture (Figure labels: conv, upsample, classifier interpolation, sigmoid) Note: inverting the order of the upsampling and the convolutions (which compute the K × K grids), and computing them separately for each of the 3 combined layers, reduces the computational cost
  • 50. Bounding box refining • A special technique, called rescoring, is used to improve the box selection
  • 55. Conclusion • We have seen how the Inception modules make it possible to train deeper and better networks in a computationally efficient manner • We have then observed how to transform a classification CNN into a fully convolutional network for pixel-wise classification • We have learned the hypercolumn technique to combine high- and low-level information to improve the accuracy on various fine-grained localization tasks
  • 56. Thank you for your patience! :)
  • 57. References I [1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1(4), pp. 541–551, 1989. [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, pp. 2278–2324, 1998. [3] A. W. Harley, “An interactive node-link visualization of convolutional neural networks,” in ISVC, pp. 867–877, 2015. [4] A. Kurenkov, “A ’brief’ history of neural nets and deep learning, part 4.” a-brief-history-of-neural-nets-and-deep-learning-part-4/.
  • 58. References II [5] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1106–1114, 2012. [6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014. [7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015. [8] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” CoRR, vol. abs/1605.06211, 2016.
  • 59. References III [9] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik, “Hypercolumns for object segmentation and fine-grained localization,” CoRR, vol. abs/1411.5752, 2014.