2. Disclaimer
• This is not an in-depth session on neural
networks and ML – see the links at the end
• Due to the number of concepts, this is a rather
fast-paced presentation and doesn't cover the
mentioned subjects exhaustively
• I hope it gives you some insight so you can
pick the topics that interest you and explore
them in your free time
3. What is it all about?
• Problems Computer Vision tries to address
• Neural Networks (in 60 seconds)
• Basic building blocks of CNNs
• History and evolution of CNNs
• How and why do CNNs work on digital images?
• Limitations of CNNs
• Object detection
4. Computer vision
• Making computers understand and make sense
of digital images (photos, videos and data
from sensors)
• Object classification
• Object detection
• Video tracking
• Object segmentation
5. Computer vision - applications
• Medical images classification
• Land surveying
• Surveillance systems – threat detection
• Autonomous vehicles
• Military systems – e.g. guided missiles
9. How do they learn?
• Get the data you are going to use for training:
– X (input), Y (ground truth for a given X)
– Split into training/validation, or
training/validation/test
• Feed "train X" into the network
• Compare the result with the ground truth and adjust
the weights through back-propagation until the
network is optimized
• Occasionally test on validation data
• After training completes, do a final check on the test
data
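The steps above can be sketched in a few lines of NumPy. A linear model stands in for a neural network (so the "back-propagation" step reduces to a single gradient computation); the data, split ratio, learning rate and epoch count are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # X (input)
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w                           # Y (ground truth for given X)

# Split into training/validation (80/20 here).
X_train, X_val = X[:80], X[80:]
Y_train, Y_val = Y[:80], Y[80:]

w = np.zeros(3)
for epoch in range(200):
    # Feed "train X" in, compare the result with the ground truth,
    # and adjust the weights (gradient descent; in a deep net the
    # gradients would come from back-propagation).
    grad = X_train.T @ (X_train @ w - Y_train) / len(Y_train)
    w -= 0.1 * grad
    if epoch % 50 == 0:
        # Occasionally test on validation data.
        val_loss = np.mean((X_val @ w - Y_val) ** 2)

final_val_loss = np.mean((X_val @ w - Y_val) ** 2)
```

The final check on held-out data is what tells you the model generalizes rather than merely memorizing the training set.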
11. Digit recognition with NN
• Digit recognition – a common test for ML
algorithms (NNs, SVMs, etc.)
• MNIST – database of handwritten digits
– Gray scale, 28x28 pixels
13. Preparing network layout
• MNIST – reshape each 28x28 image into 1x784 (the
input size)
• 10 digits to classify == 10 output classes of the NN
• E.g. reshaping 3x3 into 1x9
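The reshaping step is a one-liner in NumPy; a small 3x3 example alongside the MNIST shapes:

```python
import numpy as np

# Flatten a 3x3 "image" into a 1x9 row vector, the same way a
# 28x28 MNIST digit is reshaped into 1x784 for a fully connected net.
img = np.arange(9).reshape(3, 3)
flat = img.reshape(1, 9)          # pixels laid out row by row

mnist_img = np.zeros((28, 28))
mnist_flat = mnist_img.reshape(1, 784)
```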
14. Sample NN for MNIST
https://www.tensorflow.org/get_started/mnist/beginners
[Figure: example network output for one digit – one probability per class: 0.01, 0.02, 0.02, 0.06, 0.07, 0.10, 0.08, 0.12, 0.13, 0.39 – summing to 1; the highest value (0.39) gives the predicted class]
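The forward pass of the one-layer softmax model from the linked tutorial (y = softmax(xW + b)) can be sketched in plain NumPy. The weights below are random placeholders, not trained values; the point is the shapes and that the outputs form a probability distribution over the 10 digit classes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(784)                          # flattened 28x28 digit
W = rng.normal(scale=0.01, size=(784, 10))   # untrained placeholder weights
b = np.zeros(10)

probs = softmax(x @ W + b)   # one probability per digit class 0-9
```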
15. Problems with FC NN for images?
• The number of connections (weights) grows really
fast – it becomes memory- and compute-expensive
• E.g. if we add 1 hidden layer with 392 units to the
MNIST NN we increase the number of weights
from 7,840 to 311,248 (784x392 + 392x10, biases not counted)
• What if we operate on larger images with more
classes? E.g. 50x50 with 20 classes and a larger
hidden layer? A 2.5k input, 1.25k hidden layer and
20-class output would give 3.15M weights
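The arithmetic behind these counts (biases ignored: each fully connected layer contributes inputs × outputs weights):

```python
# Total weight count of a fully connected network, given the size of
# each layer from input to output (biases not counted).
def fc_weights(layer_sizes):
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

no_hidden = fc_weights([784, 10])           # MNIST, no hidden layer: 7,840
with_hidden = fc_weights([784, 392, 10])    # add a 392-unit hidden layer
larger = fc_weights([50 * 50, 1250, 20])    # 50x50 input, 20 classes: 3.15M
```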
16. Problems with FC NN for images?
“Images are highly spatially correlated, thus looking at a pixel
at a time is wasteful.”
17. What do CNNs do differently?
• Networks designed specifically for images*
• They look at regions of images
• Input images have 3D shape [WxHxD]
• They use convolutions to extract spatial
features (i.e. edges, blobs of colors, shapes)
18. CNN the genesis
• The first LeNet was developed in 1988
• LeNet5 – a pioneering CNN used back in 1994 for
digit classification – MNIST error rate of 0.95%*
• A neural network designed specifically for digit
classification
• State of the art at the time, widely used by
banks, etc.
• What's so special? The way it looks at the input
19. LeNet5
• How are feature maps produced?
– By sliding filters over the input and convolving them.
– E.g. a 5x5 filter over a 28x28 digit with padding of 2,
stride 1
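The sliding-and-convolving step can be sketched directly (implemented as cross-correlation, which is what most CNN frameworks actually compute under the name "convolution"):

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide `kernel` over `image`, taking a dot product with the
    underlying patch at each position, and collect the results
    into a feature map."""
    if pad:
        image = np.pad(image, pad)          # zero padding on all edges
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()
    return out

# A 5x5 filter over a 28x28 digit with padding 2 and stride 1
# preserves the spatial size: the feature map is also 28x28.
digit = np.zeros((28, 28))
fmap = conv2d(digit, np.ones((5, 5)), stride=1, pad=2)
```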
21. CNN – building blocks
• CNN hyperparameters:
– K: number of filters
• The input is passed through filters to produce 'feature maps' – more
filters will learn different properties of the images
• Also known as kernels or weights
– F: spatial extent of a filter
• The portion of the input the filter looks at, e.g. a 2x2, 3x3, etc. patch of
the input
– S: stride
• How far we slide the filter at each step over the input – smaller strides
capture more details of the image
– P: zero padding
• Add '0's at the edges of the image – preserves the spatial size after
CONV and helps capture features at the edges of images
• ReLU – activation function following a conv layer; it breaks
the linearity (activation functions other than ReLU exist)
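These hyperparameters determine the spatial size of the output feature map via the standard formula out = (W − F + 2P) / S + 1, where W is the input size:

```python
# Spatial output size of a conv layer: input size W, filter extent F,
# stride S, zero padding P.
def conv_out_size(W, F, S, P):
    return (W - F + 2 * P) // S + 1

same = conv_out_size(28, 5, 1, 2)    # padding of 2 preserves the 28x28 size
valid = conv_out_size(28, 5, 1, 0)   # no padding shrinks the map to 24x24
```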
23. CNNs: other ops
• Max pooling: dimension reduction
Applying a 2x2 max-pool filter to a 4x4 matrix:
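A minimal sketch of that operation: keep only the largest value in each non-overlapping 2x2 block, halving each spatial dimension.

```python
import numpy as np

m = np.array([[1, 3, 2, 1],
              [4, 2, 1, 5],
              [3, 1, 1, 0],
              [6, 2, 2, 9]])

# Reshape into 2x2 blocks, then take the max within each block.
pooled = m.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled == [[4, 5],
#            [6, 9]]
```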
24. Advancements in CNNs?
• The first LeNet was developed in 1988; LeNet5 in 1994
(paper published in 1998)
• Since then, not much advanced until 2010,
when Dan Claudiu Ciresan and Jürgen
Schmidhuber published one of the first
implementations of a NN on a GPU (GTX 280)
• NNs are generally quite expensive to
compute, and running them on GPUs enables
training more complex models.
25. How to compare software solutions in
image recognition space?
• ILSVRC - ImageNet Large Scale Visual
Recognition Competition
• Teams come up with solutions to classify
objects in digital images
• Around 1.2m images to train on; 1,000 classes
• Scores are based on the top-5 and top-1
classification error rates
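The top-k metric counts a prediction as correct when the true class appears among the k highest-scoring classes; a small sketch (scores and labels are made up):

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among
    the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # k best classes per image
    hits = [label in row for row, label in zip(top_k, labels)]
    return 1.0 - np.mean(hits)

scores = np.array([[0.1, 0.2, 0.3, 0.4],   # true class 3: a top-1 hit
                   [0.4, 0.3, 0.2, 0.1]])  # true class 1: only a top-2 hit
labels = np.array([3, 1])

top1 = top_k_error(scores, labels, 1)   # 0.5: one of two images missed
top2 = top_k_error(scores, labels, 2)   # 0.0: both within the top 2
```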
27. Advancements in CNNs?
• ILSVRC 2012 – AlexNet, a CNN, wins the competition
with top-5 error @15.3% (runner-up: 26.2%)
– It has 650k neurons and 60m parameters
– Trained on 2x GTX 580 for 5 to 6 days
28. Advancements in CNNs?
• (2013) OverFeat wins the competition with top-5
error @13.6% on ImageNet
– Uses much smaller kernels (3x3)
– Is deeper than the previous network
29. Advancements in CNNs?
• (2014) VGG top-5 error @7.1% (2nd place)
– Learns bounding boxes – i.e. object locations in the image
– Much deeper than previous networks: 140m parameters
31. Advancements in CNNs?
• (2014) GoogLeNet top-5 error @6.67%
– 22 layers! But only ~7m parameters
– Introduces the concept of 'inception': applies filters of
different sizes to capture invariances at different scales
33. Advancements in CNNs?
• (2015) ResNet top-5 error @3.57%
– How many layers? 152!
– Concept of shortcut connections – prevents
information from being forgotten
34. Training deep CNNs
• Large nets can take weeks of training on multiple
high-end GPUs to learn the ImageNet set
• The more data the better. How do we expand the
training set?
– Randomly flip left-right or top-bottom
– Randomly crop
– Introduce noise
– Modify colors
– Rotate
– All of the above at the same time
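A few of these augmentations sketched in NumPy (the image, crop size and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))

# Randomly flip left-right and/or top-bottom.
aug = img[:, ::-1] if rng.random() < 0.5 else img
aug = aug[::-1, :] if rng.random() < 0.5 else aug

# Random 28x28 crop out of the 32x32 image.
y, x = rng.integers(0, 5, size=2)
crop = aug[y:y + 28, x:x + 28]

# Additive noise; each pass over the data can yield a
# slightly different version of the same image.
noisy = crop + rng.normal(scale=0.01, size=crop.shape)
```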
35. Accuracy vs. performance
• OK, so models are getting bigger (more ops,
more weights) and more accurate. How about
getting faster?
• MobileNet: a recent network by Google
researchers with a reduced number of
connections which outperforms simpler
networks (ref at the end)
36. Training deep CNNs
• That's a lot of hassle to train a net – there must be
another way, right?
• Use pre-trained networks; fine-tune the last few FC
layers.
• With pre-trained nets you are reusing a snapshot of
the kernels (weights) from the conv layers.
• Fine-tuning works because the conv layers (closer to
the input) learned reusable patterns (edges, colors,
textures, etc.) which apply across
multiple computer vision categories
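The freeze-and-fine-tune idea can be sketched in NumPy: the "pre-trained" feature extractor is a fixed random matrix standing in for conv-layer kernels, and only the new classification head receives gradient updates. Data, sizes and the learning rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(scale=1 / 28, size=(784, 64))  # pretend pre-trained kernels
W_head = np.zeros((64, 10))                          # fresh FC layer to fine-tune

X = rng.random((32, 784))
Y = rng.integers(0, 10, size=32)

def loss_and_probs(W_head):
    h = np.maximum(X @ W_frozen, 0)                  # frozen ReLU features
    logits = h @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(32), Y]).mean(), p, h

for _ in range(300):
    loss, p, h = loss_and_probs(W_head)
    p[np.arange(32), Y] -= 1                         # softmax cross-entropy gradient
    W_head -= 0.1 * h.T @ p / 32                     # only the head is updated
    # W_frozen is never touched: that's the point of fine-tuning

final_loss, _, _ = loss_and_probs(W_head)            # started at ln(10) ≈ 2.30
```

In a real framework the same effect is achieved by marking the conv layers as non-trainable and training only the replacement FC layers.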
38. There are multiple approaches
• R-CNN (R for Region-based), Fast R-CNN, Faster R-
CNN
– Basically, has two outputs – a "regression head" and
a "classification head"
• YOLO (You Only Look Once), YOLO v2 (YOLO9000)
– Applies a single CNN to the full image, divides it into small
regions and predicts probabilities of object classes for
each box, then filters the boxes by a confidence threshold
https://www.youtube.com/watch?v=VOC3huqHrss
39. • OverFeat: https://arxiv.org/pdf/1312.6229.pdf
• VGG in a large-scale image setting:
https://arxiv.org/pdf/1409.1556.pdf
• CNN benchmarks: https://github.com/jcjohnson/cnn-benchmarks
• MobileNets: https://arxiv.org/pdf/1704.04861.pdf
• Convolution explained:
http://www.songho.ca/dsp/convolution/convolution2d_example.html
• Pre-trained nets (scores differ from those achieved in the
competition due to variances in the training process):
https://github.com/tensorflow/models/tree/master/slim
40. • Good ML course:
https://www.coursera.org/learn/machine-
learning
----- Meeting Notes (28/06/17 21:53) -----
how the challenges of computer vision can be addressed with convolutional neural networks
Vision – being able to see.
Video tracking – how long are you waiting in a queue? Which aisles are you visiting in a shop?
Apple vs orange; where is Wally? Video tracking with YOLO. Threat detection: toy gun vs real.
Before we move onto ML and NNs, let's touch on computer vision: what sort of problems is it trying to address?
Activation functions – break the linearity of NNs, allowing them to learn more complex functions than linear ones
This network tries to learn different aspects of the input images. How? Randomly initialized filters are applied to extract interesting features – kernels (weights) convolve over the input and produce feature maps