2. Disclaimer
• This is not an in-depth session on neural
networks and ML – see the links at the end
• Due to the number of concepts, this is a rather
fast-paced presentation and doesn't cover the
mentioned subjects exhaustively
• I hope it gives you some insight so you can
pick the topics that interest you and explore
them in your free time
3. What is it all about?
• Problems Computer Vision tries to address
• Neural Networks (in 60 seconds)
• Basic building blocks of CNNs
• History and evolution of CNNs
• How and why do CNNs work on digital images?
• Limitations of CNNs
• Object detection
4. Computer vision
• Making computers understand and make sense
of digital images (photos, videos and data
from sensors)
• Object classification
• Object detection
• Video tracking
• Object segmentation
5. Computer vision - applications
• Medical images classification
• Land surveying
• Surveillance systems – threat detection
• Autonomous vehicles
• Military systems – e.g. guided missiles
9. How do they learn?
• Get the data you are going to use for training:
– X (input), Y (ground truth for a given X)
– Split into training/validation, or
training/validation/test
• Feed "train X" into the network
• Compare the result with the ground truth and adjust
the weights through back-propagation until the
network is optimized
• Occasionally test on validation data
• After training completes, do a final check on the test
data
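The steps above can be sketched in a few lines of NumPy. A linear model stands in for a neural network (so the "back-propagation" step reduces to a single gradient computation); the data, split ratio, learning rate and epoch count are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # X (input)
true_w = np.array([1.0, -2.0, 0.5])
Y = X @ true_w                           # Y (ground truth for given X)

# Split into training/validation (80/20 here).
X_train, X_val = X[:80], X[80:]
Y_train, Y_val = Y[:80], Y[80:]

w = np.zeros(3)
for epoch in range(200):
    # Feed "train X" in, compare the result with the ground truth,
    # and adjust the weights (gradient descent; in a deep net the
    # gradients would come from back-propagation).
    grad = X_train.T @ (X_train @ w - Y_train) / len(Y_train)
    w -= 0.1 * grad
    if epoch % 50 == 0:
        # Occasionally test on validation data.
        val_loss = np.mean((X_val @ w - Y_val) ** 2)

final_val_loss = np.mean((X_val @ w - Y_val) ** 2)
```

The final check on held-out data is what tells you the model generalizes rather than merely memorizing the training set.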
11. Digit recognition with NN
• Digit recognition – a common test for ML
algorithms (NNs, SVMs, etc.)
• MNIST – database of handwritten digits
– Gray scale, 28x28 pixels
13. Preparing network layout
• MNIST – reshape each 28x28 image into 1x784 (the
input size)
• 10 digits to classify == 10 output classes of the NN
• E.g. reshaping 3x3 into 1x9
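The reshaping step is a one-liner in NumPy; a small 3x3 example alongside the MNIST shapes:

```python
import numpy as np

# Flatten a 3x3 "image" into a 1x9 row vector, the same way a
# 28x28 MNIST digit is reshaped into 1x784 for a fully connected net.
img = np.arange(9).reshape(3, 3)
flat = img.reshape(1, 9)          # pixels laid out row by row

mnist_img = np.zeros((28, 28))
mnist_flat = mnist_img.reshape(1, 784)
```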
14. Sample NN for MNIST
https://www.tensorflow.org/get_started/mnist/beginners
[Figure: example network output for one digit – one probability per class: 0.01, 0.02, 0.02, 0.06, 0.07, 0.10, 0.08, 0.12, 0.13, 0.39 – summing to 1; the highest value (0.39) gives the predicted class]
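The forward pass of the one-layer softmax model from the linked tutorial (y = softmax(xW + b)) can be sketched in plain NumPy. The weights below are random placeholders, not trained values; the point is the shapes and that the outputs form a probability distribution over the 10 digit classes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(784)                          # flattened 28x28 digit
W = rng.normal(scale=0.01, size=(784, 10))   # untrained placeholder weights
b = np.zeros(10)

probs = softmax(x @ W + b)   # one probability per digit class 0-9
```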
15. Problems with FC NN for images?
• The number of connections (weights) grows really
fast – it becomes memory- and compute-expensive
• E.g. if we add 1 hidden layer with 392 units to the
MNIST NN we increase the number of weights
from 7,840 to 311,248 (784x392 + 392x10, biases not counted)
• What if we operate on larger images with more
classes? E.g. 50x50 with 20 classes and a larger
hidden layer? A 2.5k input, 1.25k hidden layer and
20-class output would give 3.15M weights
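The arithmetic behind these counts (biases ignored: each fully connected layer contributes inputs × outputs weights):

```python
# Total weight count of a fully connected network, given the size of
# each layer from input to output (biases not counted).
def fc_weights(layer_sizes):
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

no_hidden = fc_weights([784, 10])           # MNIST, no hidden layer: 7,840
with_hidden = fc_weights([784, 392, 10])    # add a 392-unit hidden layer
larger = fc_weights([50 * 50, 1250, 20])    # 50x50 input, 20 classes: 3.15M
```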
16. Problems with FC NN for images?
“Images are highly spatially correlated, thus looking at a pixel
at a time is wasteful.”
17. What do CNNs do differently?
• Networks designed specifically for images*
• They look at regions of images
• Input images have 3D shape [WxHxD]
• They use convolutions to extract spatial
features (i.e. edges, blobs of colors, shapes)
18. CNN the genesis
• The first LeNet was developed in 1988
• LeNet5 – a pioneering CNN used back in 1994 for
digit classification – MNIST error rate of 0.95%*
• A neural network designed specifically for digit
classification
• State of the art at the time, widely used by
banks, etc.
• What's so special? The way it looks at the input
19. LeNet5
• How are feature maps produced?
– By sliding filters over the input and convolving them.
– E.g. a 5x5 filter over a 28x28 digit with padding of 2,
stride 1
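The sliding-and-convolving step can be sketched directly (implemented as cross-correlation, which is what most CNN frameworks actually compute under the name "convolution"):

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide `kernel` over `image`, taking a dot product with the
    underlying patch at each position, and collect the results
    into a feature map."""
    if pad:
        image = np.pad(image, pad)          # zero padding on all edges
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = (patch * kernel).sum()
    return out

# A 5x5 filter over a 28x28 digit with padding 2 and stride 1
# preserves the spatial size: the feature map is also 28x28.
digit = np.zeros((28, 28))
fmap = conv2d(digit, np.ones((5, 5)), stride=1, pad=2)
```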
21. CNN – building blocks
• CNN hyperparameters:
– K: number of filters
• The input is passed through filters to produce 'feature maps' – more
filters will learn different properties of the images
• Also known as kernels or weights
– F: spatial extent of a filter
• The portion of the input the filter looks at, e.g. a 2x2, 3x3, etc. patch of
the input
– S: stride
• How far we slide the filter at each step over the input – smaller strides
capture more details of the image
– P: zero padding
• Add '0's at the edges of the image – preserves the spatial size after
CONV and helps capture features at the edges of images
• ReLU – activation function following a conv layer; it breaks
the linearity (activation functions other than ReLU exist)
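These hyperparameters determine the spatial size of the output feature map via the standard formula out = (W − F + 2P) / S + 1, where W is the input size:

```python
# Spatial output size of a conv layer: input size W, filter extent F,
# stride S, zero padding P.
def conv_out_size(W, F, S, P):
    return (W - F + 2 * P) // S + 1

same = conv_out_size(28, 5, 1, 2)    # padding of 2 preserves the 28x28 size
valid = conv_out_size(28, 5, 1, 0)   # no padding shrinks the map to 24x24
```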
23. CNNs: other ops
• Max pooling: dimension reduction
Applying a 2x2 max-pool filter to a 4x4 matrix:
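A minimal sketch of that operation: keep only the largest value in each non-overlapping 2x2 block, halving each spatial dimension.

```python
import numpy as np

m = np.array([[1, 3, 2, 1],
              [4, 2, 1, 5],
              [3, 1, 1, 0],
              [6, 2, 2, 9]])

# Reshape into 2x2 blocks, then take the max within each block.
pooled = m.reshape(2, 2, 2, 2).max(axis=(1, 3))
# pooled == [[4, 5],
#            [6, 9]]
```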
24. Advancements in CNNs?
• The first LeNet was developed in 1988; LeNet5 in 1994
(paper published in 1998)
• Since then, not much advanced until 2010,
when Dan Claudiu Ciresan and Jürgen
Schmidhuber published one of the first
implementations of a NN on a GPU (GTX 280)
• NNs are generally quite expensive to
compute, and running them on GPUs enables
training more complex models.
25. How to compare software solutions in
image recognition space?
• ILSVRC - ImageNet Large Scale Visual
Recognition Competition
• Teams come up with solutions to classify
objects in digital images
• Around 1.2m images to train on; 1,000 classes
• Scores are based on the top-5 and top-1
classification error rates
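The top-k metric counts a prediction as correct when the true class appears among the k highest-scoring classes; a small sketch (scores and labels are made up):

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is NOT among
    the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # k best classes per image
    hits = [label in row for row, label in zip(top_k, labels)]
    return 1.0 - np.mean(hits)

scores = np.array([[0.1, 0.2, 0.3, 0.4],   # true class 3: a top-1 hit
                   [0.4, 0.3, 0.2, 0.1]])  # true class 1: only a top-2 hit
labels = np.array([3, 1])

top1 = top_k_error(scores, labels, 1)   # 0.5: one of two images missed
top2 = top_k_error(scores, labels, 2)   # 0.0: both within the top 2
```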
27. Advancements in CNNs?
• ILSVRC 2012 – AlexNet, a CNN, wins the competition
with top-5 error @15.3% (runner-up: 26.2%)
– It has 650k neurons and 60m parameters
– Trained on 2x GTX 580 for 5 to 6 days
28. Advancements in CNNs?
• (2013) OverFeat wins the competition with top-5
error @13.6% on ImageNet
– Uses much smaller kernels (3x3)
– Is deeper than the previous network
29. Advancements in CNNs?
• (2014) VGG top-5 error @7.1% (2nd place)
– Learns bounding boxes – i.e. object locations in the image
– Much deeper than previous networks: 140m parameters
31. Advancements in CNNs?
• (2014) GoogLeNet top-5 error @6.67%
– 22 layers! But only ~7m parameters
– Introduces the concept of 'inception': applies filters of
different sizes to capture invariances at different scales
33. Advancements in CNNs?
• (2015) ResNet top-5 error @3.57%
– How many layers? 152!
– Concept of shortcut connections – prevents
information from being forgotten
34. Training deep CNNs
• Large nets can take weeks of training on multiple
high-end GPUs to learn the ImageNet set
• The more data the better. How do we expand the
training set?
– Randomly flip left-right or top-bottom
– Randomly crop
– Introduce noise
– Modify colors
– Rotate
– All of the above at the same time
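A few of these augmentations sketched in NumPy (the image, crop size and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))

# Randomly flip left-right and/or top-bottom.
aug = img[:, ::-1] if rng.random() < 0.5 else img
aug = aug[::-1, :] if rng.random() < 0.5 else aug

# Random 28x28 crop out of the 32x32 image.
y, x = rng.integers(0, 5, size=2)
crop = aug[y:y + 28, x:x + 28]

# Additive noise; each pass over the data can yield a
# slightly different version of the same image.
noisy = crop + rng.normal(scale=0.01, size=crop.shape)
```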
35. Accuracy vs. performance
• OK, so models are getting bigger (more ops,
more weights) and more accurate. How about
getting faster?
• MobileNet: a recent network by Google
researchers with a reduced number of
connections which outperforms simpler
networks (ref at the end)
36. Training deep CNNs
• That's a lot of hassle to train a net – there must be
another way, right?
• Use pre-trained networks; fine-tune the last few FC
layers.
• With pre-trained nets you are reusing a snapshot of
the kernels (weights) from the conv layers.
• Fine-tuning works because the conv layers (closer to
the input) learned reusable patterns (edges, colors,
textures, etc.) which apply across
multiple computer vision categories
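The freeze-and-fine-tune idea can be sketched in NumPy: the "pre-trained" feature extractor is a fixed random matrix standing in for conv-layer kernels, and only the new classification head receives gradient updates. Data, sizes and the learning rate are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_frozen = rng.normal(scale=1 / 28, size=(784, 64))  # pretend pre-trained kernels
W_head = np.zeros((64, 10))                          # fresh FC layer to fine-tune

X = rng.random((32, 784))
Y = rng.integers(0, 10, size=32)

def loss_and_probs(W_head):
    h = np.maximum(X @ W_frozen, 0)                  # frozen ReLU features
    logits = h @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(32), Y]).mean(), p, h

for _ in range(300):
    loss, p, h = loss_and_probs(W_head)
    p[np.arange(32), Y] -= 1                         # softmax cross-entropy gradient
    W_head -= 0.1 * h.T @ p / 32                     # only the head is updated
    # W_frozen is never touched: that's the point of fine-tuning

final_loss, _, _ = loss_and_probs(W_head)            # started at ln(10) ≈ 2.30
```

In a real framework the same effect is achieved by marking the conv layers as non-trainable and training only the replacement FC layers.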
38. There are multiple approaches
• R-CNN (R for Region-based), Fast R-CNN, Faster R-
CNN
– Basically, has two outputs – a "regression head" and
a "classification head"
• YOLO (You Only Look Once), YOLO v2 (YOLO9000)
– Applies a single CNN to the full image, divides it into small
regions and predicts probabilities of object classes for
each box, then filters the boxes by a confidence threshold
https://www.youtube.com/watch?v=VOC3huqHrss
39. • OverFeat: https://arxiv.org/pdf/1312.6229.pdf
• VGG in a large-scale image setting:
https://arxiv.org/pdf/1409.1556.pdf
• CNN benchmarks: https://github.com/jcjohnson/cnn-benchmarks
• MobileNets: https://arxiv.org/pdf/1704.04861.pdf
• Convolution explained:
http://www.songho.ca/dsp/convolution/convolution2d_example.html
• Pre-trained nets (scores differ from those achieved in the
competition due to variances in the training process):
https://github.com/tensorflow/models/tree/master/slim
40. • Good ML course:
https://www.coursera.org/learn/machine-
learning
----- Meeting Notes (28/06/17 21:53) -----
how the challenges of computer vision can be addressed with convolutional neural networks
Vision – being able to see.
Video tracking – how long are you waiting in a queue? Which aisles are you visiting in a shop?
Apple vs orange; where is Wally? Video tracking with YOLO. Threat detection: toy gun vs real.
Before we move onto ML and NNs, let's touch on computer vision: what sort of problems is it trying to address?
Activation functions – break the linearity of NNs, allowing them to learn more complex functions than linear ones
This network tries to learn different aspects of the input images. How? Randomly initialized filters are applied to extract interesting features – kernels (weights) convolve over the input and produce feature maps