4. Neural Networks
• Neurons are connected via synapses
• A neuron receives activations from other neurons
• When these activations reach a threshold, it fires an electrical signal to other neurons
http://en.wikipedia.org/wiki/Neuron
6. Multi-Layer Perceptron
• Number of input nodes = number of features
• 1 hidden layer
• Full connection between consecutive layers
• 2-class problem
• 1 output node, with class labels +1/-1 or 1/0
• More than 2 classes
• Number of output nodes = number of classes (WHY?)
• Each output node is associated with a single class
• Classification rule: assign the input pattern to the class whose corresponding output node gives the maximal value (see the sketch below)
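A minimal sketch of such an MLP in Keras; the hidden-layer size, input size, and class count below are illustrative assumptions, not values from the slides.

# Hypothetical sketch: MLP with one hidden layer for a multi-class problem
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 784   # number of input nodes = number of features (assumed)
n_classes = 10     # number of output nodes = number of classes (assumed)

model = Sequential([
    Dense(128, activation='sigmoid', input_shape=(n_features,)),   # 1 hidden layer, fully connected
    Dense(n_classes, activation='softmax'),                        # one output node per class
])

# Classification rule: pick the class whose output node gives the maximal value
# predicted_class = model.predict(x).argmax(axis=-1)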
11. Gradient
• The gradient of a function f with a set of parameters θ is the vector of partial derivatives of f with respect to each parameter θi
• The gradient indicates the direction of change in θ that most increases f(θ)
• Question: How can we use the Gradient to train
the neural networks?
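In symbols (a standard formulation, not taken from the slides): the gradient, and the gradient-descent answer to the question above, with learning rate \eta and training error E, are

\nabla_\theta f(\theta) = \left( \frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_n} \right), \qquad \theta \leftarrow \theta - \eta \, \nabla_\theta E(\theta)

i.e. step against the gradient of the error in order to decrease it.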
12. Error Back-propagation (Backprop)
• Squared error as the objective function
• The gradient points in the direction of increased E -> So what?
• Use the chain rule
• h(x) = f(g(x))
• h'(x) = ?
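A worked answer to the two questions above, in standard notation (o_j an output, t_j its target, s_j a pre-activation): the chain rule gives

h'(x) = f'(g(x)) \, g'(x)

and with squared error E = \tfrac{1}{2} \sum_j (o_j - t_j)^2, applying it repeatedly yields

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial s_j} \cdot \frac{\partial s_j}{\partial w_{ij}}

so each weight can be moved against its own partial derivative, w_{ij} \leftarrow w_{ij} - \eta \, \partial E / \partial w_{ij}.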
14. Backprop (2)
• The calculation proceeds backward from the output layer
• Changing the objective function affects only the output-node terms
• Cross entropy for classification problems
• Changing the activation function affects only the partial derivatives with respect to the pre-activations s^l_j
• Can be applied to any NN structure
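As a concrete instance of the cross-entropy case (a standard result, not spelled out on the slide): with a softmax output layer and cross-entropy loss

E = -\sum_k t_k \log o_k, \qquad \frac{\partial E}{\partial s_k} = o_k - t_k

so the output-layer error term is simply the difference between prediction and target, and the backward pass proceeds from there unchanged.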
16. Optimizers
• SGD (stochastic gradient descent)
• Adadelta: adaptive learning rate method
• RMSprop: divide the gradient by a running average of its recent magnitude
• Adam: use first and second moments to scale the gradient
• Nadam: Adam with Nesterov momentum
• ….
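A minimal sketch of selecting one of these optimizers when compiling a Keras model; the tiny model and the learning rate are placeholders.

# Hypothetical example: the same small model can be compiled with any of the optimizers above
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, RMSprop, Adam, Nadam

model = Sequential([Dense(10, activation='softmax', input_shape=(784,))])

optimizer = Adam(learning_rate=1e-3)   # alternatives: SGD(momentum=0.9), RMSprop(), Nadam()
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])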
17. Neural Networks for Machine Learning, Lecture 6c: The momentum method, G. Hinton
https://www.youtube.com/watch?v=8yg2mRJx-z4
18. ex2: MNIST with MLP
Load MNIST data
bitmap 28x28 pixels = 784 features
10 classes
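A minimal sketch of this exercise using the Keras built-in MNIST loader; the hidden-layer size and training settings are illustrative assumptions.

# Hypothetical ex2 sketch: MNIST with an MLP
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Load MNIST: 28x28 bitmaps, 10 classes
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0   # 28*28 = 784 features
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),          # one output node per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))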
20. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition", Proc. of the IEEE, November 1998
[Figure: MLP vs. CNN architectures]
21. Convolutional NN (CNN)
• Image Convolution
• Feature extractor + Classifier
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition", Proc. of the IEEE, November 1998
22. Conv2D
• Input shape = (nchannels, w, w)
• data_format = 'channels_first'
• Conv2D(filters, kernel_size, padding, strides, data_format)
• filters = number of convolution kernels = number of output channels
• kernel_size: e.g. (3, 3)
• padding: 'same', 'valid'
• strides: step size when sliding the kernel across the image
• ex: Conv2D(10, (3, 3), padding='same')
• Output shape = (10, w, w)
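A small sketch checking these shapes in Keras; the single input channel and w = 28 are assumptions for illustration.

# Hypothetical example: Conv2D output shape with the channels_first data format
from tensorflow.keras.layers import Conv2D, Input
from tensorflow.keras.models import Model

w = 28
inputs = Input(shape=(1, w, w))                      # (nchannels, w, w) with 1 input channel
x = Conv2D(10, (3, 3), padding='same',
           data_format='channels_first')(inputs)     # 10 kernels -> 10 output channels
model = Model(inputs, x)
print(model.output_shape)                            # (None, 10, 28, 28), i.e. (10, w, w) per sample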
23. ex3: MNIST with CNN
BatchNormalization: normalize the outputs of a layer
MaxPooling: reduce the size of the feature maps
alternative: AveragePooling
Is this larger or smaller than the previous MLP?
ReLU(x) = max{ 0, x }
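A minimal sketch of such a CNN in Keras; the filter counts and layer order are assumptions, not the exact exercise network.

# Hypothetical ex3 sketch: small CNN for MNIST using the layers named above
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPooling2D, Flatten, Dense)

model = Sequential([
    Conv2D(16, (3, 3), padding='same', input_shape=(28, 28, 1)),
    BatchNormalization(),             # normalize the outputs of the previous layer
    Activation('relu'),               # ReLU(x) = max{0, x}
    MaxPooling2D((2, 2)),             # halve the feature-map size (AveragePooling2D is an alternative)
    Conv2D(32, (3, 3), padding='same'),
    BatchNormalization(),
    Activation('relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),  # the top layers are an MLP classifier
])
model.summary()                       # compare the parameter count with the previous MLP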
25. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition", Proc. of the IEEE, November 1998
[Figure: MLP vs. CNN results; 1.2 million params + preprocessing]
26. • CNN achieves better results than MLP
• The MLP structure is simpler but uses a larger number of parameters
• CNN is deeper
• CNN is slower -> GPUs since 2010, 2012-now!!
• The CNN top layers are an MLP
• An MLP with a deeper structure yields bad results -> gradient vanishing problem
27. Gradient Vanishing
• Backprop multiplies one derivative factor per layer; with sigmoid-like activations each factor f' < 1, so the gradient shrinks toward zero in deep networks
• Solutions
• Pretraining: stack of RBMs, stack of Autoencoders
• CNN: shared weights
• ReLU: f' = 1 or 0
G. Hinton, S. Osindero, and Y.-W. Teh, “A Fast Learning Algorithm for Deep Belief Nets",
In Neural Computation, 18, pp. 1527-1554, 2006
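The standard way to state this (my notation, consistent with the backprop slides): each layer's error signal is

\delta^{l} = f'(s^{l}) \odot \left( (W^{l+1})^{\top} \delta^{l+1} \right), \qquad |f'(s)| \le \tfrac{1}{4} \text{ for the sigmoid,}

so a product of many factors below 1 drives the gradient toward 0 in the early layers; ReLU avoids this because f' = 1 on the active side.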
28. Labeled Faces in the Wild (LFW)
Y. Sun et al. Deep Learning Face Representation from Predicting 10,000 classes, CVPR 2014
http://vis-www.cs.umass.edu/lfw/
29. ex4: DeepID network
• Sun et al. used 60 of these NNs.
• Each one is trained on part of the face images
Y. Sun et al. Deep Learning Face Representation from Predicting 10,000 classes, CVPR 2014
30. • The same network structure trained on a different dataset yields different performance
• Now you should know how to construct a basic CNN
• The design of the CNN structure is an open problem
• The number of kernels
• The depth of the network
• Whether to reduce the feature-map size or not
• Activations
• …
33. Some results
• GIST (global feature) + SVM (RBF): 85.57%
• SIFT (local feature) + BoF + SVM (histogram intersection): 89.69%
• SIFT + SPM (spatial pyramid matching) + LLC (locality-constrained linear coding) + SVM (linear): 91.48%
• CNN (AlexNet trained on another dataset) + SVM (linear): 93.58%
S. Lazebnik et al. "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", CVPR 2006
J. Wang et al. "Locality-constrained Linear Coding for Image Classification", CVPR 2010
D. Lowe "Object recognition from local scale-invariant features", ICCV 1999
35. Overfit problem
• Understanding vs. memorizing
• Rule of thumb: when the number of parameters is large, the model tends to overfit
• Problem: the NN structure is defined first!
• Solutions (two of them sketched below)
• Early stopping
• Weight decay
• Optimal brain damage
• Dropout ~ simulated brain damage
• Increase training data
[Figure: training error and validation error vs. iterations]
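A minimal sketch of two of these remedies in Keras (Dropout and early stopping); the rates, patience, and dummy data are placeholder values.

# Hypothetical example: Dropout + early stopping on a small classifier
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(256, activation='relu', input_shape=(784,)),
    Dropout(0.5),                      # randomly drop units during training ("simulated brain damage")
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Stop training when the validation error stops improving
stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

x = np.random.rand(1000, 784).astype('float32')                  # dummy data for illustration
y = np.eye(10)[np.random.randint(0, 10, size=1000)]
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[stop])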
37. Inception module
[Figure: the original Inception design and its variations]
Explore various methods to combine convolutions (see the sketch below)
C. Szegedy et al. “Rethinking the Inception Architecture for Computer Vision”, CVPR 2016
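A simplified Inception-style block in Keras as a sketch; the branch widths are assumptions, and the real module also places 1x1 reductions before the larger convolutions.

# Hypothetical sketch of a simplified Inception-style block: parallel convolutions of
# different sizes applied to the same input, concatenated along the channel axis
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Concatenate, Input
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28, 64))
b1 = Conv2D(32, (1, 1), padding='same', activation='relu')(inputs)
b2 = Conv2D(32, (3, 3), padding='same', activation='relu')(inputs)
b3 = Conv2D(32, (5, 5), padding='same', activation='relu')(inputs)
b4 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(inputs)
outputs = Concatenate()([b1, b2, b3, b4])    # combine the parallel branches channel-wise
model = Model(inputs, outputs)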
38. Xception module
• A convolution kernel finds correlations in 3D (2D spatial + 1D channel)
• Inception hypothesis: cross-channel and spatial correlations can be decoupled
• Extreme case: the Xception module (see the sketch below)
F. Chollet “Xception: Deep Learning with Depthwise Separable Convolutions”, arXiv:1610.02357
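This extreme case corresponds to a depthwise separable convolution, available in Keras as SeparableConv2D; a sketch with illustrative sizes.

# Hypothetical sketch: depthwise separable convolution = per-channel spatial convolution
# followed by a 1x1 convolution across channels (spatial and cross-channel steps decoupled)
from tensorflow.keras.layers import SeparableConv2D, Input
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28, 64))
x = SeparableConv2D(128, (3, 3), padding='same', activation='relu')(inputs)
model = Model(inputs, x)
model.summary()    # noticeably fewer parameters than a standard Conv2D(128, (3, 3))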
39. ResNet
• Add skip connections
• Weights of unnecessary blocks are driven toward zero, so each block only learns a residual
• Acts like a mixture of several shallower networks (see the sketch below)
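A basic residual block in Keras as a sketch; the channel count is an assumption, and batch normalization is omitted for brevity.

# Hypothetical sketch of a residual block: output = F(x) + x via a skip connection
from tensorflow.keras.layers import Conv2D, Add, Activation, Input
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28, 64))
x = Conv2D(64, (3, 3), padding='same', activation='relu')(inputs)
x = Conv2D(64, (3, 3), padding='same')(x)   # F(x): the residual branch
x = Add()([x, inputs])                      # skip connection adds the input back
outputs = Activation('relu')(x)
model = Model(inputs, outputs)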