This presentation begins by explaining the basic algorithms of machine learning and, using the same concepts, discusses in detail 2 supervised deep learning algorithms - Artificial Neural Nets and Convolutional Neural Nets. The relationship between Artificial Neural Nets and basic machine learning algorithms such as logistic regression and softmax is also explored. For hands-on practice, the implementation of ANN's and CNN's on the MNIST dataset is also explained.
2. Getting the Code:
You can download the code used in the session from
https://github.com/Manuyashchaudhary/AnnSession.git
For setting up the environment, you can use the docker command:
docker pull manuyash/annsession
3. Structure of the Session:
1. Linear, Logistic Regression & Multinomial Logistic Regression - Quick Review
2. Structure of ANN’s
3. Learning in ANN’s
4. Implementation of ANN on MNIST
5. Layers of CNN - Convolutional, RELU, Pooling
6. Parameter Sharing
7. Implementation on MNIST
4. Data:
Data is the key to solving any machine learning or artificial intelligence
problem.
Divide the dataset into training and testing data.
The algorithm learns on training data and is evaluated on testing data.
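A minimal sketch of such a split, assuming NumPy arrays and a hypothetical helper named train_test_split (the 80/20 ratio and the toy data are illustrative choices, not taken from the session code):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the row indices once, then hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical toy data: 10 samples with 3 features each.
X = np.arange(30, dtype=float).reshape(10, 3)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape)
```

Shuffling before splitting keeps the class mix of the two parts similar when the original file is ordered by label.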
5. MNIST:
Images are of uniform colour and size.
Images are of size 28X28 pixels.
Each pixel in the image can take a value
from 0 to 255.
Properties: Fixed space, constant
background.
6. Image as Input:
This is a 14X14
representation of pixel
intensities.
The numbers represent the
fraction of pixel covered
by the digit.
7. Starting from the basics: Linear Regression
To predict values of a variable (Y)
which is dependent on multiple
other independent variables
(X).
Fit a line to your training data
which generalises the data
well.
y = mx + c; m is the slope and c is
the constant (intercept).
8. Fitting a line:
For regression, y = x𝛃
To find the line of best fit, find the 𝛃’s
such that the line generalises
the data well.
Cost/Error = Mean Squared
Error
Minimise the error to get line of
best fit.
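The fitting procedure can be sketched as follows. This is an illustrative example only: the function names, the learning rate and the toy data are assumptions, and gradient descent is used here simply as one way to minimise the mean squared error.

```python
import numpy as np

def mse(X, y, beta):
    """Mean squared error of the predictions X @ beta."""
    return float(np.mean((X @ beta - y) ** 2))

def fit_line(X, y, learning_rate=0.1, steps=2000):
    """Minimise the MSE with plain (full-batch) gradient descent."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        gradient = 2.0 * X.T @ (X @ beta - y) / len(y)
        beta -= learning_rate * gradient
    return beta

# Hypothetical data lying exactly on the line y = 3x + 1.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([x, np.ones_like(x)])  # the column of ones carries the constant c
y = 3.0 * x + 1.0
beta = fit_line(X, y)
print(beta, mse(X, y, beta))
```

Predicting for unseen data is then a dot product between the new inputs and the fitted beta.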
10. Key Takeaways:
You train an algorithm on training data.
You are training to find best possible combination of parameters.
There is a cost function.
The cost function can be minimised by gradient descent.
Perform a dot product between parameters and the unseen data and
get the output.
11. Logistic Regression:
1. It is used for a binary
classification.
2. Outputs probability of a class.
3. It is based on a function called
sigmoid function.
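As a rough illustration, a logistic regression prediction is a dot product followed by the sigmoid. The weights, bias and inputs below are made-up values, and the 0.5 cut-off is the usual convention for turning the probability into a class.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, bias):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ weights + bias)

# Hypothetical parameters and two input rows.
weights = np.array([2.0, -1.0])
bias = 0.5
X = np.array([[1.0, 1.0], [-2.0, 0.5]])
probs = predict_proba(X, weights, bias)
labels = (probs > 0.5).astype(int)  # positive class when the probability exceeds 0.5
print(probs, labels)
```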
13. Key Takeaways:
Perform dot product between parameters and input; apply sigmoid function.
There is a cost function.
The cost function can be minimised for the parameters using gradient descent.
What happens when the variable which is being predicted has more than 2
classes?
14. Softmax:
1. Multinomial logistic regression.
2. Sigmoid → Softmax
3. For each output/category, we compute a
weighted sum of the x’s, add a bias, and
then apply softmax.
4. Softmax is defined as $\mathrm{softmax}(x_j) = \frac{e^{x_j}}{\sum_k e^{x_k}}$,
where $x_j$ is the weighted sum (plus bias) computed by the jth neuron.
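A small sketch of softmax, assuming the standard definition exp(x_j) / sum_k exp(x_k). Subtracting the maximum score first is a common numerical-stability trick and is an addition of this sketch; it does not change the result.

```python
import numpy as np

def softmax(x):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # avoids overflow in exp; the ratios are unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical scores for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```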
16. Softmax to Neural Networks:
Neural networks can be considered as a network of multiple logistic
regression computations stacked in parallel and in series to each
other.
Softmax is applied to the last layer in neural networks.
You will understand these points as we go forward.
17. Structure of Artificial Neural Networks:
1. Input, output and hidden layers.
2. The layers are arranged
sequentially and each layer is
made of multiple neurons.
3. Input layer number of neurons =
length of input vector
4. Output layer number of neurons
= number of classes in the
dependent or target variable.
18. Assumptions, Parameters and Hyperparameters
Neurons within a layer do not interact with each other.
The layers are densely connected.
For every neuron there is a bias and for every interaction there is a weight.
The parameters are the weights and biases of the network which are to be
found.
19. Working of a Neuron:
The input to a neuron is the
weighted sum of inputs + bias.
Activation function is used to
introduce non-linearity in the
network.
If the output is greater than a
threshold, the neuron will fire,
otherwise not.
20. Message passing in NN’s
Consider 2 hidden layers ‘l-1’ and ‘l’ with n1 and n2 neurons respectively.
Total number of interconnections would be (n1 x n2).
Output of layer ‘l-1’ is $a^{l-1} = (a^{l-1}_1, a^{l-1}_2, \ldots, a^{l-1}_{n_1})$.
The output of layer l-1 will be the input to layer l.
The weighted input of layer l will be $w\,a^{l-1}$, where w is an n2 x n1 matrix.
Add the bias to this.
The output of layer l is the activation function applied to this weighted input
plus bias.
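The message passing between two layers can be sketched as below. The layer sizes, random weights and the choice of sigmoid as the activation are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, w, b, activation=sigmoid):
    """Output of layer l given the output a_prev of layer l-1."""
    z = w @ a_prev + b  # weighted sum of the inputs plus the bias
    return activation(z)

n1, n2 = 4, 3  # hypothetical sizes of layers l-1 and l
rng = np.random.default_rng(0)
a_prev = rng.random(n1)            # output of layer l-1
w = rng.standard_normal((n2, n1))  # one weight per interconnection
b = np.zeros(n2)                   # one bias per neuron of layer l
a = layer_forward(a_prev, w, b)
print(w.size, a.shape)
```

The weight matrix has n1 x n2 entries, matching the number of interconnections between the two layers.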
21. Activation Functions:
Activation functions are used to introduce non linearity in the network.
It makes neural networks more compact.
Due to this nonlinearity neural networks can approximate any measurable
function.
Activation functions should be differentiable (at least almost everywhere, as ReLU is) so that gradients can be computed.
23. Sigmoid vs Tanh vs RELU
Sigmoid suffers from the problem of vanishing gradient.
Tanh has stronger gradients thus reducing the problem of vanishing gradient.
RELU reduces the likelihood of the vanishing gradient problem and also
introduces sparsity to the network, but it tends to blow up the activations since its output is unbounded.
In practice, tanh usually works the best.
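To make the comparison concrete, the sketch below evaluates the three activations' gradients at a few sample points (the points themselves are arbitrary). It uses the standard derivative formulas, which are assumed here rather than taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid_grad(z))        # largest at z = 0, where it equals 0.25
print(tanh_grad(z))           # largest at z = 0, where it equals 1.0
print(relu(z), relu_grad(z))  # zero for negative inputs, unbounded for positive ones
```

The sigmoid gradient never exceeds 0.25, so multiplying many of them across layers shrinks the signal quickly; ReLU passes a gradient of 1 for every positive input.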
24. Formulating a Cost Function:
What is the output of the network?
$a^L(w, b, x_i)$
Actual Output - $y_i$
Cost Incurred on 1 Data point - $C_i(a^L_i, y_i)$
Total Cost - Sum of individual costs over all data points
$\sum_i C_i$
Training Problem - $\min_{w,b} \sum_i C_i$
Optimisation problem - Find best combination of w, b.
25. Different Cost functions:
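Quadratic Cost - $C = \frac{1}{n}\sum_i (a^L_i - y_i)^2$
Cross Entropy - $C = -\sum_i \left[\, y_i \ln a^L_i + (1-y_i)\ln(1-a^L_i) \,\right]$ for a binary output; with several classes j, $C = -\sum_i \sum_j y_{ij} \ln a^L_{ij}$
Exponential - $C = \tau \exp\left(\frac{1}{\tau}\sum_i (a^L_i - y_i)^2\right)$
Kullback - Leibler Divergence - $D_{KL}(P \parallel Q) = D_{KL}(y \parallel a^L) = \sum_i y_i \ln\frac{y_i}{a^L_i}$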
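A hedged sketch of three of these costs, written for a batch of outputs with one row per data point. The function names and the sample outputs are hypothetical, and the multi-class form of cross entropy is the one implemented.

```python
import numpy as np

def quadratic_cost(a, y):
    """Sum of squared errors divided by the number of data points (rows)."""
    return float(np.sum((a - y) ** 2) / len(a))

def cross_entropy_cost(a, y):
    """Multi-class cross entropy: minus the sum of y_ij * ln(a_ij)."""
    return float(-np.sum(y * np.log(a)))

def kl_divergence(y, a):
    """Sum of y * ln(y / a); terms with y = 0 contribute 0 by convention."""
    mask = y > 0
    return float(np.sum(y[mask] * np.log(y[mask] / a[mask])))

# Hypothetical network outputs for 2 data points and 3 classes, with one-hot targets.
a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(quadratic_cost(a, y), cross_entropy_cost(a, y), kl_divergence(y, a))
```

With one-hot targets the KL divergence and the cross entropy coincide, because the entropy of a one-hot vector is zero.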
26. Cost Function: Properties
1. We must be able to write the cost function C as an average over cost
functions Cx of individual training examples x.
a. It allows the gradient of a single training example to be calculated.
2. The cost function should not be dependent on any activations of the network
other than the final output values aL.
a. This restriction is what allows us to backpropagate.
27. Minimising the Cost Function:
1. To minimise the cost function, gradient descent is used.
a. Why gradient descent and why not calculus?
b. What is gradient descent?
28. Gradient Descent - An Example -
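Consider a simple function $C(v_1, v_2)$.
For small changes $\Delta v_1$ and $\Delta v_2$, the cost function
changes approximately as follows: $\Delta C \approx \frac{\partial C}{\partial v_1}\Delta v_1 + \frac{\partial C}{\partial v_2}\Delta v_2$
$\Delta C \approx \nabla C \cdot \Delta v$; $\Delta C$ is the change in the cost, $\nabla C$ is the
gradient and $\Delta v$ is the change in the parameters.
If $\Delta v = -\eta \nabla C$ then $\Delta C \approx -\eta \lVert \nabla C \rVert^2$, i.e. for a small enough learning rate η the cost will
always decrease.
$v \to v' = v - \eta \nabla C$ - update rule.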
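The update rule can be tried on a concrete function. The choice C(v1, v2) = v1^2 + v2^2, the starting point and the learning rate are all assumptions picked to keep the example small.

```python
def cost(v1, v2):
    """A hypothetical bowl-shaped cost with its minimum at (0, 0)."""
    return v1 ** 2 + v2 ** 2

def gradient(v1, v2):
    """Partial derivatives of the cost with respect to v1 and v2."""
    return 2.0 * v1, 2.0 * v2

eta = 0.1            # learning rate
v1, v2 = 3.0, -4.0   # arbitrary starting point
costs = [cost(v1, v2)]
for _ in range(50):
    g1, g2 = gradient(v1, v2)
    v1, v2 = v1 - eta * g1, v2 - eta * g2  # v -> v' = v - eta * grad C
    costs.append(cost(v1, v2))
print(costs[0], costs[-1])
```

Every entry of costs is smaller than the one before it, which is the behaviour the update rule promises for a small enough learning rate.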
29. Stochastic Gradient Descent:
Recall cost function assumption 1: the cost function can be written as an average of the cost over individual
training examples.
To compute the gradient ∇C, you need to compute the gradient ∇Cx of each training input separately and
then average them: ∇C = (∑∇Cx) / n.
SGD - Calculate the gradient of a small mini batch of say m inputs and use that as an estimator of the
true gradient. Carry out updates using the gradient of the minibatch.
Carry out mini batch update for another randomly chosen batch and so on until the training inputs are
exhausted. This completes one epoch of learning.
Repeat for the specified number of epochs.
New hyperparameters - size of mini batch, number of epochs, learning rate.
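The epoch and mini-batch loop can be sketched on a simple linear model with a squared-error cost. The model, the data and the default hyperparameter values are illustrative assumptions; only the loop structure mirrors the description above.

```python
import numpy as np

def sgd(X, y, batch_size=10, epochs=50, learning_rate=0.1, seed=0):
    """Mini-batch SGD for a linear model with a mean squared error cost."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))  # new random batches every epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ beta - y[batch]
            # Gradient averaged over the mini-batch: an estimate of the true gradient.
            gradient = 2.0 * X[batch].T @ residual / len(batch)
            beta -= learning_rate * gradient
    return beta

# Hypothetical noiseless data generated from known parameters.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 2))
true_beta = np.array([2.0, -1.0])
y = X @ true_beta
beta = sgd(X, y)
print(beta)
```

One pass of the inner loop uses every training input exactly once, which is one epoch.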
30. But how do we calculate these gradients: Backpropagation
Let’s denote the input to any layer l as $z^l$:
$z^l = w^l a^{l-1} + b^l$
Output of layer l: $a^l = f(w^l a^{l-1} + b^l) = f(z^l)$
Let’s consider the output of the output layer L for an individual input $x_i$:
$a^L(w,b,x_i) = \sigma(z^L(w,b,x_i))$
So the cost is $C_i = (a^L - y_i)^2$.
So now we want to differentiate this cost to calculate the gradient of $C_i$.
Instead of differentiating w.r.t. a we will differentiate w.r.t. z.
31. Let’s do some math:
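Chain Rule:
$\frac{\partial (a^L(w,b,x_i) - y_i)^2}{\partial z^L} = 2\,(a^L(w,b,x_i) - y_i)\,\frac{\partial \sigma(z^L(w,b,x_i))}{\partial z^L}$
Sigmoid function: $\sigma(z) = \frac{1}{1+e^{-z}}$, with derivative $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$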
32. Gradient of Ci w.r.t. zL (output layer)
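$\frac{d(a^L(w,b,x_i) - y_i)^2}{dz^L} = 2\,(a^L(w,b,x_i) - y_i)\,\sigma(z^L(w,b,x_i))\,(1 - \sigma(z^L(w,b,x_i)))$
$z^L(w,b,x_i) = w^L a^{L-1}(w^{1:L-1}, b^{1:L-1}, x_i) + b^L$
Aim is to calculate: $\frac{d(a^L(w,b,x_i) - y_i)^2}{dw^L}$ and $\frac{d(a^L(w,b,x_i) - y_i)^2}{db^L}$
$\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{db^L}$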
33. Gradients of the Last Layer:
$\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{db^L}$
But we need the gradients of the entire network to update the weights and
biases of the network. How do the gradients of the last layer help?
Backpropagation: propagating the error backwards through the network, starting from the last layer.
The gradients of any layer can be written in terms of the gradients of the next layer.
Therefore, knowing the gradients of layer L, you can write the gradients of layer
L-1 in terms of gradients of layer L (which are known to you), gradients of
layer L-2 in terms of gradients of layer L-1 and so on.
34. Gradient of Ci w.r.t. zl
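$\frac{dC_i}{dw^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{dw^L}$ and $\frac{dC_i}{db^L} = \frac{dC_i}{dz^L}\,\frac{dz^L}{db^L}$
$z^l = w^l a^{l-1} + b^l$
$\frac{dC_i}{dw^l} = \frac{dC_i}{dz^l}\,\frac{dz^l}{dw^l}$ and $\frac{dC_i}{db^l} = \frac{dC_i}{dz^l}\,\frac{dz^l}{db^l}$
To find out - $\frac{dC_i}{dz^l}$ - gradient of $C_i$ w.r.t. the cumulative input of the layer l.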
35. Sorry!! A little more math:
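$\frac{dC_i}{dz^l} = \frac{dC_i}{da^l}\,\frac{da^l}{dz^l}$
$\frac{dC_i}{dz^l} = \sigma(z^l)\,(1-\sigma(z^l))\,\frac{dC_i}{da^l}$
$\frac{dC_i}{dz^l} = \sigma(z^l)\,(1-\sigma(z^l))\,\frac{dC_i}{dz^{l+1}}\,\frac{dz^{l+1}}{da^l}$
$z^{l+1} = w^{l+1} a^l + b^{l+1}$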
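The recursion can be checked numerically. The sketch below assumes a small fully connected network with sigmoid activations in every layer and the quadratic cost for one data point; the layer sizes, random parameters and function names are hypothetical. It compares one backpropagated gradient with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w @ activations[-1] + b))
    return activations

def cost(weights, biases, x, y):
    """Quadratic cost of a single data point."""
    return float(np.sum((forward(weights, biases, x)[-1] - y) ** 2))

def backprop(weights, biases, x, y):
    activations = forward(weights, biases, x)
    grads_w = [None] * len(weights)
    grads_b = [None] * len(weights)
    # Output layer: dC/dz = 2 (a - y) * sigmoid'(z), with sigmoid'(z) = a (1 - a).
    a_out = activations[-1]
    delta = 2.0 * (a_out - y) * a_out * (1.0 - a_out)
    for l in range(len(weights) - 1, -1, -1):
        grads_w[l] = np.outer(delta, activations[l])  # dz/dw is the previous activation
        grads_b[l] = delta                            # dz/db is 1
        if l > 0:
            a = activations[l]
            # Gradient of this layer written in terms of the next layer's gradient.
            delta = a * (1.0 - a) * (weights[l].T @ delta)
    return grads_w, grads_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical layer sizes
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.random(3), np.array([1.0, 0.0])
grads_w, grads_b = backprop(weights, biases, x, y)

# Finite-difference estimate for one weight of the first layer.
eps = 1e-6
bumped = [w.copy() for w in weights]
bumped[0][0, 0] += eps
numeric = (cost(bumped, biases, x, y) - cost(weights, biases, x, y)) / eps
print(grads_w[0][0, 0], numeric)
```

The two printed numbers should agree closely, which is a quick sanity check that the chain-rule recursion is implemented consistently.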
39. How are CNN’s different from ANN’s:
ConvNet architectures make the explicit assumption that the inputs are images.
Their architecture is different from feedforward neural networks to make them
more efficient by reducing the number of parameters to be learnt.
In ANN, if you have a 150x150x3 image, each neuron in the first hidden layer
will have 67500 weights to learn.
ConvNets arrange their neurons in 3 dimensions, and the neurons in a layer are only
connected to a small region of the layer before it.
40. ConvNets:
The neurons in the layers of ConvNet are
arranged in 3 dimensions: height, width,
depth.
Depth here is not the depth of the entire
network. It refers to the third dimension of
the layers and hence a third dimension of
the activation volumes.
In essence, a ConvNet is made of layers
which have a simple API - transform a 3-D
input volume to a 3-D output volume with
some differentiable function which may or
may not have parameters.
41. Layers of a ConvNet:
Input Layer - 28 x 28 x 1 for MNIST (grayscale)
Convolution Layer - The neurons are connected to small/local regions in the input.
ReLU - This is the activation layer in the network.
Pool - This layer will downsample on the width and the height but not on depth. It applies a fixed function
such as max() or mean() etc.
Fully Connected Layer - Just like the last layer in feedforward networks, this layer too will give us the
class scores arranged across the depth dimension.
Convolutional and fully connected layers have parameters, relu and pooling layers do not.
Convolutional, fully connected and pooling layers have additional hyperparameters too.
44. Convolutional (Conv) Layer
Imagine a flashlight shining over a small patch of the input image.
In machine learning, this flashlight is known as a filter and the
region it shines over is known as the receptive field/size of the
filter.
Filter:
Is an array of numbers also known as the
weights/parameters (learnable).
A very important dimension to note is the depth of the
filter. The depth must be equal to the depth of the input
volume.
This filter will now slide/convolve over the rest of the image,
performing element-wise multiplication and summing the results,
returning a single number at each position.
After convolving over the entire image, you will get an activation
map which is 2-D. For a 32x32x3 dimension input, using a
45. Filters:
You can increase the number of filters on the input
volume to increase the number of activation maps
you get. Each filter gives you an activation map.
Each activation map you get tries to learn a different
aspect of the image such as an edge, a blotch of
colour etc.
If on a 32x32x3 image volume, you implement 12 filters
of size 5x5x3, then the first convolutional layer will
have dimension 28x28x12 under certain conditions.
Basically, the more filters you use, the deeper the output volume
and the more aspects of the input are captured.
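The 32x32x3 example can be reproduced with a naive sliding-filter sketch. It assumes stride 1, no zero-padding and no bias, uses random numbers for the image and the filters, and (like most deep learning code) does not flip the filter, so it is strictly a cross-correlation. The function name and array layout are choices made for this sketch.

```python
import numpy as np

def convolve(volume, filters, stride=1):
    """volume has shape (H, W, D); filters has shape (K, F, F, D)."""
    height, width, depth = volume.shape
    n_filters, size, _, filter_depth = filters.shape
    if filter_depth != depth:
        raise ValueError("filter depth must equal the depth of the input volume")
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.zeros((out_h, out_w, n_filters))
    for k in range(n_filters):
        for i in range(out_h):
            for j in range(out_w):
                r, c = i * stride, j * stride
                patch = volume[r:r + size, c:c + size, :]
                # Element-wise multiplication, summed up into a single number.
                out[i, j, k] = np.sum(patch * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))               # hypothetical input volume
filters = rng.standard_normal((12, 5, 5, 3))  # 12 filters of size 5x5x3
maps = convolve(image, filters)
print(maps.shape)  # (28, 28, 12)
```

Each of the 12 filters contributes one 28x28 activation map, and stacking them gives the depth of the output volume.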
Now, let’s talk about the certain conditions
46. Filter - Hyperparameters:
The size of the filter is a hyperparameter.
When you apply a filter to an input volume, the output volume of the filter depends on 3 hyperparameters
- fibre/depth, stride and zero-padding.
Fibre/Depth - This refers to the number of filters applied to the input volume, each learning to recognise
something different in the input.
Stride - It refers to the pace at which the filter moves through the input volume. If stride is 1, we move
the filters one pixel at a time.
Zero-padding - To control the size of the output volume, the input volume can be padded with zeroes
around the border.
Given these hyperparameters the size of the output volume is given as: ((W-F+2P)/S) +1; W is the size
of input, F is the filter size, P is the amount of zero-padding and S is the stride.
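The formula is easy to check with a few numbers. The helper name is hypothetical, and raising an error when the filter does not fit evenly is a design choice of this sketch rather than something the slides specify.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size ((W - F + 2P) / S) + 1 along one dimension."""
    span = input_size - filter_size + 2 * padding
    if span % stride != 0:
        raise ValueError("the filter does not fit evenly with this stride and padding")
    return span // stride + 1

print(conv_output_size(32, 5))             # 28
print(conv_output_size(32, 5, padding=2))  # 32
print(conv_output_size(28, 2, stride=2))   # 14
```

The second call shows how zero-padding of 2 keeps the output the same size as a 32-wide input for a 5-wide filter.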
47. Number of Parameters:
Consider the output volume of 28x28x12 of the first convolutional layer,
which was achieved by applying a filter of 5x5x3 on the input of
32x32x3.
Number of neurons in the layer = 28*28*12 = 9408
Each neuron has 5*5*3 = 75 weights and 1 bias i.e. 76 parameters.
Overall number of parameters of the first layer = 9408*76 = 715,008
48. Parameter Sharing:
Simple Assumption: For each activation map or depth slice, constrain the
neurons to use the same weights and bias. Therefore, for the last example,
the conv layer will have 12 unique sets of weights (one 5x5x3 filter per depth slice) and 12 biases.
Overall number of weights in the first layer: 12*5*5*3 = 900
Total parameters = 900 +12 = 912
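The two parameter counts can be reproduced with a few lines of arithmetic, using the same 28x28x12 output volume and 5x5x3 filters as the example (the variable names are arbitrary).

```python
out_height, out_width, n_filters = 28, 28, 12
weights_per_filter = 5 * 5 * 3  # 75 weights in one 5x5x3 filter

# Without sharing: every neuron owns its own 75 weights and 1 bias.
neurons = out_height * out_width * n_filters
without_sharing = neurons * (weights_per_filter + 1)

# With sharing: one set of weights and one bias per depth slice.
with_sharing = n_filters * weights_per_filter + n_filters

print(neurons, without_sharing, with_sharing)  # 9408 715008 912
```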
49. Conv Layer Summary:
Accepts an input volume of W1 x H1 x D1
Needs 4 Hyperparameters:
Number of filters = K
Filter/receptive field size = F
Stride = S
Zero-Padding = P
The output is of size W2 x H2 x D2 where;
W2 = ((W1-F+2P)/S) +1
H2 = ((H1-F+2P)/S) +1
D2 = K
50. ReLU (Rectified Linear Units) Layer:
Just like in feedforward neural networks, the purpose of an activation layer in
Convnet is to introduce nonlinearity.
You can also use activations like tanh or sigmoid but ReLU works better in
practice. Why?
It is cheap to compute and produces sparse activations, thus enabling the network to learn
faster.
Also, it helps us reduce the problem of vanishing gradients.
51. Pooling Layers:
Pooling layer is also known as the downsampling layer.
Use - Progressively reduce the spatial size of its input, thus reducing the
number of parameters in the network and controlling overfitting.
The pooling layer works on each depth slice independently, resizes it using the
mathematical operation specified such as MAX or Avg. etc.
Most common form of pooling is to apply a 2x2 filter with a stride of 2 on the
input volume.
The depth dimension will remain unchanged.
52. Pooling Layer:
A 2x2 filter with stride
as 2 applying a
MAX function.
As you can see, the
number of
activations is
reduced by 75%.
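A small sketch of 2x2 max pooling with stride 2, applied to a made-up 4x4x1 volume. The function name and the (height, width, depth) layout are assumptions of this example.

```python
import numpy as np

def max_pool(volume, size=2, stride=2):
    """Downsample height and width with a max filter; depth is left unchanged."""
    height, width, depth = volume.shape
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    out = np.zeros((out_h, out_w, depth))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = volume[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))  # one maximum per depth slice
    return out

volume = np.arange(16, dtype=float).reshape(4, 4, 1)  # hypothetical input
pooled = max_pool(volume)
print(pooled[:, :, 0])            # [[ 5.  7.] [13. 15.]]
print(pooled.size / volume.size)  # 0.25
```

Only one value out of every four survives, which is where the 75% reduction comes from.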
53. Fully Connected Layer, Dropout and Normalisation:
Fully connected layers in a Convnet are exactly the same as the layers in
feedforward neural networks. The last layer in a Convnet is a fully connected
softmax layer.
Dropout Layers: These layers have a very specific function in convnet which is
to avoid overfitting.
A random fraction of activations is ‘dropped out’ or set to 0 during the forward
pass by this layer. This makes sure that the network is not simply memorising the
training data. This layer is only used during training time.
Normalisation - These layers are usually added after pooling layers to normalise
the output of the pooling layer.
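The dropout behaviour described above can be sketched as follows. Rescaling the surviving activations by 1 / (1 - p), often called inverted dropout, is an assumption of this sketch and is not stated in the slides; the function name and drop probability are also illustrative.

```python
import numpy as np

def dropout(activations, drop_probability, rng, training=True):
    """Zero a random fraction of activations during training only."""
    if not training:
        return activations  # the layer does nothing at test time
    keep = rng.random(activations.shape) >= drop_probability
    # Assumed inverted-dropout scaling keeps the expected activation unchanged.
    return activations * keep / (1.0 - drop_probability)

rng = np.random.default_rng(0)
a = np.ones(1000)  # hypothetical activations
dropped = dropout(a, 0.5, rng)
print(dropped.mean())
print(np.array_equal(dropout(a, 0.5, rng, training=False), a))
```

With a drop probability of 0.5, each activation ends up either 0 or doubled, so the mean stays close to the original value.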
54. Transfer Learning:
Transfer learning is the process of taking a pre-trained model whose weights
and parameters have been trained on a large dataset, and fine-tuning it
according to your own data.
You remove the last layer of the network and replace it with your own classifier,
and keep the weights and biases of the rest of the network constant.
The idea is that the pre-trained model will act as a feature extractor.
Minimising the cost - with calculus you need to differentiate the cost w.r.t. the parameters. With gradient descent you calculate small changes in the cost function for small changes in the values of the parameters, and you try to update the parameters in such a way that each change decreases the value of the cost function, basically taking it towards a minimum, which is the goal.
So basically to find Y, we are performing a dot product between the Beta’s i.e. the coefficients/weights and the input data and adding it up.
The aim of any ML algorithm in predictive analytics is to predict something; to get the best possible predictions, you define and minimise a cost function.
2. And on the basis of that probability you can classify if the data belongs to positive or negative class.
4. If the probability is greater than 0.5 then you classify the input to positive class, otherwise to the negative class.
One of the nice properties of logistic regression is that the logistic cost function (or max-entropy) is convex,
This is like a single layer perceptron. Explain the equations of perceptrons and the concept of firing.