Classification of Handwritten Digits - Artificial and Convolutional Networks
Getting the Code:
You can download the code used in the session from https://github.com/Manuyashchaudhary/AnnSession.git
For setting up the environment, you can use the Docker command: docker pull manuyash/annsession
Structure of the Session:
1. Linear, Logistic Regression & Multinomial Logistic Regression - Quick Review
2. Structure of ANNs
3. Learning in ANNs
4. Implementation of ANN on MNIST
5. Layers of CNN - Convolutional, ReLU, Pooling
6. Parameter Sharing
7. Implementation on MNIST
Data:
Data is the key to solving any machine learning or artificial intelligence problem.
Divide the dataset into training and testing data.
The algorithm learns on the training data and is evaluated on the testing data.
MNIST:
Images have a uniform colour scheme (grayscale) and size.
Images are 28x28 pixels.
Each pixel takes a value from 0 to 255.
Properties: fixed size, constant background.
Image as Input:
This is a 14x14 representation of pixel intensities.
The numbers represent the fraction of each pixel covered by the digit.
Starting from the basics: Linear Regression
To predict values of a variable (Y) which depends on one or more independent variables (X).
Fit a line to your training data that generalises the data well.
y = mx + c; m is the slope and c is the intercept.
Fitting a line:
For regression, y = Xβ.
To find the line of best fit, find the β's such that the line generalises the data well.
Cost/Error = mean squared error.
Minimise the error to get the line of best fit.
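As a rough illustration of this idea (not taken from the session code), here is a minimal NumPy sketch that fits m and c by gradient descent on the mean squared error; the toy data, learning rate and iteration count are made up:

import numpy as np

# Toy data: y is roughly 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)

m, c = 0.0, 0.0      # slope and intercept, initialised to zero
lr = 0.01            # learning rate
for _ in range(5000):
    y_hat = m * x + c
    error = y_hat - y
    # Gradients of the mean squared error w.r.t. m and c
    grad_m = 2 * np.mean(error * x)
    grad_c = 2 * np.mean(error)
    m -= lr * grad_m
    c -= lr * grad_c

print(m, c)          # should end up close to 3 and 2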
Gradient Descent Intuition:
Key Takeaways:
You train an algorithm on training data.
Training means finding the best possible combination of parameters.
There is a cost function.
The cost function can be minimised by gradient descent.
To predict, perform a dot product between the learned parameters and the unseen data to get the output.
Logistic Regression:
1. It is used for binary classification.
2. It outputs the probability of a class.
3. It is based on the sigmoid function.
Sigmoid Function:
Linear regression: y(x) = βx
Logistic regression: log(p / (1 - p)) = βx
y(x) = p = 1 / (1 + e^(-βx))
Cost function: the cross-entropy (log) loss, C = -[ y ln p + (1 - y) ln(1 - p) ].
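A minimal sketch of a logistic-regression prediction, assuming illustrative (made-up) values for β and the bias:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters: one weight per feature plus a bias
beta = np.array([0.8, -1.2])
bias = 0.3

x = np.array([2.0, 1.5])             # a single input
p = sigmoid(np.dot(beta, x) + bias)  # probability of the positive class
label = int(p > 0.5)                 # classify using a 0.5 threshold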
Key Takeaways:
Perform dot product between parameters and input; apply sigmoid function.
There is a cost function.
The cost function can be minimised for the parameters using gradient descent.
What happens when the variable being predicted has more than 2 classes?
Softmax:
1. Multinomial logistic regression.
2. Sigmoid → Softmax.
3. For each output/category, we compute a weighted sum of the x's, add a bias, and then apply softmax.
4. Softmax is defined as softmax(x_j) = e^(x_j) / Σ_k e^(x_k), where x_j is the weighted-sum input to the j-th output neuron.
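A small NumPy sketch of the softmax computation (the scores are made-up values):

import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # weighted sums for 3 classes
print(softmax(scores))               # probabilities summing to 1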
Working of Softmax:
Softmax to Neural Networks:
Neural networks can be viewed as many logistic regression computations stacked in parallel and in series with each other.
Softmax is applied to the last layer in neural networks.
You will understand these points as we go forward.
Structure of Artificial Neural Networks:
1. Input, output and hidden layers.
2. The layers are arranged sequentially and each layer is made up of multiple neurons.
3. Number of neurons in the input layer = length of the input vector.
4. Number of neurons in the output layer = number of classes in the dependent (target) variable.
Assumptions, Parameters and Hyperparameters
Neurons within a layer do not interact with each other.
The layers are densely connected.
For every neuron there is a bias and for every interaction there is a weight.
The parameters are the weights and biases of the network, which are learned during training.
Working of a Neuron:
The input to a neuron is the weighted sum of its inputs plus a bias.
An activation function is used to introduce non-linearity into the network.
If the output is greater than a threshold, the neuron fires; otherwise it does not.
Message passing in NNs
Consider two hidden layers 'l-1' and 'l', with n1 and n2 neurons respectively.
The total number of interconnections is n1 × n2.
The output of layer l-1 is a^(l-1) = (a^(l-1)_1, a^(l-1)_2, ..., a^(l-1)_n1).
The output of layer l-1 is the input to layer l.
The input to layer l is w · a^(l-1), where w is an n2 × n1 matrix; add the bias to this.
The output of layer l is the activation function applied to the input of layer l.
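A minimal NumPy sketch of this message passing for one layer, with made-up sizes and random weights:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n1, n2 = 4, 3                 # neurons in layers l-1 and l
a_prev = np.random.rand(n1)   # output of layer l-1
W = np.random.randn(n2, n1)   # n2 x n1 weight matrix
b = np.random.randn(n2)       # one bias per neuron in layer l

z = W @ a_prev + b            # input to layer l
a = sigmoid(z)                # output of layer l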
Activation Functions:
Activation functions are used to introduce non-linearity into the network.
They make neural networks more compact.
Due to this non-linearity, neural networks can approximate any measurable function.
Activation functions should be smooth.
Activation Functions:
Sigmoid(x) = 1 / (1 + e^(-x))
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
ReLU(a) = max(0, a)
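The same three functions as a small NumPy sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(z):
    # Equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(a):
    return np.maximum(0, a)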
Sigmoid vs Tanh vs ReLU
Sigmoid suffers from the vanishing gradient problem.
Tanh has stronger gradients, which reduces the vanishing gradient problem.
ReLU reduces the likelihood of vanishing gradients and also introduces sparsity into the network, but it can blow up the activations (they are unbounded above).
In practice, tanh usually works better than sigmoid, and ReLU is the most common choice in deep networks.
Formulating a Cost Function:
What is the output of the network? a^L(w, b, x_i)
Actual output: y_i
Cost incurred on one data point: C_i(a^L_i, y_i)
Total cost: sum of the individual costs over all data points, ∑ C_i
Training problem: minimise ∑ C_i over w, b
Optimisation problem: find the best combination of w, b.
Different Cost functions:
Quadratic cost: C = ∑_i (a^L_i - y_i)^2 / n
Cross entropy: C = -∑_i [ y_i ln a^L_i + (1 - y_i) ln(1 - a^L_i) ] = -∑_i ∑_j y_ij ln(a^L_ij)
Exponential: C = τ exp( ∑_i (a^L_i - y_i)^2 / τ )
Kullback-Leibler divergence: D_KL(P ∥ Q) = D_KL(y_i ∥ a^L_i) = ∑_i y_i ln(y_i / a^L_i)
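A small NumPy sketch of the two most commonly used of these costs, vectorised over all data points (function names are illustrative):

import numpy as np

def quadratic_cost(a, y):
    # Mean squared error over n examples
    return np.mean((a - y) ** 2)

def cross_entropy_cost(a, y, eps=1e-12):
    a = np.clip(a, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))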
Cost Function: Properties
1. We must be able to write the cost function C as an average over cost functions C_x of individual training examples x.
a. This allows the gradient of a single training example to be calculated.
2. The cost function should not depend on any activations of the network other than the final output values a^L.
a. This restriction is what allows us to backpropagate.
Minimising the Cost Function:
1. To minimise the cost function, gradient descent is used.
a. Why gradient descent and why not calculus?
b. What is gradient descent?
Gradient Descent - An Example:
Consider a simple function C(v1, v2).
For small changes Δv1 and Δv2, the cost changes as ΔC ≈ (∂C/∂v1) Δv1 + (∂C/∂v2) Δv2.
ΔC ≈ ∇C · Δv; ΔC is the change in the cost, ∇C is the gradient and Δv is the change in the parameters.
If Δv = -η∇C then ΔC ≈ -η‖∇C‖^2, i.e. the cost always decreases.
Update rule: v → v' = v - η∇C.
Stochastic Gradient Descent:
Recall cost-function property 1: the cost can be written as an average of the cost over individual training examples.
To compute the gradient ∇C exactly, you need to compute the gradient ∇C_x of each training input separately and then average them: ∇C = (∑_x ∇C_x) / n.
SGD: calculate the gradient over a small mini-batch of, say, m inputs and use it as an estimator of the true gradient. Carry out the update using the gradient of the mini-batch.
Carry out a mini-batch update for another randomly chosen batch, and so on until the training inputs are exhausted. This completes one epoch of learning.
Repeat for the specified number of epochs.
New hyperparameters: mini-batch size, number of epochs, learning rate.
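A schematic sketch of the mini-batch SGD loop described above; grad_fn and the flat parameter vector are placeholders, not the session's actual code:

import numpy as np

def sgd(params, grad_fn, data, epochs, batch_size, lr):
    """Generic mini-batch SGD loop; grad_fn(params, batch) returns the
    gradient of the cost averaged over the mini-batch."""
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)                      # new random batches each epoch
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            params = params - lr * grad_fn(params, batch)
    return params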
But how do we calculate these gradients: Backpropagation
Denote the input to any layer l as z^l, so z^l = w^l a^(l-1) + b^l.
Output of layer l: a^l = f(w^l a^(l-1) + b^l) = f(z^l).
Consider the output of the final layer L for an individual input x_i: a^L(w, b, x_i) = σ(z^L(w, b, x_i)).
So the cost is C_i = (a^L - y_i)^2.
We now want to differentiate this cost to calculate the gradient of C_i.
Instead of differentiating w.r.t. a, we will differentiate w.r.t. z.
Let’s do some math:
Chain rule:
∂(a^L(w,b,x_i) - y_i)^2 / ∂z^L = 2 (a^L(w,b,x_i) - y_i) · ∂σ(z^L(w,b,x_i)) / ∂z^L
Sigmoid derivative: σ'(z) = σ(z)(1 - σ(z))
Gradient of C_i w.r.t. z^L (output layer):
d(a^L(w,b,x_i) - y_i)^2 / dz^L = 2 (a^L(w,b,x_i) - y_i) σ(z^L(w,b,x_i)) (1 - σ(z^L(w,b,x_i)))
z^L(w,b,x_i) = w^L a^(L-1)(w^(1:L-1), b^(1:L-1), x_i) + b^L
Aim: calculate d(a^L - y_i)^2/dw^L and d(a^L - y_i)^2/db^L
dC_i/dw^L = dC_i/dz^L · dz^L/dw^L and dC_i/db^L = dC_i/dz^L · dz^L/db^L
Gradients of the Last Layer:
dC_i/dw^L = dC_i/dz^L · dz^L/dw^L and dC_i/db^L = dC_i/dz^L · dz^L/db^L
But we need the gradients of the entire network to update its weights and biases. How do the gradients of the last layer help?
Backpropagation: propagating the error backwards from the last layer
The gradients of any layer can be written in terms of the gradients of the next layer.
Therefore, knowing the gradients of layer L, you can write the gradients of layer L-1 in terms of the gradients of layer L (which are known to you), the gradients of layer L-2 in terms of the gradients of layer L-1, and so on.
Gradient of C_i w.r.t. z^l
dC_i/dw^L = dC_i/dz^L · dz^L/dw^L and dC_i/db^L = dC_i/dz^L · dz^L/db^L
z^l = w^l a^(l-1) + b^l
dC_i/dw^l = dC_i/dz^l · dz^l/dw^l and dC_i/db^l = dC_i/dz^l · dz^l/db^l
To find: dC_i/dz^l, the gradient of C_i w.r.t. the cumulative input of layer l.
Sorry!! A little more math:
dC_i/dz^l = dC_i/da^l · da^l/dz^l
dC_i/dz^l = σ(z^l)(1 - σ(z^l)) · dC_i/da^l
dC_i/dz^l = σ(z^l)(1 - σ(z^l)) · dC_i/dz^(l+1) · dz^(l+1)/da^l
z^(l+1) = w^(l+1) a^l + b^(l+1)
dC_i/dz^l = σ(z^l)(1 - σ(z^l)) · w^(l+1) · dC_i/dz^(l+1)
dC_i/dw^l = dC_i/dz^l · dz^l/dw^l
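A compact NumPy sketch of these backpropagation equations for a two-layer network with sigmoid activations and quadratic cost (shapes and names are illustrative):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)       # a2 plays the role of a^L
    # Output layer: dC/dz2 for the quadratic cost C = (a2 - y)^2
    delta2 = 2 * (a2 - y) * a2 * (1 - a2)
    # Propagate back: dC/dz1 = sigma'(z1) * W2^T * delta2
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    # Gradients w.r.t. the weights and biases of each layer
    dW2 = np.outer(delta2, a1); db2 = delta2
    dW1 = np.outer(delta1, x);  db1 = delta1
    return dW1, db1, dW2, db2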
ANN on MNIST:
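The session's notebook is in the repository linked earlier; as a rough idea of what an ANN on MNIST looks like, here is a minimal Keras sketch (layer sizes and hyperparameters are illustrative, not necessarily those used in the session):

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0   # flatten 28x28 images, scale to [0, 1]
x_test = x_test.reshape(-1, 784) / 255.0

model = keras.Sequential([
    keras.layers.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),   # one output per digit class
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=32,
          validation_data=(x_test, y_test))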
Convolutional Neural Networks
How are CNNs different from ANNs?
ConvNet architectures make the explicit assumption that the inputs are images.
Their architecture differs from feedforward neural networks in order to make them more efficient by reducing the number of parameters to be learnt.
In an ANN, if you have a 150x150x3 image, each neuron in the first hidden layer has 67,500 weights to learn.
In ConvNets the neurons are arranged in 3-D volumes, and the neurons in a layer are only connected to a small region of the layer before it.
ConvNets:
The neurons in the layers of a ConvNet are arranged in 3 dimensions: height, width, depth.
Depth here is not the depth of the entire network. It refers to the third dimension of the layers, and hence the third dimension of the activation volumes.
In essence, a ConvNet is made of layers with a simple API: transform a 3-D input volume to a 3-D output volume with some differentiable function, which may or may not have parameters.
Layers of a ConvNet:
Input Layer - 28 x 28 x 1 for MNIST (grayscale)
Convolution Layer - The neurons are connected to small/local regions in the input.
ReLU - This is the activation layer in the network.
Pool - This layer downsamples along the width and the height but not the depth. It applies a fixed function such as max() or mean().
Fully Connected Layer - Just like the last layer in feedforward networks, this layer gives us the class scores, arranged along the depth dimension.
Convolutional and fully connected layers have parameters; ReLU and pooling layers do not.
Convolutional, fully connected and pooling layers have additional hyperparameters too.
Architecture:
Convolutional (Conv) Layer
In machine learning, this "flashlight" (from the usual analogy of shining a light over a patch of the image) is known as a filter, and the region it shines over is known as the receptive field (the size of the filter).
Filter:
A filter is an array of numbers, also known as the weights/parameters (learnable).
A very important dimension to note is the depth of the filter: it must equal the depth of the input volume.
The filter slides (convolves) over the rest of the image, performing an element-wise multiplication, summing it up and returning a single number at each position.
After convolving over the entire image, you get a 2-D activation map. For a 32x32x3 input, using a 5x5x3 filter with stride 1 and no padding, the activation map is 28x28.
Filters:
You can increase the number of filters applied to the input volume to increase the number of activation maps you get: each filter gives you one activation map.
Each activation map tries to learn a different aspect of the image, such as an edge, a blotch of colour, etc.
If, on a 32x32x3 image volume, you apply 12 filters of size 5x5x3, then the first convolutional layer will have dimension 28x28x12, under certain conditions.
Broadly, the more filters you use, the more information about the input is preserved across the depth dimension.
Now, let's talk about those certain conditions.
Filter - Hyperparameters:
The size of the filter is a hyperparameter.
When you apply filters to an input volume, the size of the output volume depends on 3 hyperparameters: fibre/depth, stride and zero-padding.
Fibre/Depth - This refers to the number of filters applied to the input volume, each learning to recognise something different in the input.
Stride - The pace at which the filter moves across the input volume. If the stride is 1, we move the filter one pixel at a time.
Zero-padding - To control the size of the output volume, the input volume can be padded with zeroes around the border.
Given these hyperparameters, the spatial size of the output volume is ((W - F + 2P)/S) + 1, where W is the input size, F the filter size, S the stride and P the zero-padding.
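A small helper that applies this formula (hypothetical function name):

def conv_output_size(W, F, S, P):
    """Spatial output size of a conv layer: ((W - F + 2P) / S) + 1."""
    assert (W - F + 2 * P) % S == 0, "hyperparameters do not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=32, F=5, S=1, P=0))  # 28, matching the 32x32x3 / 5x5 example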
Number of Parameters:
Consider the 28x28x12 output volume of the first convolutional layer, obtained by applying 12 filters of size 5x5x3 to a 32x32x3 input.
Number of neurons in the layer = 28 × 28 × 12 = 9,408.
Each neuron has 5 × 5 × 3 = 75 weights and 1 bias, i.e. 76 parameters.
Without parameter sharing, the number of parameters in the first layer = 9,408 × 76 = 715,008.
Parameter Sharing:
Simple assumption: within each activation map (depth slice), constrain the neurons to use the same weights and bias. Therefore, for the last example, the conv layer has 12 unique sets of weights (one per depth slice) and 12 biases.
Overall number of weights in the first layer: 12 × 5 × 5 × 3 = 900.
Total parameters = 900 + 12 = 912.
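A quick arithmetic check of both counts:

filters, fh, fw, depth = 12, 5, 5, 3
out_h, out_w = 28, 28

without_sharing = (out_h * out_w * filters) * (fh * fw * depth + 1)  # 715,008
with_sharing = filters * (fh * fw * depth) + filters                 # 912
print(without_sharing, with_sharing)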
Conv Layer Summary:
Accepts an input volume of W1 x H1 x D1
Needs 4 Hyperparameters:
Number of filters = K
Filter/receptive field size = F
Stride = S
Zero-Padding = P
The output is of size W2 x H2 x D2, where:
W2 = ((W1 - F + 2P)/S) + 1
H2 = ((H1 - F + 2P)/S) + 1
D2 = K
ReLU (Rectified Linear Units) Layer:
Just as in feedforward neural networks, the purpose of an activation layer in a ConvNet is to introduce non-linearity.
You can also use activations like tanh or sigmoid, but ReLU works better in practice. Why?
It is computationally very cheap, enabling the network to learn faster.
It also helps reduce the problem of vanishing gradients.
Pooling Layers:
The pooling layer is also known as the downsampling layer.
Use - Progressively reduce the spatial size of its input, thus reducing the number of parameters and computation in the network and controlling overfitting.
The pooling layer works on each depth slice independently and resizes it using a fixed operation such as max or average.
The most common form of pooling is to apply a 2x2 filter with a stride of 2 to the input volume.
The depth dimension remains unchanged.
Pooling Layer:
A 2x2 filter with stride 2 applying a MAX function.
As you can see, the number of activations is reduced by 75%.
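A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice (assuming even height and width):

import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a single H x W depth slice."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))   # 2x2 output: 4x fewer activations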
Fully Connected Layer, Dropout and Normalisation:
Fully connected layers in a ConvNet are exactly the same as the layers in feedforward neural networks. The last layer in a ConvNet is a fully connected softmax layer.
Dropout layers: these layers have a very specific function in a ConvNet, which is to avoid overfitting.
A random fraction of activations is 'dropped out', i.e. set to 0, during the forward pass by this layer. This makes sure the network is not simply memorising the training data. This layer is only used at training time.
Normalisation - These layers are usually added after pooling layers to normalise the output of the pooling layer.
Transfer Learning:
Transfer learning is the process of taking a pre-trained model, whose weights and parameters have been trained on a large dataset, and fine-tuning it on your own data.
You remove the last layer of the network and replace it with your own classifier, keeping the weights and biases of the rest of the network fixed.
The idea is that the pre-trained model acts as a feature extractor.
ConvNet on MNIST:
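Again, the session's notebook is in the linked repository; as a rough sketch of a ConvNet on MNIST using the layers described above, here is a minimal Keras example (filter counts and hyperparameters are illustrative):

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0   # add the depth dimension
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0

model = keras.Sequential([
    keras.layers.Input(shape=(28, 28, 1)),
    keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # conv + ReLU
    keras.layers.MaxPooling2D(pool_size=2),                     # 2x2 pooling, stride 2
    keras.layers.Flatten(),
    keras.layers.Dropout(0.25),                                 # regularisation during training
    keras.layers.Dense(10, activation="softmax"),               # fully connected softmax output
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=64,
          validation_data=(x_test, y_test))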
Editor's Notes
1. Minimising the cost - with gradient descent you calculate small changes in the cost function for small changes in the values of the parameters, and you update the parameters so that each change decreases the value of the cost function, taking it towards a minimum, which is the goal.
2. Minimising the cost - with calculus you would differentiate the cost w.r.t. the parameters directly. With gradient descent you calculate small changes in the cost function for small changes in the parameter values and update the parameters so that the cost keeps decreasing, moving it towards a minimum, which is the goal. So, to find Y, we perform a dot product between the betas, i.e. the coefficients/weights, and the input data and add it up.
3. The aim of any ML algorithm in predictive analytics is to predict something, and to get the best possible predictions you define and minimise a cost function.
4. 2. On the basis of that probability you can classify whether the data belongs to the positive or negative class. 4. If the probability is greater than 0.5, classify the input as the positive class; otherwise as the negative class. One of the nice properties of logistic regression is that the logistic cost function (or max-entropy) is convex.
5. This is like a single-layer perceptron. Explain the equations of perceptrons and the concept of firing.