Deep learning from a novice perspective

Deep learning from a novice perspective and recent innovations from KGPians
Anirban Santara
Doctoral Research Fellow
Department of CSE, IIT Kharagpur
bit.do/AnirbanSantara

Deep Learning
Just a kind of Machine Learning
3 main tasks:
• Classification
• Regression
• Clustering

CLASSIFICATION
Classes: Pandas, Dogs, Cats
Rather: P(class | input)?

REGRESSION
Independent variable (feature)
Dependent variable (target attribute)

CLUSTERING
Attribute 1
Attribute 2

The methodology:
1. Design a hypothesis function: h(y|x,θ)
   y: target attribute, x: input, θ: parameters of the learning machine
2. Keep improving the hypothesis until its predictions are good enough
Well, how bad is your hypothesis?
In case of regression:
A very common measure is mean squared error:
$$E = \sum_{\text{all training examples}} \left| y_{\text{desired}} - y_{\text{as per hypothesis}} \right|^2$$
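A minimal sketch of this error measure in Python. It divides the sum of squared differences by the number of examples, which is what "mean" implies; that normalisation is an assumption, since the formula above shows only the sum. The function name and the sample values are hypothetical.

```python
def mean_squared_error(desired, predicted):
    """Average of the squared differences between desired and predicted targets."""
    if len(desired) != len(predicted):
        raise ValueError("desired and predicted must have the same length")
    squared_errors = [(d - p) ** 2 for d, p in zip(desired, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical targets and hypothesis outputs, for illustration only.
y_desired = [1.0, 2.0, 3.0]
y_predicted = [1.0, 2.5, 2.0]
mse = mean_squared_error(y_desired, y_predicted)
```

A perfect hypothesis gives an error of exactly zero; the worse the predictions, the larger the value.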
In classification problems:
In one-hot classification frameworks, we often use mean squared error.
However, often we ask for the probabilities of occurrence of the different classes for a
given input ( Pr(class|X) ). In that case we use the K-L divergence between the observed
(p(output classes)) and predicted (q(output classes)) distributions as the measure of
error. Because the entropy of p does not depend on the model, minimizing it is equivalent
to minimizing the cross entropy, so this is sometimes referred to as the cross entropy
error criterion.
$$KL(P \,\|\, Q) = \sum_{\text{all training examples},\, i} p_i \log \frac{p_i}{q_i}$$
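A small sketch of the K-L divergence for one training example, next to the cross entropy it is usually replaced by. The function names and the example distributions are hypothetical; terms with p_i = 0 are skipped, following the usual convention 0 · log 0 = 0.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:  # a term with p_i = 0 contributes nothing
            total += p_i * math.log(p_i / q_i)
    return total


def cross_entropy(p, q):
    """H(P, Q) = -sum_i p_i * log(q_i)."""
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)


# Hypothetical example: a one-hot observed distribution and a predicted one.
observed = [1.0, 0.0, 0.0]
predicted = [0.7, 0.2, 0.1]
kl = kl_divergence(observed, predicted)
ce = cross_entropy(observed, predicted)
```

For a one-hot observed distribution the entropy of p is zero, so the two quantities coincide.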
Clustering uses a plethora of criteria like:
• Entropy of a cluster
• Maximum distance between 2 neighbors in a cluster
• and a lot more

Now it's time to rectify the machine and improve: Learning
We perform “gradient descent” along the “error surface” in
the “parameter space”:
$$\Delta\,\text{parameter} = -\,\text{learning\_rate} \cdot \nabla_{\text{parameter}}\, \text{error\_function}$$
$$\text{parameter} \leftarrow \text{parameter} + \Delta\,\text{parameter}$$
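The update rule can be tried on a toy problem. The sketch below fits a line y = w·x + b by gradient descent on the mean squared error; the data, the learning rate, and the number of steps are hypothetical choices made only for illustration.

```python
def gradient_descent_line_fit(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to each parameter.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # delta_parameter = -learning_rate * gradient; parameter += delta_parameter
        w += -learning_rate * grad_w
        b += -learning_rate * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = gradient_descent_line_fit(xs, ys)
```

With a learning rate that is too large the same loop diverges, which is why the rate is usually tuned.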

Let's now look into a practical learning system:
Artificial Neural Network
Output classes: Cat, Dog, Panda
Neuron: a very small unit of computation
So the parameters of an ANN are:
1. Incoming weights of every neuron
2. Bias of every neuron
These are the ones that need to be tuned during learning.
We perform gradient descent on these parameters.
The backpropagation algorithm is a popular method of computing
$$\nabla_{\text{weights and biases}}\, \text{Error function}$$
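To make the two kinds of parameters concrete, here is a sketch of a single artificial neuron. It assumes the common form "weighted sum of inputs plus bias, passed through a sigmoid"; the choice of sigmoid, the function name, and the numbers are assumptions for illustration.

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, squashed by a sigmoid."""
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))


# Hypothetical neuron with two incoming weights and one bias.
# Pre-activation: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
out = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
```

Learning changes only `weights` and `bias`; the inputs come from the data or from the previous layer.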

Backpropagation algorithm
Figure: input pattern vector, with weight matrices W21 and W32 between successive layers
Forward propagate:
Error calculation:
Backward propagation:
If k ∈ output layer
If k ∈ hidden layer
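A minimal NumPy sketch of these three phases for a network with one hidden layer. The formulas used here are the standard ones for sigmoid units with a squared-error criterion, which is an assumption about what the slide showed; the names W21 and W32 follow the figure labels, everything else (sizes, random values, helper names) is hypothetical. The last lines check one backpropagated gradient against a finite-difference estimate.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, W21, b2, W32, b3):
    """Forward propagate: input -> hidden -> output."""
    h = sigmoid(W21 @ x + b2)
    y = sigmoid(W32 @ h + b3)
    return h, y


def error(x, t, W21, b2, W32, b3):
    """Error calculation: half the squared distance to the target t."""
    _, y = forward(x, W21, b2, W32, b3)
    return 0.5 * np.sum((y - t) ** 2)


def backprop(x, t, W21, b2, W32, b3):
    """Backward propagation: gradients of the error w.r.t. all parameters."""
    h, y = forward(x, W21, b2, W32, b3)
    delta3 = (y - t) * y * (1.0 - y)            # k in the output layer
    delta2 = (W32.T @ delta3) * h * (1.0 - h)   # k in the hidden layer
    return np.outer(delta2, x), delta2, np.outer(delta3, h), delta3


# Hypothetical sizes: 4 inputs, 5 hidden units, 3 output classes.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
t = np.array([1.0, 0.0, 0.0])
W21 = rng.normal(scale=0.5, size=(5, 4))
b2 = np.zeros(5)
W32 = rng.normal(scale=0.5, size=(3, 5))
b3 = np.zeros(3)
grad_W21, grad_b2, grad_W32, grad_b3 = backprop(x, t, W21, b2, W32, b3)

# Finite-difference estimate of the gradient for one weight, as a sanity check.
eps = 1e-6
W_plus, W_minus = W21.copy(), W21.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (error(x, t, W_plus, b2, W32, b3) - error(x, t, W_minus, b2, W32, b3)) / (2 * eps)
```

The hidden-layer deltas reuse the output-layer deltas, which is what makes the backward pass cheap.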

Well after all, life is tough…
• The parameters of a neural network are generally initialized to random values.
• Starting from these random values (which carry no useful information),
it is very difficult (not impossible, but time-consuming)
for backpropagation to arrive at good values of these
parameters.
• Saturating activation functions built from exponentials, like the sigmoid and the
hyperbolic tangent, are traditionally used in artificial neurons. The gradients of
these functions are prone to become close to zero
in the course of backpropagation.
• If the gradients in a layer get close to zero, they cause the
gradients in the preceding (lower) layers to vanish too. As a result the
weights and biases in the lower layers remain poorly trained.
• This phenomenon is called the “vanishing gradient” problem in the literature.
These problems crop up very frequently in neural networks that contain a
large number of hidden layers and a very large number of parameters
(the so-called Deep Neural Networks).
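A rough numerical illustration of the effect. It assumes a deliberately simplified chain with one sigmoid neuron per layer and unit weights, and evaluates every neuron at pre-activation 0, where the sigmoid slope is at its maximum of 0.25; real networks are more complicated, so treat this only as a sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(pre_activations, weight=1.0):
    """Factor by which a gradient is scaled after passing back through the chain."""
    scale = 1.0
    for z in pre_activations:
        scale *= weight * sigmoid_derivative(z)
    return scale


shallow = gradient_scale([0.0] * 2)   # two layers: 0.25 ** 2
deep = gradient_scale([0.0] * 10)     # ten layers: 0.25 ** 10
```

Even in this best case the gradient shrinks by a factor of four per layer, so after ten layers less than a millionth of it is left.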

How to get around? Ans: Make an “informed” initialization
• A signal is nothing but a set of random variables.
• These random variables jointly take values from a probability distribution that depends on the nature of the
source of the signal.
E.g.: A blank 28x28 pixel array can house numerous kinds of images. The set of 784 random variables assumes
values from a different joint probability distribution for every class of objects/scenes:
$$(x_1, x_2, \ldots, x_{784}) \sim P_{\text{digit}}(x_1, x_2, \ldots, x_{784})$$
$$(x_1, x_2, \ldots, x_{784}) \sim P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$$

Let's try and model the probability distribution of interest
Our target distribution: $P_{\text{human face}}(x_1, x_2, \ldots, x_{784})$
We try to capture this distribution in a model that looks quite similar to
a single layer neural network.
The Restricted Boltzmann Machine: It’s a probabilistic graphical model (a special kind of Markov Random Field) that is
capable of modelling a wide variety of probability distributions.
The hidden units capture the dependencies among the “visible” variables.

The working of RBM
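Parameters of the RBM:
1. Weights on the edges: $w_{i,j}$
2. Biases on each node: the $b_i$'s and $c_j$'s
Using these we define a joint probability distribution over the
“visible” variables (the $v_j$'s) and the “hidden” variables (the $h_i$'s) as:
$$P_{RBM}(\mathbf{v}, \mathbf{h}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h})}$$
where $E(\mathbf{v}, \mathbf{h})$ is the energy function, determined by the weights and biases,
and Z is a normalization term called the “partition function”.
We want the marginal of the RBM over the visible variables to match the target distribution:
$$P_{\text{human face}}(v_1, v_2, \ldots, v_{784}) \approx \sum_{\mathbf{h}} P_{RBM}(\mathbf{v}, \mathbf{h}) = P_{RBM}(v_1, v_2, \ldots, v_{784})$$
The mismatch is measured by the K-L divergence:
$$KL(P_{\text{human face}} \,\|\, P_{RBM}) = \sum_{v_1, v_2, \ldots, v_{784}} P_{\text{human face}}(v_1, \ldots, v_{784}) \ln \frac{P_{\text{human face}}(v_1, \ldots, v_{784})}{P_{RBM}(v_1, \ldots, v_{784})}$$
$$= -H(P_{\text{human face}}) - \sum_{v_1, v_2, \ldots, v_{784}} P_{\text{human face}}(v_1, \ldots, v_{784}) \ln P_{RBM}(v_1, \ldots, v_{784})$$
• First term, $-H(P_{\text{human face}})$: not under our control
• Second sum: empirical average of the log-likelihood of data under the model distribution. MAXIMIZE it.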
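For a tiny RBM these quantities can be computed exactly by enumerating every state. The sketch below assumes the standard binary-RBM energy E(v, h) = −Σ w_ij h_i v_j − Σ b_i h_i − Σ c_j v_j; attaching the b's to the hidden units and the c's to the visible units is an assumption based on the index letters, and all numbers are hypothetical.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """Assumed standard binary-RBM energy; W[i][j] couples hidden i and visible j."""
    interaction = sum(W[i][j] * h[i] * v[j]
                      for i in range(len(h)) for j in range(len(v)))
    hidden_bias = sum(b[i] * h[i] for i in range(len(h)))
    visible_bias = sum(c[j] * v[j] for j in range(len(v)))
    return -interaction - hidden_bias - visible_bias


def visible_marginal(W, b, c):
    """P_RBM(v) = (1/Z) * sum_h exp(-E(v, h)), by brute-force enumeration."""
    visible_states = list(itertools.product([0, 1], repeat=len(c)))
    hidden_states = list(itertools.product([0, 1], repeat=len(b)))
    unnormalized = {
        v: sum(math.exp(-energy(v, h, W, b, c)) for h in hidden_states)
        for v in visible_states
    }
    Z = sum(unnormalized.values())  # partition function
    return {v: u / Z for v, u in unnormalized.items()}


def kl_to_model(p_data, p_model):
    """KL(P_data || P_model) over the visible states that carry data mass."""
    return sum(p * math.log(p / p_model[v]) for v, p in p_data.items() if p > 0.0)


# Hypothetical RBM with 2 hidden and 3 visible units.
W = [[0.5, -0.3, 0.8], [-0.6, 0.4, 0.1]]
b = [0.1, -0.2]
c = [0.0, 0.3, -0.1]
p_rbm = visible_marginal(W, b, c)

# Hypothetical target distribution concentrated on two visible patterns.
p_data = {(1, 1, 0): 0.5, (0, 0, 1): 0.5}
kl = kl_to_model(p_data, p_rbm)
```

With 784 visible units this enumeration is hopeless (2^784 visible states), which is why Z is never computed exactly in practice.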

Layer-wise pre-training using RBM
• Every hidden layer is pre-trained as the hidden layer of an RBM.
• As the RBM models the statistics of the input, its weights and
biases carry meaningful information about the input.
Using these as initial values of the parameters of a deep neural
network has shown substantial improvement over random
initialization, both in training time and in performance.
• This is followed by fine-tuning over the entire network via
back-propagation.

The Autoencoder
• An autoencoder is a neural network operating in unsupervised
learning mode
• The target output is set equal to the input
• Learns an identity mapping from the input to the output
• Applications:
  • Dimensionality reduction (efficient, non-linear)
  • Representation learning (discovering interesting structures)
  • Alternative to RBM for layer-wise pre-training of DNN
Figure: a deep stacked autoencoder
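A minimal NumPy sketch of a single-hidden-layer autoencoder trained by gradient descent on the mean squared reconstruction error. The 8-3-8 layout, the one-hot toy data, the learning rate, and all names are hypothetical choices for illustration; a deep stacked autoencoder would repeat this layer by layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, n_hidden, learning_rate=1.0, epochs=1000, seed=0):
    """Train input -> code -> reconstruction; the target is the input itself."""
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
    b_dec = np.zeros(n_inputs)
    history = []
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)       # low-dimensional code
        X_hat = sigmoid(H @ W_dec + b_dec)   # reconstruction of the input
        history.append(float(np.mean((X_hat - X) ** 2)))
        # Backpropagate the mean squared reconstruction error.
        d_out = (2.0 / X.size) * (X_hat - X) * X_hat * (1.0 - X_hat)
        d_hid = (d_out @ W_dec.T) * H * (1.0 - H)
        W_dec -= learning_rate * (H.T @ d_out)
        b_dec -= learning_rate * d_out.sum(axis=0)
        W_enc -= learning_rate * (X.T @ d_hid)
        b_enc -= learning_rate * d_hid.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), history


# Hypothetical toy data: 8 one-hot patterns squeezed through 3 hidden units.
X = np.eye(8)
params, history = train_autoencoder(X, n_hidden=3)
codes = sigmoid(X @ params[0] + params[1])
```

The rows of `codes` are the 3-dimensional representations; in layer-wise pre-training they would serve as the input for the next autoencoder in the stack.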

So deep learning ≈ training “deep” neural
networks with many hidden layers
Step 1: Unsupervised layer-wise pre-training
Step 2: Supervised fine-tuning
- This is pretty much how deep learning works. However,
there is a class of deep networks called convolutional neural
networks that often do not need pre-training, because these
networks use extensive parameter sharing and rectified linear
activation functions.
Viewed from a different perspective, deep learning
looks really amazing!

Traditional machine learning vs. deep learning
Traditional machine learning: Data → Hand-engineering of feature extractors → Data representations by feature extractors → Inference engine
Deep learning: Data → Data-driven target-oriented representation learning → Inference engine
Inference engine tasks:
• Classification
• Regression
• Clustering
• Efficient coding

What’s so special about it?
Traditional machine learning
• Designing feature detectors requires careful engineering and
considerable domain expertise
• Representations must be selective to aspects of data that are
important for our task and invariant to the irrelevant aspects
(selectivity-invariance dilemma)
Deep learning
• Abstractions of hierarchically increasing complexity are learnt by
a data-driven approach using general-purpose learning
procedures
• A composition of simple non-linear modules can learn very
complex functions
• Cost functions specific to the problem amplify aspects of the
input that are important for the task and suppress irrelevant
variations

Pretty much how we humans go about analyzing…

Some deep architectures:-
• Deep stacked autoencoder: used for efficient non-linear dimensionality reduction and
discovering salient underlying structures in data
• Deep convolutional neural network: exploits stationarity of natural data and uses the
concept of parameter sharing to study large images and long spoken/written strings
and make inferences from them
• Recurrent neural network: custom made for modelling dynamic systems; finds use in
natural language (speech and text) processing, machine translation, etc.

Classical automatic speech recognition system
Main pipeline: Signal acquisition → Feature extraction → Acoustic modelling → Viterbi beam search / A* decoding → N-best sentences or word lattice → Rescoring → FINAL UTTERANCE
Supporting models:
• Acoustic model generation → Phonetic utterance models
• Sentence model preparation → Sentence model

Some of our works:-
2015:
Deep neural network and Random Forest hybrid architecture for learning to detect retinal vessels in fundus images (accepted at EMBC-2015, Milan, Italy)
Our architecture:
Average accuracy of detection: 93.27%
2014-15:
Faster learning of deep stacked autoencoders on multi-core systems through synchronized layer-wise pre-training (accepted at PDCKDD Workshop, a part of ECML-PKDD 2015, Porto, Portugal)
Conventional serial pre-training:
Proposed algorithm:
26% speedup for compression of MNIST handwritten digits

Take-home messages
• Deep learning is a set of algorithms that have been designed to
1. Train neural networks with a large number of hidden layers.
2. Learn features of hierarchically increasing complexity in a data-driven and objective-driven way.
• Deep neural networks are setting records across many AI tasks; for some classes of functions it can be proved
that they can model highly non-linear functions of the data with fewer parameters than shallow networks.
• Deep learning is extremely interesting and a breeze to implement once the underlying philosophies are understood. It
has great potential for use in a lot of ongoing projects at KGP.
If you are interested in going deep into deep learning…
• Take Andrew Ng’s Machine Learning course on Coursera
• Visit ufldl.Stanford.edu and read the entire tutorial
• Read LeCun’s latest deep learning review published in Nature

Thank you so much!
Please give me some feedback for this talk by visiting:
bit.do/RateAnirban
Or just scan the QR code
