2. Outline
1. Why Care?
2. Why it works?
3. Logistic Regression to Neural Networks
4. Deep Networks and Issues
5. Autoencoders and Stacked Autoencoders
6. Why Deep Learning Works
7. Theano Overview
8. Code Hands On
3. Why Care?
● Has bettered the state of the art on various tasks
– Speech Recognition
● Microsoft Audio Video Indexing Service speech system uses Deep Learning
● Brought down the word error rate by 30% compared to the SOTA GMM-based system
– Natural Language Understanding/Processing
● Neural net language models are the current state of the art in language modeling, Sentiment
Analysis, Paraphrase Detection, and many other NLP tasks
● The SENNA system uses neural embeddings for various NLP tasks like POS tagging, NER,
chunking, etc.
● SENNA is not only better than SOTA but also much faster
– Object Recognition
● Breakthrough started in 2006 with the MNIST dataset. Still the best (0.27% error)
● SOTA error rate on the ImageNet dataset brought down to 15.3%, compared to 26.1%
7. #2 Distributed Representation
● Features can be non-mutually exclusive, for example in language.
● Need to move beyond one-hot representations such as those produced by clustering
algorithms, k-nearest neighbors, etc.
● O(N) parameters/examples for O(N) input regions, but with a distributed representation we
can do better: O(k) parameters/examples for O(2^k) input regions (see the sketch below)
● Similar to multi-clustering, where multiple clustering algorithms are applied in
parallel or the same clustering algorithm is applied to different input regions
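
A minimal Python sketch of the counting argument above (the sizes and names are illustrative, not from the slides): k binary features give a distributed code for up to 2^k input regions, while a one-hot (local) representation needs one unit per region.

import itertools
import numpy as np

k = 4  # hypothetical number of learned binary feature detectors

# Distributed representation: k binary features can index 2**k regions.
distributed_codes = list(itertools.product([0, 1], repeat=k))
print(len(distributed_codes))        # 16 regions from only k = 4 features

# Local / one-hot representation (clustering-style): one unit per region.
one_hot_codes = np.eye(len(distributed_codes), dtype=int)
print(one_hot_codes.shape)           # (16, 16): 16 units needed for the same 16 regions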
10. Logistic Regression
● Logistic regression is a probabilistic, linear classifier
● Parameterized by W and b
● Each output is the probability of the input belonging to class Y_i
● Prob of x being a member of class Y_i is calculated as
P(Y = Y_i | x) = softmax_i(Wx + b)
where
softmax_i(Wx + b) = e^{W_i^T x + b_i} / \sum_j e^{W_j^T x + b_j}
and
y_pred = argmax_{Y_i} P(Y = Y_i | x)
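
A minimal NumPy sketch of the prediction equations above (the toy shapes and variable names are illustrative assumptions):

import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))      # one weight row W_i per class (3 classes, 5 features)
b = np.zeros(3)                  # one bias b_i per class
x = rng.normal(size=5)

p = softmax(W @ x + b)           # P(Y = Y_i | x) for each class i
y_pred = int(np.argmax(p))       # y_pred = argmax_{Y_i} P(Y = Y_i | x)
print(p, y_pred)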
12. Multilayer Perceptron
● An MLP can be viewed as a logistic regression classifier where the
input is first transformed using a learned non-linear transformation.
● The non-linear transformation can be a sigmoid or a tanh function.
● Due to the addition of the non-linear hidden layer, the loss function is now
non-convex.
● Though there is no sure way to avoid poor local minima, certain empirical
initializations of the weight matrix help.
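
A small NumPy sketch of the MLP idea above, i.e. logistic regression applied to a learned non-linear (tanh) transform of the input; the sizes and the small random initialization scale are illustrative assumptions:

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W_h = 0.1 * rng.normal(size=(4, 5))   # hidden layer: 5 inputs -> 4 hidden units
b_h = np.zeros(4)
W_o = 0.1 * rng.normal(size=(3, 4))   # logistic regression on top: 4 hidden -> 3 classes
b_o = np.zeros(3)

x = rng.normal(size=5)
h = np.tanh(W_h @ x + b_h)            # learned non-linear transformation of the input
p = softmax(W_o @ h + b_o)            # logistic regression applied to h instead of x
print(p, p.argmax())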
15. AutoEncoders
● Multilayer neural nets with target output = input.
● Reconstruction = decoder(encoder(input)), e.g.
a = tanh(Wx + b), x' = tanh(W^T a + c)
● Objective is to minimize the reconstruction error L = ||x' - x||^2
● PCA can be seen as an auto-encoder with a = Wx and x' = W^T a
● So an autoencoder can be seen as a non-linear PCA which tries to learn a latent
representation of the input.
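
A minimal NumPy sketch of the autoencoder above (the layer sizes, initialization, and data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 8, 3
W = 0.1 * rng.normal(size=(n_hidden, n_visible))
b = np.zeros(n_hidden)            # encoder bias
c = np.zeros(n_visible)           # decoder bias (decoder reuses W.T, i.e. tied weights)

x = rng.normal(size=n_visible)
a = np.tanh(W @ x + b)            # encoder:  a  = tanh(W x + b)
x_rec = np.tanh(W.T @ a + c)      # decoder:  x' = tanh(W^T a + c)
loss = np.sum((x_rec - x) ** 2)   # reconstruction error L = ||x' - x||^2
print(loss)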
19. Why PreTraining Works
● Hard to know exactly as deep nets are hard to analyze
● Regularization Hypothesis
– Pre-training acts like an added regularization term, leading to better generalization.
– It can be seen as: a good representation of P(X) leads to a good representation of P(Y|X).
● Optimization Hypothesis
– Pre-training yields a weight initialization that restricts the search to a region of
parameter space near better local minima (see the sketch below).
– These minima are not achievable via random initialization.
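
A small sketch of the optimization-hypothesis view (all data, sizes, and hyperparameters are hypothetical): each layer is greedily pre-trained as a tied-weight linear autoencoder (as on the PCA slide), and the resulting weights are what would initialize the supervised network instead of random values.

import numpy as np

rng = np.random.default_rng(0)

def pretrain_layer(X, n_hidden, n_steps=500, lr=0.01):
    # Fit a tied-weight linear autoencoder a = W x, x' = W^T a
    # by gradient descent on the reconstruction error ||x' - x||^2.
    W = 0.01 * rng.normal(size=(n_hidden, X.shape[1]))
    for _ in range(n_steps):
        A = X @ W.T                               # encode all examples
        err = A @ W - X                           # x' - x
        grad = 2 * (A.T @ err + W @ err.T @ X) / len(X)
        W -= lr * grad
    return W

X_unlabeled = rng.normal(size=(100, 8))           # hypothetical unlabeled data
W1 = pretrain_layer(X_unlabeled, n_hidden=4)      # greedy layer 1
H1 = np.tanh(X_unlabeled @ W1.T)                  # its features feed the next layer
W2 = pretrain_layer(H1, n_hidden=2)               # greedy layer 2

# Supervised fine-tuning would start from (W1, W2) rather than random weights,
# i.e. from a region of parameter space already shaped by P(X).
print("layer-1 reconstruction MSE:",
      np.mean((X_unlabeled @ W1.T @ W1 - X_unlabeled) ** 2))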