6. DEEP LEARNING IN COGNITIVE DOMAINS
http://media.npr.org/
http://cdn.cultofmac.com/
Where humans can recognise, act or answer accurately within seconds
blogs.technet.microsoft.com
8. dbta.com
DEEP LEARNING IN NON-COGNITIVE DOMAINS
Â⯠Where humans need extensive training to do well
Â⯠Domains that demand transparency & interpretability.
⯠⌠healthcare
⯠⌠security
⯠⌠genetics, foods, water âŚ
TEKsystems
9. AGENDA
• Part I: Theory
◦ Introduction to (mostly supervised) deep learning
• Part II: Practice
◦ Applying deep learning to non-cognitive domains
• Part III: Advanced topics
◦ Position yourself
10. PART I: THEORY
INTRODUCTION TO (MOSTLY SUPERVISED) DEEP LEARNING
• Neural net as function approximation & feature detector
• Three architectures: FFN – RNN – CNN
• Bag of tricks: dropout – piece-wise linear units – skip-connections – adaptive stochastic gradient
11. PART II: PRACTICE
APPLYING DEEP LEARNING TO NON-COGNITIVE DOMAINS
• Hands-on:
◦ Introducing programming frameworks (Theano, TensorFlow, Keras, MXNet); a minimal Keras sketch follows below
• Domain how-tos:
◦ Healthcare
◦ Software engineering
◦ Anomaly detection
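As a small taste of the hands-on part, here is a minimal sketch of defining and compiling a feed-forward classifier in Keras (layer sizes and the binary task are illustrative assumptions, not from the slides):

    # Minimal Keras sketch: a two-layer feed-forward classifier.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=100))  # hidden layer with ReLU
    model.add(Dropout(0.5))                                 # dropout for regularisation
    model.add(Dense(1, activation='sigmoid'))               # binary output
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    # model.fit(X_train, y_train, batch_size=32) would then train it on your data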
12. PART III: ADVANCED TOPICS
POSITION YOURSELF
• Unsupervised learning
• Relational data & structured outputs
• Memory & attention
• How to position ourselves
14. http://blog.refu.co/wp-content/uploads/2009/05/mlp.png
WHAT IS DEEP LEARNING?
• Fast answer: multilayer perceptrons (aka deep neural networks) of the 1980s, rebranded in 2006.
◦ But early nets got stuck at 1–2 hidden layers.
• Slow answer: distributed representation, multiple steps of computation, modelling the compositionality of the world, a better prior, advances in compute, data & optimization, neural architectures, etc.
19. STARTING POINT: FEATURE LEARNING
• In typical machine learning projects, 80–90% of the effort goes into feature engineering.
◦ With the right feature representation, not much more is needed: simple linear methods often work well.
• Text: BOW, n-grams, POS, topics, stemming, tf-idf, etc. (a toy example of such hand-crafted features follows below)
• SW: tokens, LOC, API calls, #loops, developer reputation, team complexity, report readability, discussion length, etc.
• Try it yourself on Kaggle.com!
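For contrast with learned features, a toy sketch of classic hand-crafted text features using scikit-learn (the corpus, labels and settings are illustrative assumptions):

    # Bag-of-words n-grams weighted by tf-idf, fed to a simple linear classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["the build failed again", "tests passed on all platforms",
            "crash when opening large files", "release notes look good"]
    labels = [1, 0, 1, 0]                              # e.g., 1 = bug report, 0 = not

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
    X = vectorizer.fit_transform(docs)                 # sparse tf-idf feature matrix
    clf = LogisticRegression().fit(X, labels)          # simple linear method on good features
    print(clf.predict(vectorizer.transform(["the app crashed on startup"])))

With the right features, this kind of simple linear pipeline is often hard to beat; deep learning aims to learn such features rather than engineer them.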
20. THE BUILDING BLOCKS:
FEATURE DETECTOR
[Figure: an integrate-and-fire neuron receiving signals and feedback, shown next to its block representation as a feature detector. Source: andreykurenkov.com]
21. THREE MAIN ARCHITECTURES: FFN, RNN & CNN
• Feed-forward (FFN): function approximator (vector to vector)
◦ Most existing ML/statistics falls into this category
• Recurrent (RNN): sequence to sequence
◦ Temporal, sequential data, e.g., sentences, actions, DNA, EMR
◦ Program evaluation/execution, e.g., sorting, the travelling salesman problem
• Convolutional (CNN): image to vector/sequence/image
◦ In time: speech, DNA, sentences
◦ In space: images, video, relations
22. FEED-FORWARD NET (FFN)
VEC2VEC MAPPING
Forward pass: f(x) carries each unit's contribution to the outcome.
Backward pass: f'(x) assigns credit to each unit.
(A tiny numpy sketch of both passes follows below.)
• Problems with depth: multiple chained non-linearity steps lead to
◦ neural saturation (top units have little signal from the bottom)
◦ vanishing gradients (bottom layers have much less training information from the label).
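A tiny numpy sketch of one forward and backward pass through a small net (sizes, tanh units and the squared-error loss are illustrative assumptions):

    import numpy as np

    rng = np.random.RandomState(0)
    x, y = rng.randn(4), np.array([1.0])           # one input vector, one scalar target
    W1, W2 = rng.randn(3, 4) * 0.1, rng.randn(1, 3) * 0.1

    # Forward pass f(x): carry each unit's contribution to the output
    a = W1 @ x                                     # hidden pre-activations
    h = np.tanh(a)                                 # non-linearity (saturates at +/-1)
    y_hat = W2 @ h
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass f'(x): assign credit to units via the chain rule
    d_yhat = y_hat - y
    dW2 = np.outer(d_yhat, h)
    d_h = W2.T @ d_yhat
    d_a = d_h * (1 - h ** 2)                       # tanh'(a); near zero when saturated
    dW1 = np.outer(d_a, x)                         # shrinks further with every extra layer

Chaining many such layers multiplies these small factors together, which is exactly the vanishing-gradient problem described above.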
23. TWO VIEWS OF FFN
• The functional view: FFN as a nested composition of functions
(Bengio, DLSS 2015)
• The representation view: FFN as staged abstraction of data
24. SOLVING THE PROBLEM OF VANISHING GRADIENTS
• Principle: enlarge the channel that passes features and gradients
• Method 1: remove saturation with piecewise linear units
◦ Rectified linear unit (ReLU)
• Method 2: explicit copying through gating & skip-connections
◦ Highway networks
◦ Skip-connections: ResNet
25. METHOD 1: RECTIFIED LINEAR TRANSFORMATION
• The usual logistic and tanh transforms saturate towards infinity.
• Their gradient approaches zero, making learning impossible.
• The rectified linear function has a constant gradient, so information flows much better (illustrated below).
Source: https://imiloainf.wordpress.com/2013/11/06/rectifier-nonlinearities/
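A small numpy illustration of the difference (the pre-activation values are arbitrary): the sigmoid gradient vanishes for large inputs, while the ReLU gradient stays at 1 for any positive input.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    a = np.array([-10.0, -1.0, 0.5, 5.0, 20.0])       # pre-activations
    sigmoid_grad = sigmoid(a) * (1.0 - sigmoid(a))    # ~0 at both ends: saturation
    relu_grad = (a > 0).astype(float)                 # constant 1 wherever the unit is active

    print(sigmoid_grad)   # [~4.5e-05, 0.20, 0.24, 0.0066, ~2e-09]
    print(relu_grad)      # [0., 0., 1., 1., 1.]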
26. METHOD 2: SKIP-CONNECTION
WITH RESIDUAL NET
Theory: http://qiita.com/supersaiakujin/items/935bbc9610d0f87607e8
Practice: http://torch.ch/blog/2016/02/04/resnets.html
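A minimal numpy sketch of the residual idea (weights and sizes are illustrative): each block learns only a correction F(x), while the identity shortcut copies x forward and passes gradients backward unchanged.

    import numpy as np

    def relu(a):
        return np.maximum(0.0, a)

    def residual_block(x, W1, W2):
        # y = x + F(x): the skip-connection gives gradients a slope-1 path
        # back to earlier layers, regardless of what F learns.
        return x + W2 @ relu(W1 @ x)

    rng = np.random.RandomState(0)
    x = rng.randn(8)
    W1, W2 = rng.randn(8, 8) * 0.1, rng.randn(8, 8) * 0.1

    h = x
    for _ in range(10):        # stacking such blocks gives very deep nets that still train
        h = residual_block(h, W1, W2)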
27. THREE MAIN ARCHITECTURES: FFN, RNN & CNN
• Feed-forward (FFN): function approximator (vector to vector)
◦ Most existing ML/statistics falls into this category
• Recurrent (RNN): sequence to sequence
◦ Temporal, sequential data, e.g., sentences, actions, DNA, EMR
◦ Program evaluation/execution, e.g., sorting, the travelling salesman problem
• Convolutional (CNN): image to vector/sequence/image
◦ In time: speech, DNA, sentences
◦ In space: images, video, relations
28. RECURRENT NET (RNN)
Example application: a language model, which essentially predicts the next word given the previous words in the sentence.
An embedding is often used to convert a word into a vector; it can be initialised from word2vec. (A minimal numpy sketch of the recurrence follows below.)
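A minimal numpy sketch of the recurrence inside such a language model (vocabulary, layer sizes and the plain tanh cell are illustrative assumptions):

    import numpy as np

    rng = np.random.RandomState(0)
    V, E, H = 1000, 50, 64                    # vocabulary, embedding and hidden sizes
    Emb = rng.randn(V, E) * 0.01              # could instead be initialised from word2vec
    Wxh = rng.randn(H, E) * 0.01
    Whh = rng.randn(H, H) * 0.01
    Why = rng.randn(V, H) * 0.01

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    h = np.zeros(H)
    sentence = [12, 7, 301, 5]                # ids of the previous words (toy values)
    for w in sentence:
        x = Emb[w]                            # look up the word embedding
        h = np.tanh(Wxh @ x + Whh @ h)        # the recurrent state carries the history
    p_next = softmax(Why @ h)                 # distribution over the next word
    print(p_next.argmax())                    # most likely next word id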
29. RNN IS POWERFUL BUT …
• RNNs are Turing-complete (Hava T. Siegelmann and Eduardo D. Sontag, 1991).
• The brain can be thought of as a giant RNN over discrete time steps.
• But training is very difficult for long sequences (more than 10 steps), due to:
◦ Vanishing gradients
◦ Exploding gradients
http://www.wildml.com/
30. SOLVING THE PROBLEM OF VANISHING/EXPLODING GRADIENTS
• Trick 1: modify the gradient, e.g., truncation (clipping) for exploding gradients; does not always work (sketched below)
• Trick 2: change the learning dynamics, e.g., adaptive gradient descent partly solves the problem (we will see this later)
• Trick 3: modify the information flow:
◦ Explicit copying through gating: Gated Recurrent Unit (GRU, 2014)
◦ Explicit memory: Long Short-Term Memory (LSTM, 1997)
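A small sketch of Trick 1, gradient-norm clipping for exploding gradients (the threshold is an illustrative choice):

    import numpy as np

    def clip_gradient(grad, max_norm=5.0):
        # Rescale the gradient if its norm exceeds max_norm.
        # This tames exploding gradients but does nothing for vanishing ones.
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    g = np.array([30.0, -40.0])               # exploding gradient, norm = 50
    print(clip_gradient(g))                   # rescaled to norm 5: [ 3. -4.]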
32. RNN: WHERE IT WORKS
[Figure: RNN input/output configurations for classification, image captioning, sentence classification & embedding, neural machine translation, and sequence labelling. Source: http://karpathy.github.io/assets/rnn/diags.jpeg]
33. THREE MAIN ARCHITECTURES: FFN, RNN & CNN
• Feed-forward (FFN): function approximator (vector to vector)
◦ Most existing ML/statistics falls into this category
• Recurrent (RNN): sequence to sequence
◦ Temporal, sequential data, e.g., sentences, actions, DNA, EMR
◦ Program evaluation/execution, e.g., sorting, the travelling salesman problem
• Convolutional (CNN): image to vector/sequence/image
◦ In time: speech, DNA, sentences
◦ In space: images, video, relations
34. LEARNABLE CONVOLUTION AS FEATURE DETECTOR
(TRANSLATION INVARIANCE)
[Figure: learnable kernels acting as feature detectors, often many per layer. Sources: http://colah.github.io/posts/2015-09-NN-Types-FP/, andreykurenkov.com]
35. CNN IS (CONVOLUTION → POOLING) REPEATED
F(x) = NeuralNet(Pooling(Rectifier(Conv(x))))
where Conv is the feature detector, Rectifier the nonlinearity, Pooling takes the max/mean over windows, and NeuralNet is the classifier; the (Conv → Rectifier → Pooling) block can be repeated N times, giving the depth (a toy numpy sketch follows below).
Source: adeshpande3.github.io
Design parameters:
• Padding
• Stride
• #Filters (maps)
• Filter size
• Pooling size
• #Layers
• Activation function
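A toy numpy sketch of one (convolution → rectifier → pooling) stage in 1D, with stride, padding, filter size and pooling size as the design parameters above (all values are illustrative):

    import numpy as np

    def conv1d(x, w, stride=1, pad=1):
        # Slide one learnable kernel w over x: the same feature detector at every position.
        x = np.pad(x, pad)
        k = len(w)
        n = (len(x) - k) // stride + 1
        return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n)])

    def max_pool1d(x, size=2):
        # Keep the strongest response in each window (adds translation invariance).
        return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

    signal = np.array([0., 1., 3., 1., 0., 0., 2., 4., 2., 0.])   # a repeated motif
    kernel = np.array([1., 2., 1.])                               # one learnable filter
    feature_map = np.maximum(0., conv1d(signal, kernel))          # Rectifier(Conv(x))
    pooled = max_pool1d(feature_map)                              # Pooling(...)
    # In a real CNN this block is repeated N times (with many filters) before the classifier.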
36. CNN: WHERE IT WORKS
• Translation invariance: images, video, repeated motifs
• Examples:
◦ Sentence as a sequence of words (used in conjunction with word embeddings)
◦ Sentence as a sequence of characters
◦ DNA sequences over {A,C,T,G}
◦ Relations (e.g., right-to, left-to, father-of, etc.)
DeepBind, Nature 2015: http://www.nature.com/nbt/journal/v33/n8/full/nbt.3300.html
38. PRACTICAL REALISATION
• More data
• More GPUs
• Bigger models
• Better models
• Faster iterations
• Pushing the limit of priors
• Pushing the limit of patience (best models take 2–3 weeks to run)
• A LOT OF NEW TRICKS
Data as new fuel: http://felixlaumon.github.io/2015/01/08/kaggle-right-whale.html
39. TWO ISSUES IN LEARNING
1. Slow learning and local traps
◦ Very deep nets make gradients uninformative
◦ Model uncertainty leads to rugged objective functions with exponentially many local minima
2. Data/model uncertainty and overfitting
◦ Many models are possible
◦ Current models are very big, with hundreds of millions of parameters
◦ Deeper is more powerful, but means more parameters.
40. SOLVING THE ISSUE OF SLOW LEARNING AND LOCAL TRAPS
• Redesign the model, e.g., using skip-connections
• Sensible initialisation
• Adaptive stochastic gradient descent
41. SENSIBLE INITIALISATION
• Random Gaussian initialisation often works well
◦ TIP: control the fan-in/fan-out norms (sketched below)
• If not, use pre-training:
◦ When no existing models are available: unsupervised learning (e.g., word2vec, language models, autoencoders)
◦ Transfer from other models, e.g., popular in vision with AlexNet, Inception, ResNet, etc.
Source: (Erhan et al., JMLR'10, Fig 6)
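A small sketch of the fan-in/fan-out tip (the Glorot/Xavier and He scalings are the standard choices; shown here as an illustration):

    import numpy as np

    rng = np.random.RandomState(0)

    def glorot_normal(fan_in, fan_out):
        # Scale the Gaussian by the fan-in/fan-out so activations and gradients
        # keep roughly the same variance from layer to layer (Glorot & Bengio).
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return rng.randn(fan_out, fan_in) * std

    def he_normal(fan_in, fan_out):
        # He et al. variant, usually preferred with ReLU units.
        return rng.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)

    W1 = glorot_normal(784, 256)    # layer sizes are illustrative
    W2 = he_normal(256, 10)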
42. STOCHASTIC GRADIENT DESCENT (SGD)
• Use mini-batches to smooth out the gradient (a minimal loop is sketched below)
• Use a large enough learning rate to get over poor local minima
• Periodically reduce the learning rate to land in a good local minimum
• It sounds like simulated annealing, but without a proven global minimum
• Works well in practice, since the energy landscape is a funnel
Source: https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/
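A minimal sketch of mini-batch SGD with a periodically reduced learning rate (the linear model, loss and schedule are illustrative assumptions):

    import numpy as np

    rng = np.random.RandomState(0)
    X, y = rng.randn(1000, 5), rng.randn(1000)       # toy regression data
    w = np.zeros(5)
    lr, batch_size = 0.1, 32

    for epoch in range(30):
        if epoch > 0 and epoch % 10 == 0:
            lr *= 0.1                                # periodically lower the learning rate
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)   # mini-batch smooths the gradient
            w -= lr * grad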
43. ADAPTIVE SGDs
• Problems with SGD:
◦ Poor gradient information, ill-conditioning, slow convergence rate
◦ Scheduling the learning rate is an art
◦ Pathological curvature
• Speeding up the search: exploiting the local search direction with momentum
• Rescaling the gradient with Adagrad
• A smoother version of Adagrad: RMSProp (usually good for RNNs)
• All tricks combined: Adam (usually good for most jobs; sketched below)
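A compact sketch of the Adam update, which combines momentum with RMSProp-style per-coordinate rescaling (the hyper-parameters shown are the commonly published defaults):

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # m: running mean of gradients (momentum); v: running mean of squared gradients.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)                     # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # rescaled, momentum-smoothed step
        return w, m, v

    # Usage: initialise m = v = np.zeros_like(w) and call with t = 1, 2, 3, ...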
45. SOLVING THE ISSUE OF DATA/MODEL UNCERTAINTY AND OVERFITTING
• Dropout as a fast ensemble/Bayesian approximation
• Max-norm: on hidden units, channels and paths
Dropout is known as the best trick of the past 10 years
46. DROPOUT
• A method to build an ensemble of neural nets at zero cost
◦ For each parameter update, for each data point, randomly remove part of the hidden units
Source: (Srivastava et al., JMLR 2014)
47. DROPOUT AS ENSEMBLE
• A method to build an ensemble of neural nets at zero cost (a minimal sketch follows below)
◦ There are 2^K such thinned networks for K units
◦ At the end, scale the parameters by the same proportion (the keep probability)
• Only one model is reported, but with the effect of n models, where n is the number of data points.
• Can be extended easily to:
◦ Drop features
◦ Drop connections
◦ Drop any component
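A minimal numpy sketch of (inverted) dropout on a hidden layer; the 0.5 keep probability is the usual choice for hidden units, and scaling at train time replaces the test-time rescaling mentioned above:

    import numpy as np

    rng = np.random.RandomState(0)

    def dropout(h, p_keep=0.5, train=True):
        # Randomly remove hidden units for this parameter update.  Dividing by p_keep
        # at train time means no rescaling is needed when the full net is used at test time.
        if not train:
            return h
        mask = rng.rand(*h.shape) < p_keep        # each unit kept with probability p_keep
        return h * mask / p_keep

    h = rng.randn(8)
    print(dropout(h))                 # roughly half the units zeroed, the rest scaled up
    print(dropout(h, train=False))    # test time: the full averaged network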
49. TWO MAJOR VIEWS OF "DEPTH" IN DEEP LEARNING
• [2006–2012] Learning layered representations, from raw data to abstracted goal (DBN, DBM, SDAE, GSN).
◦ Typically 2–3 layers.
◦ High hopes for unsupervised learning. A conference was set up for this: ICLR, starting in 2013.
◦ We will return to this in Part III.
• [1991–1997] & [2012–2016] Learning using multiple steps, from data to goal (LSTM/GRU, NTM/DNC, N2N Mem, HWN, CLN).
◦ Reaching hundreds if not thousands of layers.
◦ Learning as credit assignment.
◦ Supervised learning won.
◦ Unsupervised learning took a detour (VAE, GAN, NADE/MADE).
50. WHEN DOES DEEP LEARNING WORK?
• Lots of data (e.g., millions of examples)
• Strong, clean training signals (e.g., when humans can provide correct labels – cognitive domains)
◦ Andrew Ng of Baidu: when humans do well within a sub-second
• Data structures are well defined (e.g., image, speech, NLP, video)
• Data is compositional (luckily, most data are like this)
• The more primitive (raw) the data, the more benefit from using deep learning.
51. IN CASE YOU FORGOT, DEEP LEARNING MODELS ARE COMBINATIONS OF
• Three models:
◦ FFN (layered vectors)
◦ RNN (recurrence for sequences)
◦ CNN (translation invariance + pooling)
→ Architecture engineering!
• A bag of tricks:
◦ dropout
◦ piece-wise linear units (i.e., ReLU)
◦ adaptive stochastic gradient descent
◦ data augmentation
◦ skip-connections