11. Deep Learning Today
• Advances in speech recognition over the last two years
• A few long-standing performance records were broken with deep learning methods
• Microsoft and Google have both deployed DL-based speech recognition systems in
their products
• Advances in Computer Vision
• Feature engineering is the bread-and-butter of a large portion of the CV community,
which creates some resistance to feature learning
• But the record holders on ImageNet and Semantic Segmentation are convolutional
nets
• Advances in Natural Language Processing
• Fine-grained sentiment analysis, syntactic parsing
• Language modeling, machine translation, question answering
12. Engine management
• The behaviour of a car engine is influenced
by a large number of parameters
– temperature at various points
– fuel/air mixture
– lubricant viscosity.
• Major companies have used neural networks
to dynamically tune an engine depending on
current settings.
14. Signature recognition
• Each person's signature is different.
• There are structural similarities which are
difficult to quantify.
• One company has manufactured a machine
which recognizes signatures to within a high
level of accuracy.
– Considers speed in addition to gross shape.
– Makes forgery even more difficult.
15. Sonar target recognition
• Distinguish mines from rocks on sea-bed
• The neural network is provided with a large
number of parameters which are extracted
from the sonar signal.
• The training set consists of sets of signals
from rocks and mines.
16. Stock market prediction
• “Technical trading” refers to trading based
solely on known statistical parameters; e.g.
previous price
• Neural networks have been used to attempt
to predict changes in prices.
• Difficult to assess success since companies
using these techniques are reluctant to
disclose information.
17. Mortgage assessment
• Assess risk of lending to an individual.
• Difficult to decide on marginal cases.
• Neural networks have been trained to make
decisions, based upon the opinions of expert
underwriters.
• Neural network produced a 12% reduction in
delinquencies compared with human experts.
22. Limitations of Neural Networks
Random initialization + densely connected networks lead to:
• High cost
• Each neuron in the neural network can be considered as a logistic regression.
• Training the entire neural network amounts to training all of these interconnected logistic regressions.
• Difficult to train as the number of hidden layers increases
• Recall that logistic regression is trained by gradient descent.
• In backpropagation, the gradient becomes progressively more dilute. That is, below the top layers,
the correction signal δ is minimal (a small numeric sketch follows after this list).
• Stuck in local optima
• The objective function of the neural network is usually not convex.
• Random initialization does not guarantee starting near the global optimum.
• Solution:
• Deep Learning/Learning multiple levels of representation
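Below is a minimal numerical sketch of the gradient-dilution point above, assuming NumPy; the network depth, width, weight scale and the all-ones error signal are illustrative choices, not part of the original slide.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # A deep, densely connected stack of sigmoid layers with random weights.
    n_layers, width = 8, 20
    weights = [rng.normal(scale=1 / np.sqrt(width), size=(width, width))
               for _ in range(n_layers)]

    # Forward pass, keeping every layer's activation.
    activations = [rng.normal(size=width)]
    for W in weights:
        activations.append(sigmoid(W @ activations[-1]))

    # Backward pass: push an error signal down through the stack. Each step
    # multiplies by sigmoid'(z) = a * (1 - a) <= 0.25, so the correction
    # signal below the top layers quickly becomes tiny.
    delta = np.ones(width)                     # arbitrary error at the output layer
    for i in range(n_layers - 1, 0, -1):
        a = activations[i]
        delta = (weights[i].T @ delta) * a * (1.0 - a)
        print(f"layer {i}: mean |delta| = {np.abs(delta).mean():.2e}")

Running this prints a mean |δ| that shrinks rapidly as the error signal moves away from the output layer, which is the practical obstacle the slide refers to.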
34. Training data
Fields           Class
1.4  2.7  1.9      0
3.8  3.4  3.2      0
6.4  2.8  1.7      1
4.1  0.1  0.2      0
etc …
Present a training pattern
[Diagram: the inputs 1.4, 2.7, 1.9 are presented to the network's input units]
35. Training data (same table as above)
Feed it through to get output
[Diagram: inputs 1.4, 2.7, 1.9 → network output 0.8]
36. Training data (same table as above)
Compare with target output
[Diagram: inputs 1.4, 2.7, 1.9 → output 0.8, target 0, error 0.8]
37. Training data (same table as above)
Adjust weights based on error
[Diagram: inputs 1.4, 2.7, 1.9 → output 0.8, target 0, error 0.8]
38. Training data (same table as above)
Present a training pattern
[Diagram: the inputs 6.4, 2.8, 1.7 are presented to the network's input units]
39. Training data (same table as above)
Feed it through to get output
[Diagram: inputs 6.4, 2.8, 1.7 → network output 0.9]
40. Training data (same table as above)
Compare with target output
[Diagram: inputs 6.4, 2.8, 1.7 → output 0.9, target 1, error -0.1]
41. Training data (same table as above)
Adjust weights based on error
[Diagram: inputs 6.4, 2.8, 1.7 → output 0.9, target 1, error -0.1]
42. Training data (same table as above)
And so on …
[Diagram: inputs 6.4, 2.8, 1.7 → output 0.9, target 1, error -0.1]
Repeat this thousands, maybe millions of times – each time
taking a random training instance, and making slight
weight adjustments
Algorithms for weight adjustment are designed to make
changes that will reduce the error; one simple such rule is sketched below
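One simple weight-adjustment rule of this kind is the delta rule for a single sigmoid unit; the sketch below, assuming NumPy, trains such a unit on the table above. The learning rate, iteration count and use of a single unit (rather than a full multi-layer network) are illustrative simplifications.

    import numpy as np

    rng = np.random.default_rng(0)

    # Training data from the table above: three fields -> class.
    X = np.array([[1.4, 2.7, 1.9],
                  [3.8, 3.4, 3.2],
                  [6.4, 2.8, 1.7],
                  [4.1, 0.1, 0.2]])
    t = np.array([0.0, 0.0, 1.0, 0.0])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = rng.normal(scale=0.1, size=3)   # random initial weights
    b = 0.0                             # bias term
    lr = 0.5                            # small steps = "slight weight adjustments"

    for step in range(20000):           # "thousands, maybe millions of times"
        i = rng.integers(len(X))        # take a random training instance
        o = sigmoid(X[i] @ w + b)       # feed it through to get the output
        error = t[i] - o                # compare with the target output
        # Delta rule: nudge each weight in the direction that reduces the error.
        grad = error * o * (1.0 - o)
        w += lr * grad * X[i]
        b += lr * grad

    print(np.round(sigmoid(X @ w + b), 2))   # outputs drift towards the class column 0, 0, 1, 0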
55. What does this unit detect?
[Diagram: one unit over the flattened image pixels (indices 1, 5, 10, 15, 20, 25, …) with a
strong +ve weight on each top-row pixel and low/zero weight everywhere else]
It will send a strong signal for a horizontal line in the top row, ignoring everywhere else;
a toy numeric illustration follows below.
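A toy numeric version of this unit, assuming a 5x5 black/white image flattened into 25 inputs (the slide does not give the actual image size): strong positive weight on the top-row pixels, zero weight everywhere else.

    import numpy as np

    # One unit over a flattened 5x5 binary image (pixel indices 0..24):
    # strong +ve weight on the five top-row pixels, low/zero weight elsewhere.
    w = np.zeros(25)
    w[:5] = 5.0

    top_line = np.zeros((5, 5)); top_line[0, :] = 1.0   # horizontal line in the top row
    mid_line = np.zeros((5, 5)); mid_line[2, :] = 1.0   # same line, but in a middle row

    print(w @ top_line.ravel())   # 25.0 -> strong signal
    print(w @ mid_line.ravel())   #  0.0 -> ignored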
62. Backpropagation Algorithm – Main Idea – error in hidden layers
The ideas of the algorithm can be summarized as follows:
1. Compute the error term for the output units using the
observed error.
2. From the output layer, repeat
- propagating the error term back to the previous layer, and
- updating the weights between the two layers,
until the earliest hidden layer is reached (standard equations for these steps are sketched below).
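In standard textbook notation (a sketch; the deck's own symbols may differ), with g the activation function, in_j the weighted input to unit j, a_i the activation of unit i, and α the learning rate, step 1 and the two repeated operations of step 2 are:

    δ_k = g′(in_k) · (t_k − o_k)             (error term at each output unit k)
    δ_j = g′(in_j) · Σ_k w_jk · δ_k          (error term propagated back to hidden unit j)
    w_ij ← w_ij + α · a_i · δ_j              (update of the weight between the two layers)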
63. Backpropagation Algorithm
• Initialize weights (typically random!)
• Keep doing epochs
• For each example e in the training set do
• forward pass to compute
• O = neural-net-output(network, e)
• miss = (T - O) at each output unit
• backward pass to calculate deltas to weights
• update all weights
• end
• until tuning set error stops improving
Forward pass explained earlier; backward pass explained in the next slide. A runnable sketch follows below.
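A minimal NumPy sketch of this pseudocode with one hidden layer, trained on the training table from the earlier slides. The hidden-layer size, learning rate and fixed epoch count are illustrative; a fixed count stands in for "until tuning set error stops improving".

    import numpy as np

    rng = np.random.default_rng(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Training set from the earlier slides: three fields -> class.
    X = np.array([[1.4, 2.7, 1.9], [3.8, 3.4, 3.2],
                  [6.4, 2.8, 1.7], [4.1, 0.1, 0.2]])
    T = np.array([[0.0], [0.0], [1.0], [0.0]])

    # Initialize weights (typically random!)
    W1, b1 = rng.normal(scale=0.5, size=(3, 4)), np.zeros(4)   # input -> hidden
    W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)   # hidden -> output
    lr = 0.5

    for epoch in range(5000):                 # keep doing epochs
        for x, t in zip(X, T):                # for each example e in the training set
            # Forward pass: O = neural-net-output(network, e)
            h = sigmoid(x @ W1 + b1)
            o = sigmoid(h @ W2 + b2)
            miss = t - o                      # miss = (T - O) at each output unit
            # Backward pass: deltas for the output and hidden layers
            d_out = miss * o * (1.0 - o)
            d_hid = (d_out @ W2.T) * h * (1.0 - h)
            # Update all weights
            W2 += lr * np.outer(h, d_out); b2 += lr * d_out
            W1 += lr * np.outer(x, d_hid); b1 += lr * d_hid

    print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))   # should approach 0, 0, 1, 0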
69. Bias
Each neuron is like a simple logistic regression: you
have y = σ(Wx + b). The input values are multiplied by the
weights, and the bias shifts the operating point of the squashing
function (sigmoid, tanh, etc.), which gives the desired non-linearity.
For example, assume that you want a neuron to
fire y ≈ 1 when all the input pixels are black (x ≈ 0). With
no bias, no matter what weights W you have, the
equation y = σ(Wx) means the neuron will always fire y ≈ 0.5.
A small numeric sketch follows below.
[Plot: tanh squashing function shifted by bias = 6; data values between -1 and 1]
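A small numeric sketch of the point above, assuming NumPy; the three-pixel input, the weights and the bias value of 6 (taken from the plot) are purely illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    W = np.array([-2.0, -2.0, -2.0])   # any weights at all
    x = np.zeros(3)                    # all input pixels black (x ~ 0)

    print(sigmoid(W @ x))              # no bias: the unit always fires 0.5
    print(sigmoid(W @ x + 6.0))        # bias = 6 shifts the squashing, so it fires ~1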
74. TensorFlow
• What is it:
• Neural network software for numerical computation - uses data flow graphs for computation
• Developed at Google’s machine intelligence research organization
• What can it be used for:
• Any machine learning / neural network problem (a minimal usage sketch follows at the end of this slide)
• Video Demonstration
• A six-minute video introduction to TensorFlow on YouTube.
• Further information:
• www.tensorflow.org
• https://www.youtube.com/watch?v=bYeBL92v99Y
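As a minimal usage sketch (assuming a current TensorFlow 2.x / Keras installation, which differs considerably from the graph-building API of the version this slide describes), here is the tiny three-field classifier from the earlier slides defined and trained in TensorFlow; layer sizes, learning rate and epoch count are illustrative.

    import numpy as np
    import tensorflow as tf

    X = np.array([[1.4, 2.7, 1.9], [3.8, 3.4, 3.2],
                  [6.4, 2.8, 1.7], [4.1, 0.1, 0.2]], dtype=np.float32)
    t = np.array([0.0, 0.0, 1.0, 0.0], dtype=np.float32)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(3,)),
        tf.keras.layers.Dense(4, activation="sigmoid"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.5), loss="mse")
    model.fit(X, t, epochs=2000, verbose=0)
    print(model.predict(X).round(2))   # predictions move towards 0, 0, 1, 0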
75. Torch
• What is it:
• Torch is a scientific computing framework for machine learning.
• The goal is to be flexible and allow the building of scientific algorithms quickly - contains neural network
and optimization libraries
• What can it be used for:
• Machine learning neural network problems
• Video Demonstration
• A three-minute introduction on YouTube.
• Further information:
• http://torch.ch/
• https://www.youtube.com/watch?v=uxja6iwOnc4&list=PLjJh1vlSEYgvGod9wWiydumYl8hOXixNu&index=19
76. CNTK
• What is it:
• CNTK stands for Computational Network Toolkit - created by Microsoft.
• Designed for use with CPUs or GPUs (i.e., graphics processing units)
• What can it be used for:
• Can be used for image classification problems, video analysis, speech recognition and natural language
processing.
• Video Demonstration
• A two-minute introduction on YouTube.
• Further information:
• https://www.cntk.ai/
• https://www.youtube.com/watch?v=-mLdConF1EU
77. Caffe
• What is it:
• Caffe is a deep learning framework designed to be modular and fast – used with CPUs or GPUs.
• Developed by the Berkeley Vision and Learning Center (BVLC) and community contributors.
• What can it be used for:
• Originally developed for machine vision, but now able to handle speech and text problems.
• Video Demonstration
• A three-minute introduction on YouTube.
• Further information:
• http://caffe.berkeleyvision.org/
• https://www.youtube.com/watch?v=bOIZ74rOik0
94. References
• Bordes, A., Chopra, S., & Weston, J. (2014). Question answering with subgraph embeddings. arXiv preprint arXiv:1406.3676.
• Graves, A., Mohamed, A. R., & Hinton, G. (2013, May). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 6645-6649). IEEE.
• Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
• Irsoy, O., & Cardie, C. (2014, October). Opinion mining with deep recurrent neural networks. In EMNLP (pp. 720-728).
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
• Rubenstein, H., & Goodenough, J. B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627-633.
• Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013, October). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (Vol. 1631, p. 1642).
• Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104-3112).
• Tai, K. S., Socher, R., & Manning, C. D. (2015). Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075.