by Vikram Madan, Sr. Product Manager, AWS Deep Learning
In this workshop, we cover deep learning fundamentals and focus on the powerful and scalable Apache MXNet open source deep learning framework. By the end of this tutorial you'll be able to train your own deep neural network and fine-tune existing state-of-the-art models for image and object recognition. We'll also take a deep dive into setting up your deep learning infrastructure on AWS and deploying models on AWS Lambda.
10. Inputs: Data Preprocessing, Batches, Epochs
Preprocessing
§ Random separation of data into training, validation, and test sets
§ Necessary for measuring the accuracy of the model
Batch
§ The amount of data propagated through the network at each iteration
§ Enables faster optimization through shorter iteration cycles
Epoch
§ A complete pass through all the training data
§ Optimization will run for multiple epochs to reduce the error rate
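A minimal NumPy sketch of these three ideas; the 80/10/10 split, batch size of 32, and epoch count of 5 are illustrative choices, not values from the slides:

import numpy as np

data = np.random.rand(1000, 10)                      # 1,000 examples, 10 features

# Preprocessing: random separation into training / validation / test sets
idx = np.random.permutation(len(data))
train, val, test = np.split(data[idx], [800, 900])   # 80% / 10% / 10%

batch_size, epochs = 32, 5
for epoch in range(epochs):                          # one epoch = full pass over training data
    for i in range(0, len(train), batch_size):
        batch = train[i:i + batch_size]              # one batch per iteration
        ...                                          # forward pass, loss, parameter update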
12. Activation Functions
Activation functions add nonlinearity to a layer and are applied to the layer's output.
There are several options:
§ Rectified Linear Unit (ReLU)
§ Sigmoid
§ Hyperbolic Tangent (tanh)
§ Softplus
ReLU functions are the most commonly used today.
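A quick NumPy sketch of the four options listed above (the sample inputs are made up):

import numpy as np

def relu(x):     return np.maximum(0, x)        # max(0, x): zero for negatives, identity otherwise
def sigmoid(x):  return 1 / (1 + np.exp(-x))    # squashes output to (0, 1)
def tanh(x):     return np.tanh(x)              # squashes output to (-1, 1)
def softplus(x): return np.log1p(np.exp(x))     # smooth approximation of ReLU

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x), softplus(x))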
13. Deep Neural Network
[Diagram: input layer → hidden layers → output layer]
The optimal size of a hidden layer (number of nodes) is typically between the size of the input layer and the size of the output layer.
14. The “Learning” in Deep Learning
[Diagram: an input with label X is fed forward through the network to produce a prediction X1; while X1 != X, backpropagation (gradient descent) nudges each weight by a small delta (e.g., 0.4 ± 𝛿), and the network is run again with the new weights.]
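A minimal sketch of that loop in NumPy for a single sigmoid neuron; the toy data and learning rate are made up, and only the starting weights (0.4, 0.3) come from the diagram:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                    # input
y = (X.sum(axis=1) > 1).astype(float)       # label

w = np.array([0.4, 0.3])                    # initial weights, as in the diagram
b, lr = 0.0, 0.1

for step in range(1000):
    pred = 1 / (1 + np.exp(-(X @ w + b)))   # forward pass through a sigmoid neuron
    err = pred - y                          # compare prediction with label
    # backpropagation: gradient of the loss with respect to each parameter
    w -= lr * (X.T @ err) / len(X)          # new weights: nudge each by a small delta
    b -= lr * err.mean()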
15. Classification with the Softmax Function
Softmax converts the output layer into probabilities – necessary for classification.
Source: https://stats.stackexchange.com/questions/273465/neural-network-softmax-activation
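Concretely, softmax(z)_i = exp(z_i) / Σ_j exp(z_j). A minimal NumPy sketch, with the usual max-subtraction trick for numerical stability (the logits are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # subtracting the max avoids overflow; result is unchanged
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw output-layer values
print(softmax(logits))               # probabilities that sum to 1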
16. Loss Function
• An objective function that quantifies how successful the model was in its predictions
• A measure of the difference between a neural net's prediction and the actual value – that is, the error
• Typically we use cross-entropy loss, which adjusts the plain loss calculation to mitigate learning slowdown
• Backpropagation is performed to calculate the error contribution of each neuron after processing one batch
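A minimal sketch of cross-entropy loss for a single prediction; the softmax output and one-hot label are made up:

import numpy as np

def cross_entropy(pred, label):
    # -sum(label * log(pred)); clipping guards against log(0)
    return -np.sum(label * np.log(np.clip(pred, 1e-12, 1.0)))

pred  = np.array([0.7, 0.2, 0.1])    # softmax output
label = np.array([1.0, 0.0, 0.0])    # one-hot actual value
print(cross_entropy(pred, label))    # -log(0.7) ≈ 0.357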
18. Stochastic Gradient Descent
Gradient Descent
A single iteration of the parameter update runs through ALL of the training data.
Stochastic Gradient Descent
A single iteration of the parameter update runs through a BATCH of the training data.
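The difference in a NumPy sketch, using a linear model with mean-squared-error loss; the data, learning rate, and batch size are made up:

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.random((1000, 5)), rng.random(1000)
w = np.zeros(5)
lr, batch_size = 0.1, 32

def grad(Xb, yb, w):
    # gradient of the MSE loss on the given subset of the data
    return 2 * Xb.T @ (Xb @ w - yb) / len(Xb)

# Gradient descent: one parameter update per pass over ALL the data
for epoch in range(10):
    w -= lr * grad(X, y, w)

# Stochastic gradient descent: one parameter update per BATCH
for epoch in range(10):
    for i in range(0, len(X), batch_size):
        w -= lr * grad(X[i:i + batch_size], y[i:i + batch_size], w)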
32. Apache MXNet
§ Programmable: simple syntax, multiple languages
§ Portable: highly efficient models for mobile and IoT (a ResNet 1,024-layer network is ~4 GB)
§ High performance: near-linear scaling across hundreds of GPUs (88% efficiency on 256 GPUs)
33. Scaling with MXNet
[Chart: speedup vs. number of GPUs (1, 2, 4, 8, 16, 32, 64, 128, 256) for Inception v3, ResNet, and AlexNet against the ideal line; ~88% efficiency]
• CloudFormation with the Deep Learning AMI
• 16x P2.16xlarge instances, mounted on EFS
• Inception and ResNet: batch size 32; AlexNet: batch size 512
• ImageNet: 1.2M images, 1K classes
• 152-layer ResNet: 5.4 days on 4x K80s (1.2 h per epoch), 0.22 top-1 error
35. Deep Learning AMIs
http://bit.ly/deepami
Deep learning any way you want on AWS
• A tool for data scientists and developers
• Setting up a DL system takes (install) time & skill
• Packages kept up to date and compiled (MXNet, TensorFlow, Caffe, Torch, Theano, Keras)
• Anaconda, Jupyter, Python 2 and 3
• NVIDIA drivers for G2 and P2 instances
• Intel MKL drivers for all other instances (C4, M4, …)
37. Imperative Programming
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
Easy to tweak with Python code
PROS
• Straightforward and flexible
• Takes advantage of native language features (loops, conditionals, the debugger)
• E.g., NumPy, Matlab, Torch, …
CONS
• Hard to optimize
38. Declarative Programming
A = Variable('A')
B = Variable('B')
C = B * A
D = C + 1
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10)*2)
C can share memory with D, because C is deleted later
[Diagram: computation graph with inputs A and B feeding a multiply node, followed by a "+1" node]
PROS
• More opportunities for optimization
• Works across different languages
• E.g., TensorFlow, Theano, Caffe
CONS
• Less flexible
39. MXNet: Mixed Programming Paradigm
IMPERATIVE: NDARRAY API
>>> import mxnet as mx
>>> a = mx.nd.zeros((100, 50))
>>> b = mx.nd.ones((100, 50))
>>> c = a + b
>>> c += 1
>>> print(c)
DECLARATIVE: SYMBOLIC EXECUTOR
>>> import mxnet as mx
>>> net = mx.symbol.Variable('data')
>>> net = mx.symbol.FullyConnected(data=net, num_hidden=128)  # num_hidden value truncated in the source; 128 assumed
>>> net = mx.symbol.SoftmaxOutput(data=net)
>>> texec = mx.module.Module(net)
>>> texec.forward(data=c)
>>> texec.backward()
An NDArray can be set as input to the graph.
40. Embed symbolic expressions into imperative programming
texec = mx.module.Module(net)
for batch in train_data:
    texec.forward(batch)
    texec.backward()
    # manual SGD update with learning rate 0.2
    for param, grad in zip(texec.get_params(), texec.get_grads()):
        param -= 0.2 * grad
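Pulling the pieces together, a hedged end-to-end sketch using the Module.fit API from MXNet 1.x; the toy data, layer sizes, learning rate, and epoch count are all illustrative assumptions:

import mxnet as mx
import numpy as np

# toy data: 1,000 samples, 100 features, 10 classes (made up)
X = np.random.rand(1000, 100).astype('float32')
y = np.random.randint(0, 10, (1000,))
train_iter = mx.io.NDArrayIter(X, y, batch_size=32, shuffle=True)

# declarative part: define the network symbolically
net = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data=net, num_hidden=64)
net = mx.sym.Activation(data=net, act_type='relu')
net = mx.sym.FullyConnected(data=net, num_hidden=10)
net = mx.sym.SoftmaxOutput(data=net, name='softmax')

# imperative part: bind the graph to a module and train with SGD
mod = mx.mod.Module(symbol=net)
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=5)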