2019 HPCC Systems® Community Day
Challenge Yourself – Challenge the Status Quo
Expanding HPCC Systems Deep Neural Network Capabilities
Robert Kennedy, PhD Candidate at Florida Atlantic University
Taghi M. Khoshgoftaar, PhD | Advisor
Timothy Humphrey | LexisNexis Mentor
Overview
• Both topics covered here are a result of my Summer Internship
• Work is available on GitHub
• Tool for creating “Standard” HPCC Systems Platform Virtual Machines
• Hyper-V, AWS, Azure, VirtualBox, etc…
• https://github.com/xwang2713/cloud-image-build
• In addition, used for creating NVIDIA GPU Enabled VMs (AWS AMI)
• Started a GPU Enabled Deep Learning Bundle
• Demonstrating GPU accelerated Deep Learning on HPCC Systems
• https://github.com/hpcc-systems/GPU-Deep-Learning
GPU Accelerated HPCC Systems | Robert Kennedy 2
HPCC Systems on Hyper-V
• Used Packer.io to generate machine images
• To create a Hyper-V Image:
• https://github.com/xwang2713/cloud-image-build/tree/master/packer/hyper-v
• Hyper-V VMs can be used similarly to the VirtualBox VMs you might already be using
• Hyper-V images build locally, on a Hyper-V enabled machine
• The installed programs list can be easily modified in the .json config
• Running the HPCC Systems Platform on Hyper-V allows Docker Desktop (Windows) to be used
• Docker Desktop uses Hyper-V, and Hyper-V and VirtualBox can’t run concurrently
GPU Accelerated HPCC Systems | Robert Kennedy 3
Config File
• Packer.io uses a .json file as its config
• Defines the network (e.g. for VirtualBox)
• Defines the size of the machine (for cloud providers)
• The config defines which software gets installed, via standard Linux commands (see the sketch below)
GPU Accelerated HPCC Systems | Robert Kennedy 4
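As a rough illustration of editing the installed-software list, the Python sketch below loads a Packer template and appends one more install command to a shell provisioner. The file name, the added package, and the assumption that the template uses an inline shell provisioner are hypothetical; check the cloud-image-build repo for the actual layout.

```python
import json

# Load a Packer template (file name is hypothetical).
with open("hpcc-hyperv.json") as f:
    template = json.load(f)

# Packer shell provisioners carry the Linux install commands; add one more package.
for prov in template.get("provisioners", []):
    if prov.get("type") == "shell":
        prov.setdefault("inline", []).append("sudo apt-get install -y htop")

# Write the modified template back out.
with open("hpcc-hyperv.json", "w") as f:
    json.dump(template, f, indent=2)
```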
GPU Enabled Virtual Machines
• Using the same tool, GPU enabled VMs can be created
• Cloud images build in cloud, local images build locally
• This work supports the use of Python 3.6, CUDA 10.0, TensorFlow 1.14, and
PyTorch 1.1
• AWS GPU Instances:
• K80s, V100s
• Azure GPU Instances:
• K80s (12 GB VRAM)
• V100s (16 GB VRAM, with and without NVLink)
• P100s (16 GB VRAM)
GPU Accelerated HPCC Systems | Robert Kennedy 5
Bundle Implementation
HPCC Systems and GPU Accelerated Deep Learning
• The current HPCC Systems Platform is CPU only, and so are its DL runtimes
• My previous work was with distributed DL on HPCC Systems using only CPUs
• Traditional HPCC Systems use commodity computers connected via standard network protocols
• With respect to Deep Learning, this presents a large communication bottleneck, partly due to DL’s iterative nature
• Graphics Processing Units (GPUs) are used to decrease the computation time for Neural Networks
• Single or multiple GPUs are connected to the CPU (central node) via much faster hardware connections
• A new bundle was started to enable GPU-accelerated Deep Learning on the HPCC Systems Platform
GPU Accelerated HPCC Systems | Robert Kennedy 7
GPU Accelerated Deep Learning
• With this bundle, you can train NN models on the GPU
• Sprayed data is used as training data
• Bundle is in its infancy, but you can build, train, and use neural networks
• Using only ECL
• Using ECL and Python allows for more customized NN architectures and training routines
• A trained model (either in ECL or ECL+Python) can be used to predict on sprayed data
• It returns its predictions via records in a one-hot-encoded format
GPU Accelerated HPCC Systems | Robert Kennedy 8
Bundle Implementation Overview
• Current work uses only one Thor node
• Single Thor node still can use multiple GPUs
• ECL/HPCC Systems handles the data storage and execution of the NN runtimes
• The implementation uses data parallelism across one or more GPUs
• Currently limited to only a single physical computer
• The pyembed plugin allows for Python to run on HPCC Systems Platform
• We use Python 3, as Python 2 is nearing EOL
• Python code handles the NN training and interfaces with GPUs directly using NVIDIA’s CUDA language
GPU Accelerated HPCC Systems | Robert Kennedy 9
TensorFlow | Keras
• The Python code is written with TensorFlow and Keras
• TensorFlow
• Google’s popular Deep Learning library
• Keras
• Deep Learning library API – uses TensorFlow or another ‘backend’
• Much less code to produce the same model (see the sketch below)
10
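To show how compact the Keras API is, here is a minimal sketch (not taken from the bundle) that defines and compiles a small network in a handful of lines; the layer sizes and optimizer are illustrative assumptions.

```python
import tensorflow as tf

# A small fully connected network for 784-pixel inputs and 10 classes.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# One call configures the optimizer, loss, and metrics...
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# ...and one call would train it (data preparation omitted here):
# model.fit(x_train, y_train, batch_size=128, epochs=20)
```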
Artificial Neural Networks
Biological Neuron
• Basis for artificial neural networks
• Such as the ones in deep learning
• Dendrites
• Input vector, from previous
neurons
• Weights
• Soma
• Summation Function
• Axon
• Activation Function
• A neuron ‘fires’ when there is enough of an input stimulus
GPU Accelerated HPCC Systems | Robert Kennedy 12
[Figure: biological neuron with dendrite, soma, and axon labeled]
Artificial Neuron
• First conceptualized in 1943
• The inputs of the neuron are the outputs of the previous layer’s neurons
• The weighted inputs are summed together with a bias
• Then passed into an activation function
• Activation functions are like the biological neuron ‘deciding’ to fire
• ReLU activation – outputs x if x > 0, and outputs 0 if x ≤ 0, where x is the input (see the sketch below)
GPU Accelerated HPCC Systems | Robert Kennedy 13
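As a small illustration of the bullets above, here is a NumPy sketch (not from the bundle) of a single artificial neuron: a weighted sum of the inputs plus a bias, passed through a ReLU activation.

```python
import numpy as np

def relu(x):
    # ReLU: pass x through when x > 0, output 0 otherwise.
    return np.maximum(0.0, x)

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, then the activation function.
    return relu(np.dot(inputs, weights) + bias)

# Example: three inputs from the previous layer's neurons.
print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3]), bias=0.05))  # ~0.3
```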
A Fully Connected Network
• Fully Connected Network
• Each neuron is connected to
every neuron in the subsequent
layer
• Neural Network Visualization
• 2 hidden layers, fully connected, 3
class classification output
• Multi-Layer Perceptron is an example
GPU Accelerated HPCC Systems | Robert Kennedy 14
Neural Network Training
• Forward propagation
• Backpropagation
• Optimize the model with respect to a loss function
• A quantification of how “right or wrong” the model is for any given datum
• Gradient Descent
• Stochastic Gradient Descent (SGD)
• Mini-batch SGD
• Right: visualization of gradient
descent over an example loss
function
GPU Accelerated HPCC Systems | Robert Kennedy 15
Gradient Descent In Action
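To make the mini-batch SGD idea concrete, here is a toy NumPy sketch (not from the bundle) that fits a linear model by repeatedly stepping against the gradient of the mean squared error computed on small batches.

```python
import numpy as np

# Synthetic data: y = X @ true_w plus a little noise.
rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.randn(1000)

w = np.zeros(3)                 # model parameters to learn
lr, batch_size = 0.1, 32        # learning rate and mini-batch size
for epoch in range(20):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = order[start:start + batch_size]
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)  # gradient of MSE on this batch
        w -= lr * grad                                    # step downhill
print(w)  # close to [2.0, -1.0, 0.5]
```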
Where Exactly Do the GPUs Come Into Play?
• Training a NN model is the most time-consuming part; this is where the GPU is used to dramatically reduce computation time
• Two main training steps
• Forward pass – weights and errors
• Backward pass – gradients and weight updates
• Computationally expensive convolutions are offloaded onto GPUs
• These steps are done for each data point, multiple times
GPU Accelerated HPCC Systems | Robert Kennedy 16
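In the TensorFlow 1.14 stack used here, convolutions land on the GPU automatically when one is visible; the minimal sketch below only illustrates explicit device placement and is not taken from the bundle.

```python
import tensorflow as tf

# Pin a large matrix multiply to the first GPU; soft placement falls back to
# the CPU if no GPU is available.
with tf.device("/GPU:0"):
    a = tf.random.normal([4096, 4096])
    b = tf.random.normal([4096, 4096])
    c = tf.matmul(a, b)

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:   # TF 1.x graph-mode style
    sess.run(c)
```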
Parallel Paradigms
• Data Parallelism
• Model Parallelism
• Synchronous and
Asynchronous
• Parallel SGD
GPU Accelerated HPCC Systems | Robert Kennedy 17
[Figure: data parallelism vs. model parallelism]
Model Parallelism
• Neural Network Model is split across
nodes
• For models larger than a GPU’s
memory
• Requires significantly higher
communication bandwidths between
nodes
• Not well suited for a cluster system
• However, this paradigm is feasible for a
multi-GPU system due to faster hardware
speeds
GPU Accelerated HPCC Systems | Robert Kennedy 18
Data Parallelism
• Data is partitioned and distributed to nodes
• A single NN model is replicated onto each node
• Only weight updates are communicated and aggregated
• As defined by the specific parallel training method
• Suitable for parallelizing across multiple nodes in an HPCC Systems cluster or across GPUs in a single system
• This is the paradigm used here (see the sketch below)
GPU Accelerated HPCC Systems | Robert Kennedy 19
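One way data parallelism looks in the TensorFlow 1.14 / Keras stack named earlier is the multi_gpu_model utility, which replicates a model across GPUs and splits each batch between them; whether the bundle uses this exact mechanism is not stated, so treat this as an illustrative sketch.

```python
import tensorflow as tf

def build_model():
    # Any single-GPU Keras model; layer sizes here are placeholders.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

model = build_model()
# Replicate onto 4 GPUs; each replica processes a slice of every batch and the
# per-GPU results are merged on the CPU by default.
parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=4)  # assumes 4 visible GPUs
parallel_model.compile(optimizer="sgd", loss="categorical_crossentropy")
# parallel_model.fit(x_train, y_train, batch_size=512)  # 128 examples per GPU
```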
Not Your Average HPCC Systems
• Slightly different from traditional HPCC Systems topologies
• The whole figure represents a single physical computer and Thor node
• Parameter Server
• This is the CPU on the system
• Nodes (blue)
• Each node represents a single physical GPU
• Connections are high-speed hardware
• PCI Express 3.0 provides roughly 985 MB/s per lane (about 15.8 GB/s across 16 lanes)
• NVLINK is roughly 10x faster than PCIe Gen 3
GPU Accelerated HPCC Systems | Robert Kennedy 20
Workflow Example
Bundle Usage Example Architecture
• We will create a Convolutional Neural Network (CNN) and train it on the MNIST dataset
• MNIST is a 10-class image classification dataset of handwritten digits 0-9
• The CNN takes 784 pixels as input (each with a range of 0-255)
• Two convolutional layers
• One fully connected layer with 128 neurons
• 10 output neurons (one for each class)
• Total of 1,199,882 trainable parameters (see the sketch below)
• Processing through 720,000 MNIST images
GPU Accelerated HPCC Systems | Robert Kennedy 22
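A Keras sketch matching the layer sizes listed above follows; with two 3x3 convolutions (32 and 64 filters), a 2x2 max-pool, a 128-neuron dense layer, and 10 outputs, the trainable parameter count works out to exactly 1,199,882. The filter counts and pooling are assumptions inferred from that total, so the bundle's actual layers may differ in detail.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),  # 320 params
    layers.Conv2D(64, (3, 3), activation="relu"),                           # 18,496 params
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                                   # 1,179,776 params
    layers.Dense(10, activation="softmax"),                                 # 1,290 params
])
model.summary()  # total trainable parameters: 1,199,882
```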
Spray MNIST Dataset
• MNIST included in bundle
• Test and train sets, 785-value fixed-length records (784 pixels plus one label)
• 60,000 28x28 grayscale training images
• 10,000 28x28 grayscale test images
• Both are labeled as one of 10 classes, 0-9
GPU Accelerated HPCC Systems | Robert Kennedy 23
Image Visualization
• Imported raw MNIST data
• Visualization of a single MNIST image in the “data” format
• Each pixel has a value between 0-255, represented as a 2-digit hex number
• Each pixel is a feature
GPU Accelerated HPCC Systems | Robert Kennedy 24
Preparing the Data
• Currently, the bundle demonstrates how to train on image data
• Includes an example NN and example datasets (MNIST and Fashion-MNIST)
• Training data and labels are molded into NumPy arrays with a specified shape before training (see the sketch below)
• Here, shape is the dimensions of the image
• i.e. the dimensions of the input features
• These get flattened to an array of 784 inputs for 784 input neurons
GPU Accelerated HPCC Systems | Robert Kennedy 25
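A minimal NumPy sketch of that molding step is below; the array here is a stand-in for the sprayed pixel data, and the 0-1 scaling is a common convention rather than something stated on the slide.

```python
import numpy as np

flat = np.zeros((60000, 784), dtype=np.uint8)   # stand-in for 60,000 sprayed records of 784 pixels
# Mold into the (samples, height, width, channels) shape a 2-D convolution expects,
# scaling the 0-255 pixel values to 0-1.
images = flat.reshape(60000, 28, 28, 1).astype("float32") / 255.0
# A dense input layer would instead take the flattened 784 features per image.
flattened = images.reshape(60000, 784)
print(images.shape, flattened.shape)  # (60000, 28, 28, 1) (60000, 784)
```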
Creating a CNN – model.add() method
• First, we define the optimizer and its
parameters
• Next, we define the training scheme
• Batch size = 128
• We’ll train for 20 epochs
GPU Accelerated HPCC Systems | Robert Kennedy 26
Creating a CNN – model.add() method
• Next, we define the NN architecture
• Input shape, 28x28x1 grayscale
images
• Initialize the model
• The “nnOutputLayer” is the final layer and, at this point, represents the entire NN model
GPU Accelerated HPCC Systems | Robert Kennedy 27
Train the CNN – model.train() method
• “nnOutputLayer” is passed into model.train() along with hyperparameters and training data (a rough Keras equivalent follows below)
GPU Accelerated HPCC Systems | Robert Kennedy 28
[Screenshots labeled GPU: and CPU:]
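The bundle's model.train() is ECL, so the sketch below is only a rough Keras equivalent of this step, reusing the CNN from the architecture sketch above and mirroring the stated hyperparameters (batch size 128, 20 epochs); the optimizer choice is an assumption.

```python
import tensorflow as tf

# Load MNIST and shape it like the preparation sketch earlier.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)   # one-hot labels
y_test = tf.keras.utils.to_categorical(y_test, 10)

# "model" is the CNN defined in the earlier architecture sketch.
model.compile(optimizer="adam",                         # optimizer choice is an assumption
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=20,
          validation_data=(x_test, y_test))
```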
Create CNN – ECL and Python
GPU Accelerated HPCC Systems | Robert Kennedy 29
Example Input and Output
GPU Accelerated HPCC Systems | Robert Kennedy 30
Image Input
One-Hot-Encoded Output
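To read the one-hot output, the predicted class is simply the position of the 1 in each record; a tiny NumPy sketch with a hypothetical prediction row:

```python
import numpy as np

prediction = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])  # hypothetical one-hot output record
print(np.argmax(prediction))  # predicted digit: 7
```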
Performance
Performance Evaluation
• A case study was performed to measure the performance improvements
• 5 identical Convolutional Neural Networks are trained on the MNIST dataset
• 10 times each to provide statistical significance
• Measuring the required training time for the same model on same data using fixed
training parameters
• Faster training time is desired
• CPU Alone, 1, 2, 3, and 4 GPUs
• Older K80s are used
• Newer GPUs would only increase performance and efficiency
• Compared against each other and against the “optimal” speedup
• i.e. linear speedup
GPU Accelerated HPCC Systems | Robert Kennedy 32
Performance Boost: GPU vs. CPU
• Time, in seconds, to train a CNN on
MNIST dataset
• Training time speedup is 5.4x for a K80 GPU vs. a Xeon CPU
• The speedup is large, even for a simple model on small and simple data
• The measured time covers NN training only, not HPCC-specific computations, which would be the same whether on CPU or GPU
GPU Accelerated HPCC Systems | Robert Kennedy 33
Performance Boost: CPU vs. GPU vs Optimal Speedup
• Optimal speedup is linear
• i.e. twice the nodes is twice as fast
• Speedup is not expected to be linear due to communication overheads
• Results show that adding GPUs incurs minimal overhead
GPU Accelerated HPCC Systems | Robert Kennedy 34
Conclusion
• A tool was used to create HPCC Systems virtual machine images on various new platforms
• A good use case is creating GPU-enabled images
• Brief overview of Neural Networks and their optimization
• Demonstrated that GPU-accelerated deep learning is possible on the HPCC Systems Platform
• Demonstrated that GPUs provide a significant performance increase, even on a non-traditional cluster
GPU Accelerated HPCC Systems | Robert Kennedy 35
Future Work
• Implementing generalizable data loaders
• To allow training on data with less knowledge of NumPy (Python)
• Continue adding to the supported methods and ECL modeling functions
• Research and development on integrating model parallelism
• Research on NN training on multi-node clusters where each node can have one or more GPUs
GPU Accelerated HPCC Systems | Robert Kennedy 36
Links
• GitHub
• https://github.com/hpcc-systems/GPU-Deep-Learning
• https://github.com/xwang2713/cloud-image-build
• NVIDIA CUDA
• https://developer.nvidia.com/cuda-toolkit
• TensorFlow
• https://www.tensorflow.org/
• Keras
• https://keras.io/
• NumPy
• https://numpy.org/
GPU Accelerated HPCC Systems | Robert Kennedy 37
GPU Accelerated HPCC Systems | Robert Kennedy 38
Robert Kennedy
PhD Candidate, Florida Atlantic University
rkennedy@fau.edu
Questions?
GPU Accelerated HPCC Systems | Robert Kennedy 39
View this presentation on YouTube:
https://www.youtube.com/watch?v=GMt-_Io4Jys&list=PL-8MJMUpp8IKH5-d56az56t52YccleX5h&index=8&t=0s (4:02)