2. Neural net training basics
Vectorization - turning different kinds of data into numeric arrays
Parameters - a whole neural net consists of a computation graph and a parameter vector
Minibatches - neural net data requires lots of RAM, so training is done in minibatches
4. Parameters / Neural net structure
Computation graph - a neural net is just a DAG of ndarrays/tensors
The parameters of a neural net can be flattened into a single vector representing all the connections/weights in the graph, as sketched below
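A minimal sketch of that flattening in NumPy (the layer shapes here are made up for illustration):

    import numpy as np

    # Hypothetical per-layer weight matrices for a small net.
    layers = [np.random.randn(784, 256),
              np.random.randn(256, 64),
              np.random.randn(64, 10)]

    # Flatten every weight matrix into one parameter vector.
    params = np.concatenate([w.ravel() for w in layers])

    def unflatten(vec, shapes):
        # Recover the per-layer matrices from the flat vector.
        out, offset = [], 0
        for shape in shapes:
            size = int(np.prod(shape))
            out.append(vec[offset:offset + size].reshape(shape))
            offset += size
        return out

    restored = unflatten(params, [w.shape for w in layers])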
5. Minibatches
Data is partitioned into subsamples
Fits in GPU memory
Trains faster
Each minibatch should be as representative a sample as possible (every label present, spread as evenly as possible); see the sketch below
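A minimal sketch of label-balanced minibatch construction, assuming NumPy arrays X (features) and y (integer labels):

    import numpy as np

    def stratified_minibatches(X, y, batch_size, seed=0):
        # Shuffle within each label class, then interleave the classes so
        # every minibatch sees each label about as evenly as possible.
        rng = np.random.default_rng(seed)
        rank = np.empty(len(y), dtype=int)
        for c in np.unique(y):
            members = np.where(y == c)[0]
            rng.shuffle(members)
            rank[members] = np.arange(len(members))
        order = np.argsort(rank, kind="stable")  # interleaves the classes
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            yield X[batch], y[batch]

    X = np.arange(12).reshape(6, 2)
    y = np.array([0, 1, 0, 1, 0, 1])
    for xb, yb in stratified_minibatches(X, y, batch_size=2):
        print(yb)  # each batch carries one example of each label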
8. Multiple GPUs
Single box
Can be driven by multiple host threads
RDMA (Remote Direct Memory Access) interconnect
NVLink
Typically used on a data center rack
Break the problem up across devices (see the sketch below)
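A minimal sketch of breaking the problem up on a single box: one host thread per GPU, each training on its own slice of the minibatch. Here train_on_device is a hypothetical stand-in for whatever framework call actually moves a shard to a GPU and runs it:

    import threading
    import numpy as np

    def train_on_device(device_id, X_shard, y_shard):
        # Hypothetical placeholder: a real framework call would move the
        # shard to GPU `device_id` and run forward/backward there.
        pass

    def multi_gpu_step(X, y, n_gpus):
        # Split the minibatch into one shard per GPU and drive each
        # shard from its own host thread.
        shards = zip(np.array_split(X, n_gpus), np.array_split(y, n_gpus))
        threads = [threading.Thread(target=train_on_device, args=(i, xs, ys))
                   for i, (xs, ys) in enumerate(shards)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()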
9. Multiple GPUs and Multiple Computers
Coordinate the problem across a cluster
Use GPUs for compute
Can be done via MPI or Hadoop (host-thread coordination)
Parameter server - synchronizes parameters through a master node and handles concerns like the GPU interconnect (see the sketch below)
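A toy in-process sketch of the parameter-server pattern (real systems run pull/push over the network; these names are illustrative, not any particular library's API):

    import numpy as np

    class ParameterServer:
        # The master holds the canonical parameter vector; workers pull
        # the latest copy, compute gradients locally, and push them back.
        def __init__(self, n_params, lr=0.1):
            self.params = np.zeros(n_params)
            self.lr = lr

        def pull(self):
            return self.params.copy()

        def push(self, grad):
            self.params -= self.lr * grad

    server = ParameterServer(n_params=4)
    for worker_grad in [np.ones(4), 2 * np.ones(4)]:  # two workers, one round
        local = server.pull()     # worker fetches current parameters
        server.push(worker_grad)  # worker sends its gradient to the master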
11. Lots of different algorithms
All-reduce (illustrated below)
Iterative Reduce
Pure model parallelism
Parameter averaging is key here
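These differ mainly in how workers exchange updates. As one concrete illustration, an all-reduce over worker gradients is just an element-wise average that every worker receives back (a naive sketch; real implementations use ring or tree schedules to cut communication):

    import numpy as np

    def naive_all_reduce(worker_grads):
        # Every worker contributes its gradient; all receive the average.
        avg = np.mean(worker_grads, axis=0)
        return [avg.copy() for _ in worker_grads]

    grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
    print(naive_all_reduce(grads))  # both workers now hold [2.0, 3.0]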
12. Core Ideas
Partition the problem into chunks
Can be the neural net itself (model parallelism)
As well as the data (data parallelism)
Use as many CUDA or CPU cores as possible
13. How does parameter averaging work?
Replicate the model across the cluster
Train each replica on a different portion of the data with the same model
Synchronize as infrequently as possible while still producing a good model
Hyperparameters should be more aggressive (higher learning rates); see the sketch below
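A minimal NumPy sketch of that loop, with a toy per-shard objective standing in for real gradients (the shard data, sync interval, and learning rate are all made up):

    import numpy as np

    def local_gradient(params, shard):
        # Toy objective: pull the parameters toward this shard's mean.
        return params - shard.mean(axis=0)

    def parameter_averaging(shards, steps=100, sync_every=10, lr=0.5):
        # One model replica per worker, all starting from the same point.
        replicas = [np.zeros(shards[0].shape[1]) for _ in shards]
        for step in range(1, steps + 1):
            for params, shard in zip(replicas, shards):
                params -= lr * local_gradient(params, shard)  # local SGD step
            if step % sync_every == 0:
                # Synchronize as rarely as model quality allows:
                # average the replicas and redistribute the result.
                avg = np.mean(replicas, axis=0)
                for params in replicas:
                    params[:] = avg
        return np.mean(replicas, axis=0)

    rng = np.random.default_rng(0)
    # Three workers, each holding a different slice of the data.
    shards = [rng.normal(loc=m, size=(50, 3)) for m in (-1.0, 0.0, 1.0)]
    print(parameter_averaging(shards))  # lands near the global mean, ~0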
17. Tuning distributed training
Averaging acts as a form of regularization
Needs more aggressive hyperparameters (see the learning-rate sketch after this list)
Not always going to be faster - account for how many data points you have
Distributed-systems wisdom applies here: send code to the data, not the other way around
Reduce communication overhead for maximum performance
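One common way to make the hyperparameters more aggressive is the linear scaling heuristic: grow the learning rate with the number of workers, since the effective batch size grows with them too (the base value here is made up):

    base_lr = 0.01            # tuned for single-worker training (illustrative)
    n_workers = 8
    lr = base_lr * n_workers  # linear scaling heuristic for distributed SGD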