Basic concepts of Deep Learning: its structure and the backpropagation method, with an explanation of autograd in PyTorch (plus data parallelism in PyTorch).
2. Objective
• Understanding AutoGrad
  • Review
  • Logistic Classifier
  • Loss Function
  • Backpropagation
  • Chain Rule
  • Example: Find the gradient of a matrix
  • AutoGrad
  • Solve the example with AutoGrad
• Data Parallelism in PyTorch
  • Why should we use GPUs?
  • Inside CUDA
  • How to parallelize our models
  • Experiment
4. Logistic Classifier (Fully-Connected)
$WX + b = y$

[Figure] Logits $y = (2.0,\ 1.0,\ 0.1)$ pass through $S(y)$ to give probabilities $p = (0.7,\ 0.2,\ 0.1)$ for the classes A, B, C.

X : Input
W, b : To be trained
y : Prediction (logits)
S(y) : Softmax function (can be other activation functions)

$S(y)_i = \dfrac{e^{y_i}}{\sum_j e^{y_j}}$ represents the probabilities of the elements in vector $y$.
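As a quick check, the logits from the figure can be pushed through `torch.softmax` to reproduce those probabilities. This is only a minimal sketch: the tensor values come from the figure above, everything else is my own naming.

```python
import torch

# Logits from the figure: y = (2.0, 1.0, 0.1)
y = torch.tensor([2.0, 1.0, 0.1])

# S(y)_i = exp(y_i) / sum_j exp(y_j)
p = torch.softmax(y, dim=0)
print(p)        # tensor([0.6590, 0.2424, 0.0986]) ~ (0.7, 0.2, 0.1) for classes A, B, C
print(p.sum())  # 1.0 -- the probabilities sum to one
```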
6. Loss Function
• The vector can be very large when there are a lot of classes.
• How can we find the distance between the vectors S (prediction) and L (label)? Use the cross-entropy below (and see the sketch after this slide):

$D(S, L) = -\sum_i L_i \log(S_i)$

[Figure] Example: $S(y) = (0.7,\ 0.2,\ 0.1)$, $L = (1.0,\ 0.0,\ 0.0)$.

※ $D(S, L) \neq D(L, S)$: the cross-entropy is not symmetric.
Don't worry about $\log(0)$: the log is only taken of the softmax outputs $S(y)_i = \dfrac{e^{y_i}}{\sum_j e^{y_j}}$, which are never exactly zero.
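A minimal sketch of the cross-entropy distance using the values above; the variable names are mine, not from the slides.

```python
import torch

S = torch.tensor([0.7, 0.2, 0.1])   # S(y): softmax output (prediction)
L = torch.tensor([1.0, 0.0, 0.0])   # L: one-hot label

# D(S, L) = -sum_i L_i * log(S_i)
D_SL = -(L * torch.log(S)).sum()
print(D_SL)                          # ~0.3567 = -log(0.7)

# The distance is not symmetric: swapping the arguments takes log of 0.
D_LS = -(S * torch.log(L)).sum()
print(D_LS)                          # inf
```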
7. In-Depth Look at the Classifier
Let there be the following equations:

1. Affine Sum
   $f(x) = Wx + B$

2. Activation Function
   $y(f) = \mathrm{ReLU}(f)$

3. Loss Function
   $E(y) = \frac{1}{2}\,(y_{target} - y)^2$

4. Gradient Descent
   $w \leftarrow w - \alpha \dfrac{\partial E}{\partial w}$,  $b \leftarrow b - \alpha \dfrac{\partial E}{\partial b}$

• Gradient descent requires $\dfrac{\partial E}{\partial w}$ and $\dfrac{\partial E}{\partial b}$.
• How can we find them? -> Use the chain rule! (A one-neuron sketch follows below.)

$y_{target}$ : Training data (label)
$y$ : Prediction result
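A one-neuron sketch of equations 1-4, letting autograd apply the chain rule for us. All the numbers here are arbitrary assumptions chosen only to make a single update step concrete.

```python
import torch

x = torch.tensor(1.5)                      # input
y_target = torch.tensor(2.0)               # training data (label)
w = torch.tensor(0.8, requires_grad=True)  # weight, to be trained
b = torch.tensor(0.1, requires_grad=True)  # bias, to be trained
alpha = 0.1                                # learning rate

f = w * x + b                              # 1. affine sum
y = torch.relu(f)                          # 2. activation function
E = 0.5 * (y_target - y) ** 2              # 3. loss
E.backward()                               # chain rule: fills w.grad = dE/dw, b.grad = dE/db

with torch.no_grad():                      # 4. gradient descent update
    w -= alpha * w.grad
    b -= alpha * b.grad
```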
10. Example: Finding the gradient of $X$
• Let the input tensor $X$ be initialized as the following 3×3 matrix:

$X = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$

• And let $A$, $B$ be defined as follows:

$A = X + 3$
$B = 6A^2 = 6(X + 3)^2$

• And the output $L$ is the average of tensor $B$:

$L = \mathrm{mean}(B) = \dfrac{1}{9}\sum_{i,j} B_{ij}$
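Before differentiating by hand, here is how this setup could be written as PyTorch tensors. This is only a sketch: `requires_grad=True` simply marks $X$ so that autograd can track it later.

```python
import torch

X = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]], requires_grad=True)

A = X + 3                # A = X + 3
B = 6 * A ** 2           # B = 6(X + 3)^2
L = B.mean()             # L = mean(B); equals 424.0 for this X
```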
11. Example: Finding the gradient of $X$
• We can write each scalar $B_{ij}$ directly from the definitions, since the operations act elementwise:

$B_{ij} = 6(A_{ij})^2$
$A_{ij} = X_{ij} + 3$

• To find the gradient, we use the chain rule so that we can multiply the partial gradients together:

$\dfrac{\partial L}{\partial B_{ij}} = \dfrac{1}{9}, \qquad \dfrac{\partial B_{ij}}{\partial A_{ij}} = 12A_{ij}, \qquad \dfrac{\partial A_{ij}}{\partial X_{ij}} = 1$

$\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial B_{ij}} \cdot \dfrac{\partial B_{ij}}{\partial A_{ij}} \cdot \dfrac{\partial A_{ij}}{\partial X_{ij}} = \dfrac{1}{9} \cdot 12A_{ij} \cdot 1 = \dfrac{4}{3}\,(X_{ij} + 3)$
12. Example: Finding the gradient of $X$
• Thus we get the gradient of the (1,1) element of $X$:

$\dfrac{\partial L}{\partial X_{11}} = \dfrac{4}{3}\,(X_{ij} + 3)\Big|_{(i,j)=(1,1)} = \dfrac{4}{3}\,(1 + 3) = \dfrac{16}{3}$

• In the same way, we can get the whole gradient matrix of $X$:

$\dfrac{\partial L}{\partial X} =
\begin{pmatrix}
\frac{\partial L}{\partial X_{11}} & \frac{\partial L}{\partial X_{12}} & \frac{\partial L}{\partial X_{13}} \\
\frac{\partial L}{\partial X_{21}} & \frac{\partial L}{\partial X_{22}} & \frac{\partial L}{\partial X_{23}} \\
\frac{\partial L}{\partial X_{31}} & \frac{\partial L}{\partial X_{32}} & \frac{\partial L}{\partial X_{33}}
\end{pmatrix}
=
\begin{pmatrix}
\frac{16}{3} & \frac{20}{3} & \frac{24}{3} \\
\frac{28}{3} & \frac{32}{3} & \frac{36}{3} \\
\frac{40}{3} & \frac{44}{3} & \frac{48}{3}
\end{pmatrix}$
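The same result can be obtained by letting autograd apply the chain rule, which is what the "Solve the example with AutoGrad" part of this talk is about. A minimal sketch:

```python
import torch

X = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]], requires_grad=True)

A = X + 3
B = 6 * A ** 2
L = B.mean()

L.backward()      # autograd applies dL/dX = (4/3)(X + 3) for us
print(X.grad)     # tensor([[ 5.3333,  6.6667,  8.0000],
                  #         [ 9.3333, 10.6667, 12.0000],
                  #         [13.3333, 14.6667, 16.0000]])  ->  16/3, 20/3, ..., 48/3
```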
20. Why GPU? (CUDA)
[Figure] CPU vs. GPU: a CPU has a handful of cores (each running a couple of threads) clocked at about 3.6 GHz and is good for a few huge tasks; the GPU shown has 3584 CUDA cores clocked at 1.6 GHz (2.0 GHz overclocked) and is good for an enormous number of small tasks.
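A rough way to feel this difference from PyTorch: a large matrix multiplication is exactly an "enormous number of small tasks". This is only a sketch; the matrix size is an arbitrary assumption, and the GPU path runs only if CUDA is available.

```python
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

t0 = time.time()
a @ b                                   # runs on a few fast CPU cores
print(f"CPU matmul: {time.time() - t0:.3f} s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()            # wait for the copies to finish
    t0 = time.time()
    a_gpu @ b_gpu                       # runs on thousands of slower CUDA cores in parallel
    torch.cuda.synchronize()            # wait for the kernel to finish
    print(f"GPU matmul: {time.time() - t0:.3f} s")
```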
21. Dataflow Diagram
[Figure] The GPU acts as a co-processor next to the CPU, each with its own memory: device memory is allocated with cudaMalloc(), data moves between the two with cudaMemcpy(), and a __global__ kernel such as sum() in hello.cu is compiled by NVCC. The typical flow is:
1. Memcpy: copy the host inputs h_a, h_b to the device buffers d_a, d_b.
2. Kernel call (or a cuBLAS routine): run sum on the GPU, producing d_out.
3. Memcpy: copy d_out back to h_out on the host.
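In PyTorch the same three-step dataflow is hidden behind `tensor.to()` / `.cuda()`, which allocate device memory, copy data, and launch kernels internally. A small sketch of the analogy (tensor names mirror the h_/d_ convention above and are otherwise my own):

```python
import torch

h_a = torch.randn(1000)        # host (CPU) tensors, like h_a / h_b above
h_b = torch.randn(1000)

if torch.cuda.is_available():
    d_a = h_a.to("cuda")       # 1. Memcpy: host -> device (cudaMalloc + cudaMemcpy underneath)
    d_b = h_b.to("cuda")
    d_out = d_a + d_b          # 2. Kernel call: an elementwise add kernel runs on the GPU
    h_out = d_out.to("cpu")    # 3. Memcpy: device -> host
```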
22. CUDA on a Multi-GPU System
[Figure] Quad SLI: 14,336 CUDA cores and 48 GB of VRAM in total.
How can we use multiple GPUs in PyTorch?
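Before parallelizing anything, it is worth checking what PyTorch can actually see on the machine. A small sketch:

```python
import torch

print(torch.cuda.is_available())      # is there at least one usable GPU?
print(torch.cuda.device_count())      # how many GPUs PyTorch can see

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```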
24. Problem
- Duration & Memory Allocation
• A large batch size causes a lack of memory.
  • An out-of-memory error from PyTorch makes the Python kernel die.
  • So we can't set a large batch size.
• Can only afford batch_size = 5, num_workers = 2.
  • Can't divide the work up with the other GPUs.
• Elapsed time: 25 m 44 s (10 epochs).
  • Reached 99% accuracy in 9 epochs (on the training set).
  • It takes too much time.
25. Data Parallelism in PyTorch
• Implemented as torch.nn.DataParallel().
  • Can be used to wrap a module or model.
• Also supports lower-level primitives (torch.nn.parallel.*); see the sketch after this list.
  • Replicate: replicate the model on multiple devices (GPUs).
  • Scatter: distribute the input along the first dimension.
  • Gather: gather and concatenate the input along the first dimension.
  • Apply-Parallel: apply a set of already-distributed inputs to a set of already-distributed models.
• PyTorch Tutorials: Multi-GPU examples
  • https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
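The referenced tutorial composes those primitives roughly like this (a sketch adapted from that page; `module`, `input`, and `device_ids` are placeholders supplied by the caller):

```python
import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    """Roughly what nn.DataParallel does, written with the primitives."""
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)      # Replicate the model
    inputs = nn.parallel.scatter(input, device_ids)           # Scatter the input along dim 0
    replicas = replicas[:len(inputs)]                         # one replica per input chunk
    outputs = nn.parallel.parallel_apply(replicas, inputs)    # Apply-Parallel
    return nn.parallel.gather(outputs, output_device)         # Gather the outputs along dim 0
```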
26. Easy to Use : nn.DataParallel(model)
- Practical Example
1. Define the model.
2. Wrap the model with nn.DataParallel().
3. Access the original layers through .module, as in the sketch below.
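A minimal sketch of those three steps; the model itself is a made-up single-layer example, not the one used in the experiment.

```python
import torch
import torch.nn as nn

class Net(nn.Module):                          # 1. Define the model.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(784, 10)

    def forward(self, x):
        return self.fc(x)

model = Net()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)             # 2. Wrap the model with nn.DataParallel().
if torch.cuda.is_available():
    model = model.cuda()

# 3. The wrapper hides the original layers; reach them through .module.
fc_layer = model.module.fc if isinstance(model, nn.DataParallel) else model.fc
```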
27. After Parallelism
- GPU Utilization
• Hyperparameters
  • Batch size: 128
  • Number of workers: 16
• High utilization.
• Can use a large memory space.
• All GPUs are allocated.
28. After Parallelism
- Training Performance
• Hyperparameters
  • Batch size: 128 (a larger batch size needs more memory space)
  • Number of workers: 16 (setting it to 4 * NUM_GPUs is recommended on the forum)
• Elapsed time: 7 m 50 s (10 epochs).
• Reached 99% accuracy in 4 epochs (on the training set); that took just 3 m 10 s.