Training Recurrent Neural Networks at Scale

One of our projects at Baidu's Silicon Valley AI Lab is using deep learning to develop state-of-the-art end-to-end speech recognition systems based on recurrent neural networks for multiple languages. The training set for each language is multiple terabytes in size, and each model requires in excess of 10 exaflops to train. Training such models requires scale and techniques that are unusual for deep learning but more common in high performance computing. I will talk about the challenges involved and the software and hardware solutions that we employ.
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
– Speech recognition
– Emotion recognition
– Semantic understanding
– Dialog systems
– Speech synthesis
Deep Speech Applications
• Voice controlled apps
• Peel Partnership
• English and Mandarin APIs in the US
• Integration into Baidu’s products in China
Deep Speech: End-to-end learning
• Deep neural network predicts the probability of characters directly from audio (sketched below)
[Figure: network output characters over time — T H _ E … D O G]
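To make the end-to-end idea concrete, here is a minimal sketch (plain numpy, with hypothetical layer sizes and a single simple recurrent layer rather than the actual Deep Speech architecture) of a recurrent network that turns audio feature frames into a per-timestep distribution over characters:

import numpy as np

rng = np.random.default_rng(0)
n_feats, n_hidden, n_chars = 160, 256, 29   # hypothetical: spectrogram bins, hidden units, characters incl. blank

Wx = rng.standard_normal((n_hidden, n_feats)) * 0.01   # input -> hidden
Wh = rng.standard_normal((n_hidden, n_hidden)) * 0.01  # hidden -> hidden recurrence
Wy = rng.standard_normal((n_chars, n_hidden)) * 0.01   # hidden -> character logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def char_probs(audio_frames):
    """Return a (timesteps, n_chars) array of per-frame character probabilities."""
    h = np.zeros(n_hidden)
    out = []
    for x in audio_frames:              # one feature frame per timestep
        h = np.tanh(Wx @ x + Wh @ h)    # simple recurrent layer
        out.append(softmax(Wy @ h))     # distribution over characters at this timestep
    return np.array(out)

probs = char_probs(rng.standard_normal((100, n_feats)))   # 100 frames of stand-in audio
print(probs.shape)                                         # (100, 29)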
Deep Speech: CTC
• Simplified sequence of network outputs (probabilities)

        Time →
E       .01   .05   .1    .1    .8    .05
H       .01   .1    .1    .6    .05   .05
T       .01   .8    .75   .2    .05   .1
BLANK   .97   .05   .05   .1    .1    .8
• Generally many more timesteps than letters
• Need to look at all the ways we can write “the”
• Adjacent characters collapse
• TTTHEE, TTTTHE, TTHHEE, THEEEE, ….
• Solve with dynamic programming
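A minimal sketch of these two ideas (illustrative Python, not warp-ctc): the collapse rule that maps paths like TTTHEE to "the", and the dynamic-programming forward pass that sums the probability of every path collapsing to the target, using the simplified table above:

import numpy as np

alphabet = ["-", "T", "H", "E"]                 # "-" is the CTC blank
net_outputs = np.array([                        # per-timestep probabilities from the table above
    [.97, .05, .05, .10, .10, .80],             # blank
    [.01, .80, .75, .20, .05, .10],             # T
    [.01, .10, .10, .60, .05, .05],             # H
    [.01, .05, .10, .10, .80, .05],             # E
])

def collapse(path):
    """CTC collapse rule: merge adjacent repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != "-":
            out.append(c)
        prev = c
    return "".join(out)

assert collapse("TTTHEE") == collapse("TTHHEE") == collapse("THEEEE") == "THE"

def ctc_forward(probs, label, alphabet):
    """Sum the probability of every path that collapses to `label` (forward DP)."""
    idx = {c: i for i, c in enumerate(alphabet)}
    ext = ["-"]                                  # interleave blanks: "THE" -> - T - H - E -
    for c in label:
        ext += [c, "-"]
    T = probs.shape[1]
    alpha = np.zeros((len(ext), T))
    alpha[0, 0] = probs[idx[ext[0]], 0]
    alpha[1, 0] = probs[idx[ext[1]], 0]
    for t in range(1, T):
        for s, c in enumerate(ext):
            a = alpha[s, t - 1]
            if s > 0:
                a += alpha[s - 1, t - 1]
            if s > 1 and c != "-" and c != ext[s - 2]:   # may skip the blank between two different characters
                a += alpha[s - 2, t - 1]
            alpha[s, t] = a * probs[idx[c], t]
    return alpha[-1, -1] + alpha[-2, -1]         # end on the final character or a trailing blank

print("P('THE') =", ctc_forward(net_outputs, "THE", alphabet))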
warp-ctc
• Recently open sourced our CTC implementation
• Efficient, parallel CPU and GPU backend
• 100-400X faster than other implementations
• Apache license, C interface
https://github.com/baidu-research/warp-ctc
Accuracy scales with Data
[Figure: Performance vs. Data & Model Size — deep learning algorithms compared with many previous methods]
• 40% error reduction for each 10x increase in dataset size
Training sets
• Train on ~1½ years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to assembling large datasets
• Datasets drive accuracy
Large Datasets = Large Models
[Figure: Accuracy vs. Dataset Size for a big model and a small model]
• Models require over 20 exaflops to train (exa = 10^18)
• Trained on 4+ Terabytes of audio
Parallelism across GPUs
[Figure: model-parallel vs. data-parallel training — in the data-parallel setup each replica trains on its own share of the training data and replicas are synchronized with MPI_Allreduce()]
For these models, Data Parallelism works best
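As a rough illustration of the data-parallel pattern (a minimal sketch using mpi4py and numpy, not the actual training code; the sizes and learning rate are made up), each rank computes gradients on its own shard of the data, the gradients are summed across all GPUs with an allreduce, and every rank then applies the same update:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

weights = np.ones(1_000_000, dtype=np.float32)                    # model replica, identical on every rank
local_grad = np.random.randn(weights.size).astype(np.float32)     # stand-in for gradients from this rank's data shard

global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)               # sum gradients across all GPUs
global_grad /= size                                               # average

lr = 1e-3
weights -= lr * global_grad                                       # identical update keeps the replicas in sync

Because every replica applies the same averaged gradient, the model copies never diverge; the cost is one allreduce of the full gradient per step, which is why allreduce performance and interconnect bandwidth matter.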
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling very efficient, albeit algorithmically challenged
[Figure: sustained TFLOP/s (1–512, log scale) vs. number of GPUs (1–128), covering one-node and multi-node runs; a typical training run is marked]
All-reduce
• We implemented our own all-reduce out of send and receive (a ring version is sketched after this list)
• Several algorithm choices based on size
• Careful attention to affinity and topology
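As one illustration of building an allreduce purely from point-to-point messages, here is a minimal ring allreduce sketch in mpi4py/numpy (a common choice for large messages; this is illustrative, not Baidu's implementation, and it ignores the affinity and topology tuning mentioned above):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

def ring_allreduce(buf):
    """In-place sum-allreduce of a 1-D float32 array using only send/receive around a ring."""
    chunks = np.array_split(buf, size)                  # views into buf, one chunk per rank
    recv = np.empty(max(c.size for c in chunks), dtype=buf.dtype)

    # Reduce-scatter: after size-1 steps, each rank holds one fully summed chunk.
    for step in range(size - 1):
        send_idx = (rank - step) % size
        recv_idx = (rank - step - 1) % size
        comm.Sendrecv(chunks[send_idx], dest=right,
                      recvbuf=recv[:chunks[recv_idx].size], source=left)
        chunks[recv_idx] += recv[:chunks[recv_idx].size]

    # Allgather: circulate the finished chunks so every rank has the full result.
    for step in range(size - 1):
        send_idx = (rank - step + 1) % size
        recv_idx = (rank - step) % size
        comm.Sendrecv(chunks[send_idx], dest=right,
                      recvbuf=recv[:chunks[recv_idx].size], source=left)
        chunks[recv_idx][:] = recv[:chunks[recv_idx].size]

grad = np.full(16, float(rank), dtype=np.float32)
ring_allreduce(grad)                                    # every element becomes sum(range(size))

The reduce-scatter plus allgather structure sends each byte only about 2·(N−1)/N times per node, so bandwidth cost is nearly independent of the number of GPUs; for small messages other algorithms (e.g. tree-based) can win, consistent with choosing the algorithm based on size.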
Scalability
• Batch size is hard to increase
– algorithmic and memory limits
• Performance at small batch sizes (32, 64) leads to scalability limits
Precision
• FP16 also mostly works
– Use FP32 for softmax and weight updates (sketched after the figure below)
• More sensitive to labeling error
[Figure: "Weight Distribution" histogram — count (1 to 10^8, log scale) vs. magnitude (exponents -31 to 0)]
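A minimal numpy sketch of this recipe (illustrative values, not the actual training code): small FP16 updates can round away entirely, which is why an FP32 master copy of the weights is used for updates, and the softmax is computed in FP32 for range and stability:

import numpy as np

lr = np.float16(1e-3)
grad = np.float16(2e-4)              # a small gradient from the FP16 backward pass

# Naive FP16 update: the step (~2e-7) is far below FP16 spacing around 1.0, so it vanishes.
w_fp16 = np.float16(1.0)
w_fp16 = np.float16(w_fp16 - lr * grad)
print(w_fp16)                        # 1.0 -- no progress

# Mixed precision: keep an FP32 master copy, update it, then cast back to FP16 for the next step.
w_master = np.float32(1.0)
w_master = w_master - np.float32(lr) * np.float32(grad)
w_fp16 = np.float16(w_master)
print(w_master)                      # ~0.9999998 -- the update survives

def softmax_fp32(logits_fp16):
    """Softmax computed in FP32: subtract the max and exponentiate with FP32 range."""
    x = logits_fp16.astype(np.float32)
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax_fp32(np.array([12.0, 1.0, -3.0], dtype=np.float16)))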
Conclusion
• We have to do experiments at scale
• Pushing compute scaling for end-to-end deep learning
• Efficient training for large datasets
– 50 TFLOP/s sustained on one model
– 20 exaflops to train each model
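As a back-of-the-envelope check relating those two numbers (assuming the 50 TFLOP/s is sustained for the whole run): 20 exaflops / 50 TFLOP/s = 20×10^18 / 50×10^12 ≈ 4×10^5 seconds, i.e. roughly 4–5 days of compute per model.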
• Thanks to Bryan Catanzaro, Carl Case, Adam Coates for donating some slides
Speaker notes:
Model Parallel: Latency sensitive
Data Parallel: Bandwidth sensitive