2. Background
• Spark
– Batch/streaming processing, Spark SQL, MLlib
• Deep learning has many use cases at Baidu and has shown
significant quality improvements
– Image retrieval ranking
– Ads CTR prediction
– Machine translation
– Speech recognition
• Goal: enable distributed deep learning in Spark
3. Typical Training Data at Baidu
• Image recognition: 100 million
• OCR: 100 million
• Speech: 10 billion
• CTR: 100 billion
• Has grown every year since 2012
5. Paddle
• Parallel Asynchronous Distributed Deep Learning Library
– Supports distributed parameter servers for
synchronous/asynchronous parameter updates
– Supports multi-GPU / CPU training
– Supports sparse models
– Easy-to-understand API for users to add new layers
– Rich features for deep learning use cases,
especially NLP
6. Deep Learning Options Comparison

                             Caffe    TensorFlow           Torch                Paddle
Distributed training         Yes      Yes                  No                   Yes
Communication cost           Medium   High                 N/A                  Medium to Low
Easy to customize and code   Yes      More learning curve  More learning curve  Yes
Sparse model support         No       Yes                  Yes                  Yes
Area of focus                Vision   All                  All                  All
Integration with Spark       Yes      No                   No                   Yes
7. High Level Goals
• Implement the Spark ML abstractions so users can train
deep learning models with minimal code changes
• Leverage Paddle's native training and parameter-server
mechanisms, scheduled inside Spark deep learning jobs
• Handle multi-tenancy and heterogeneity
• Parallelize hyper-parameter selection
• Support both batch and streaming learning
16. Design Decisions
• Spark ML-compatible API
– Being compatible with Spark is more important than
being implemented inside Spark
• Code-level configuration
– Easy and flexible
– Manual configuration is error-prone
18. Sharded Parameter Server
• One parameter server and one trainer co-locate
on each machine.
• Parameters are sharded, not replicated
(see the sketch below).
• All-to-all communication.
• Our environment:
– 4 GPUs per machine
– 4-10 machines
– All machines on one switch
– Reliable data center
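A minimal sketch of the placement rule these bullets imply, assuming simple modulo sharding; the `ParamBlock` and `ShardedParameterServer` names are illustrative, not Paddle's real API. Each parameter block lives on exactly one shard, so a trainer's gradient push fans out to every shard, which is what makes the traffic all-to-all.

```scala
// Illustrative sketch: parameters sharded (not replicated) across machines,
// one parameter-server shard co-located with each trainer.
case class ParamBlock(id: Int, values: Array[Float])

class ShardedParameterServer(numShards: Int) {
  // which machine owns a given parameter block (assumed modulo placement)
  def shardOf(block: ParamBlock): Int = block.id % numShards

  // a trainer's gradient push touches every shard: all-to-all traffic
  def routePush(grads: Seq[ParamBlock]): Map[Int, Seq[ParamBlock]] =
    grads.groupBy(shardOf)
}

// e.g. with 4 machines, blocks 0..7 spread two per shard, and each
// trainer sends to all 4 shards on every step.
```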
19. GPU Ring Synchronization
• Each parameter only needs to cross the slow
connection twice (see the sketch below):
– once for the reduce,
– once for the scatter.
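One common way to realize this is ring all-reduce: a reduce-scatter phase followed by an all-gather phase, so every chunk of the gradient crosses each link exactly twice. Below is a minimal single-process simulation of the two phases over plain arrays; it is a sketch of the communication pattern only, not Paddle's actual GPU code, which would overlap these transfers on the hardware links.

```scala
// Ring all-reduce over n simulated workers. Each worker holds a full
// gradient split into n chunks; afterwards every worker holds the sum.
object RingAllReduce {
  def allReduce(workers: Array[Array[Float]]): Unit = {
    val n = workers.length
    val c = workers(0).length / n // chunk size; assumes length % n == 0
    def chunk(w: Int, ch: Int) = workers(w).slice(ch * c, ch * c + c)

    // Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    // fully reduced chunk (i + 1) % n.
    for (step <- 0 until n - 1) {
      // snapshot all messages first, so every send uses pre-step values
      val msgs = Array.tabulate(n) { i =>
        val ch = ((i - step) % n + n) % n
        (ch, chunk(i, ch))
      }
      for (i <- 0 until n) {
        val (ch, data) = msgs(i)
        for (k <- 0 until c) workers((i + 1) % n)(ch * c + k) += data(k)
      }
    }
    // Phase 2: all-gather. Reduced chunks circulate once around the ring.
    for (step <- 0 until n - 1) {
      val msgs = Array.tabulate(n) { i =>
        val ch = ((i + 1 - step) % n + n) % n
        (ch, chunk(i, ch))
      }
      for (i <- 0 until n) {
        val (ch, data) = msgs(i)
        for (k <- 0 until c) workers((i + 1) % n)(ch * c + k) = data(k)
      }
    }
  }
}
```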
20. ImageNet Scale Experiments
[Figure: time per 100 batches (seconds) vs. number of machines (1-5), TCP vs. RDMA]
• AlexNet on ImageNet
• Batch size = 64
• 1 machine has 4 K10 GPUs
22. Sparse Training Experiment
[Figure: time per 100 batches (seconds) vs. number of nodes (1, 2, 4, 8, 16), non-sparse vs. sparse]
• 1,451,594-dimensional sparse feature
• Embedded to 128d, 96d, 96d, and 128d
• A ranking cost on top
• Batch size = 128
Pros:
Distributed
Comprehensive algorithm support
Fast
Strong NLP support
Cons:
Difficult to use
Training resource management
Wasteful data copying
Reliability
Manual model serving
Spark takes care of everything outside the training algorithm, including data preparation (possibly also preprocessing with Spark, and model serving in the same app).
YARN takes care of resource management, to support hyper-parameter training and multi-tenancy.
YARN enables job-level heterogeneous computing and provides better resource utilization.
We wrap the raw training data inside a DataFrame; the source could be a DB table or an HDFS folder.
We partition based on the number of Paddle trainers and push the training input data through an RPC interface into the Paddle trainers, as sketched below.
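A hedged sketch of that feeding path, assuming one DataFrame partition per trainer; `PaddleTrainerClient`, its `sendBatch` method, the port, and the input path are all hypothetical names for the RPC interface, not Paddle's real client API.

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().appName("paddle-feeder").getOrCreate()

val numTrainers = 4 // one DataFrame partition per Paddle trainer
val training = spark.read
  .parquet("hdfs:///data/train") // illustrative path; could be a DB table via JDBC
  .repartition(numTrainers)

training.foreachPartition { rows: Iterator[Row] =>
  // each executor task streams its partition to its trainer over RPC
  val client = new PaddleTrainerClient("localhost", 7164) // hypothetical client
  try rows.grouped(128).foreach(batch => client.sendBatch(batch.toSeq))
  finally client.close()
}
```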
Our work took some inspiration from CaffeOnSpark, but we made a different design choice:
Leave the deep learning platform as a black box to keep its training performance, and use Spark/YARN to make engineers' lives easier.
Keep the scheduling of deep learning training intact inside Paddle, because scheduling during training is a heavily tuned result that requires deep learning background (e.g., keeping trainer speeds equal) …
Shown here is the ideal case of a training job that co-locates trainers and executors in a 1:1 ratio.
With the YARN resource manager, we can also have a Spark executor running on a CPU machine while pushing training input to a trainer on a GPU machine.
The API design starts here:
We encapsulate Paddle inside the Spark ML API, accomplishing deep learning support in Spark ML.
Our work demonstrates how a deep learning platform can fit into Spark ML seamlessly.
The current Spark ML parameter abstraction is not enough for deep learning, so we extend it with a neural-network parameter.
Hyper-parameter tuning is easier to achieve here through code-level configuration: you can code your own tuning rules and have Spark take care of the rest, as sketched below.
Code-level configuration can also run sanity checks, e.g., catching spelling mistakes and flagging unreasonable configs.
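A sketch of what code-level configuration might look like against the Spark ML abstractions; `PaddleClassifier` and its `topology`/`learningRate`/`batchSize` params are hypothetical names for the neural-network parameter extension, while `Pipeline`, `ParamGridBuilder`, and `CrossValidator` are standard Spark ML, which is what turns hyper-parameter search into a plain-code concern.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// hypothetical Estimator wrapping a Paddle training job
val paddle = new PaddleClassifier()
  .setTopology("fc:128,fc:96,softmax") // hypothetical topology param
  .setFeaturesCol("features")
  .setLabelCol("label")

// the tuning rule is just code: Spark schedules the parallel trials
val grid = new ParamGridBuilder()
  .addGrid(paddle.learningRate, Array(0.01, 0.001)) // hypothetical params
  .addGrid(paddle.batchSize, Array(64, 128))
  .build()

val cv = new CrossValidator()
  .setEstimator(new Pipeline().setStages(Array(paddle)))
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val model = cv.fit(training) // `training` is the DataFrame from the sketch above
```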
Dynamically add or remove trainers during training.
Resource preemption.
Mixed resource deployment: CPU and GPU used together.
The model serving service can be deployed automatically through YARN after training.