In machine learning projects, preparing large datasets is a key phase that can be complex and expensive. Traditionally, data engineers did this work before handing the data over to data scientists or ML engineers, and the two groups operated in different environments because each phase required different tools, frameworks, and runtimes. Spark's support for diverse workloads brought data engineering closer to downstream activities, such as machine learning, that depend on the data. Unifying data acquisition, preprocessing, model training, and batch inference on a single Spark-based platform not only provided a seamless experience across phases and accelerated the end-to-end ML lifecycle, but also lowered the total cost of ownership of building and managing infrastructure for those phases. With that, the needs of the shared infrastructure expanded to include specialized hardware such as GPUs and support for deep learning workloads. Spark can make effective use of such infrastructure, as it integrates with popular deep learning frameworks and supports GPU acceleration of deep learning jobs. In this talk, we share learnings and experiences from supporting different types of workloads in shared clusters equipped for both deep learning and data engineering. We will cover the following topics:
* Considerations for sharing infrastructure between big data and deep learning in Spark
* Deep learning in Spark clusters with and without GPUs
* Differences between distributed data processing and distributed machine learning
* Multitenancy and isolation in shared infrastructure
https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=97
Spark + AI Summit 2019 – Infrastructure for Deep Learning in Apache Spark (0425)
2. Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark
#UnifiedAnalytics #SparkAISummit
3. Agenda
• Evolution of data infrastructure
• ML workflow: Data prep & DNN training
• Intro to deep learning and computing needs
• Distributed deep learning and challenges
• Unified platform using Spark
– Infra considerations, challenges
• ML Pipelines
6. + Machine Learning and Deep Learning workloads
7. How long does it take to train Resnet-50 on ImageNet?
Before 2017: ~14 days on a single NVIDIA M40 GPU
8. Training ResNet-50 on ImageNet
• Apr 2017 – Facebook, Caffe2, 256x Tesla P100: 1 hour
• Sept 2017 – UC Berkeley, TACC, UC Davis, TensorFlow, 1,600 CPUs: 31 mins
• Nov 2017 – Preferred Networks, ChainerMN, 1,024x Tesla P100: 15 mins
• July 2018 – Tencent, TensorFlow, 2,048x Tesla P40: 6.6 mins
• Nov 2018 – Sony, Neural Network Library (NNL), 3,456x Tesla V100: 2.0 mins
• Apr 2019 – Fujitsu, MXNet, 2,048x Tesla V100: 1.2 mins
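The headline numbers above support a rough back-of-the-envelope comparison. A small sketch (times and device counts taken from the slides; hardware generations differ across entries, so per-device figures mix scaling gains with faster chips):

```python
# Speedup arithmetic from the slide's numbers. Baseline: ~14 days on a
# single NVIDIA M40 (pre-2017).
baseline_hours = 14 * 24            # 336 hours

# (time in hours, device count) for selected entries above. Devices
# differ in generation, so per-device "efficiency" > 1.0 reflects
# faster hardware, not super-linear scaling.
entries = {
    "Facebook / Caffe2, 256x P100":     (1.0, 256),
    "Tencent / TensorFlow, 2,048x P40": (6.6 / 60, 2048),
    "Fujitsu / MXNet, 2,048x V100":     (1.2 / 60, 2048),
}

for name, (hours, devices) in entries.items():
    speedup = baseline_hours / hours
    print(f"{name}: {speedup:,.0f}x speedup, {speedup / devices:.2f}x per device")
```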
9. Considerations for Deep Learning @ Scale
• CPU vs. GPU
• Single vs. multi-GPU
• MPI vs. non-MPI
• Infiniband vs. Ethernet
Credits: Mathew Salvaris
https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-deployment-of-deep-learning-models/
10. “Things” you need to deal with when training machine learning/deep learning models
• Provision VM clusters
• Secure access
• Schedule jobs
• Scale resources
• Manage dependencies and containers
• Distribute data
• Handle failures
• Gather results
12. Machine Learning and Deep Learning
(Figures: deep learning as a subset of machine learning; bottom figure from NVIDIA)
13. Lots of ML Frameworks…
TensorFlow, PyTorch, Scikit-Learn, MXNet, Chainer, Keras
14. Design Choices for Big Data and Machine Learning/Deep Learning
• Laptop
• Cloud
• Spark + separate infrastructure for ML/DL training/inference
• Spark
15–17. Execution Models for Spark and Deep Learning
Spark:
• Independent tasks (Task 1, Task 2, Task 3)
• Embarrassingly parallel and massively scalable
• Re-run only the crashed task
Distributed Learning (data parallelism / model parallelism):
• Non-independent tasks
• Some parallel processing
• Optimizing communication between nodes
• Re-run all tasks
Credits – Reynold Xin, Project Hydrogen – State of the Art Deep Learning on Apache Spark
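The failure-handling difference above can be illustrated with a toy sketch (plain Python threads standing in for tasks; none of this is Spark's API): independent tasks can be retried one at a time, while barrier-synchronized workers must all run together.

```python
import threading

# ETL-style: tasks are independent, so a failed task is simply retried
# on its own; the others are unaffected.
def run_independent(tasks):
    results = {}
    for i, task in enumerate(tasks):
        for _attempt in range(2):            # naive per-task retry
            try:
                results[i] = task()
                break
            except RuntimeError:
                continue
    return results

# Training-style: workers synchronize (e.g. to exchange gradients), so
# they must all be running at once; losing one worker stalls its peers
# at the barrier, which is why the whole gang gets re-run.
def run_gang(worker_fns):
    barrier = threading.Barrier(len(worker_fns))
    results = [None] * len(worker_fns)

    def wrap(i, fn):
        barrier.wait()                       # every worker waits for all peers
        results[i] = fn()

    threads = [threading.Thread(target=wrap, args=(i, f))
               for i, f in enumerate(worker_fns)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_independent([lambda: 1, lambda: 2]))   # {0: 1, 1: 2}
print(run_gang([lambda: 10, lambda: 20]))        # [10, 20]
```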
19. Microsoft Machine Learning for Apache Spark v0.16
Microsoft’s Open Source Contributions to Apache Spark
www.aka.ms/spark | Azure/mmlspark
• Cognitive Services
• Spark Serving
• Model Interpretability
• LightGBM Gradient Boosting
• Deep Networks with CNTK
• HTTP on Spark
20. Demo – Azure Databricks and Deep Learning
21. Demo – Distributed Deep Learning using TensorFlow with HorovodRunner
22. What do you need for training / distributed training?
• CPU
• GPU
• Memory
• Network
• Storage
• Deep Learning Framework
Physics of Machine Learning and Deep Learning
24. From CUDA to NCCL1 to NCCL2
• CUDA: multi-core CPU → GPU
• NCCL 1: multi-GPU communication library (single node)
• NCCL 2: multi-GPU, multi-node
Credits: NCCL Tutorial (https://bit.ly/2KpPP44)
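The core collective behind NCCL's multi-GPU scaling is ring all-reduce. A toy single-process sketch of the algorithm (plain Python lists stand in for device buffers; real NCCL pipelines these steps over the interconnect):

```python
# Toy ring all-reduce: sum n buffers so every "rank" ends with the
# elementwise total, moving only one chunk per rank per step.
def ring_allreduce(buffers):
    n = len(buffers)                 # number of "GPUs" in the ring
    size = len(buffers[0])
    assert size % n == 0, "buffer must split into n equal chunks"
    c = size // n
    data = [list(b) for b in buffers]

    def get(r, k):
        k %= n
        return data[r][k * c:(k + 1) * c]

    def put(r, k, vals):
        k %= n
        data[r][k * c:(k + 1) * c] = vals

    # Phase 1, reduce-scatter: each step, rank i passes one chunk to
    # rank i+1, which adds it in. After n-1 steps, rank i holds the
    # full sum for chunk (i+1) % n.
    for s in range(n - 1):
        for i in range(n):
            sent = get(i, i - s)
            nxt = (i + 1) % n
            put(nxt, i - s, [a + b for a, b in zip(get(nxt, i - s), sent)])

    # Phase 2, all-gather: circulate each completed chunk around the
    # ring so every rank ends with the full summed buffer.
    for s in range(n - 1):
        for i in range(n):
            put((i + 1) % n, i + 1 - s, get(i, i + 1 - s))

    return data

out = ring_allreduce([[1, 2], [3, 4]])
print(out)  # both ranks end with the elementwise sum: [[4, 6], [4, 6]]
```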
27. Spark & GPU
• Options for using GPUs with Spark:
1. Native support (cluster manager, GPU tasks): SPARK-24615
2. Use cores/memory as a proxy for GPU resources and allow GPU-enabled code execution
3. Code implementation/generation for GPU offload
• Considerations
– Flexibility
– Data management
– Multi-GPU execution
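Option 2 is the common workaround before native support landed: size task slots so that each task implicitly owns one GPU. A hedged sketch (the node shape and script name are hypothetical; `spark.executor.cores` and `spark.task.cpus` are standard Spark settings):

```shell
# Hypothetical node shape: 4 GPUs and 16 cores per executor host.
# Reserving 4 cores per task caps concurrency at 4 tasks per executor,
# one per GPU, so each task can safely claim "its" device.
spark-submit \
  --conf spark.executor.cores=16 \
  --conf spark.task.cpus=4 \
  train_on_gpu.py   # hypothetical training script
```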
28. Infrastructure Considerations
• Data format, storage and reuse
– Co-locate Data Engineering storage infrastructure (cluster-local)
– DL Framework support for HDFS (reading from HDFS does not mean data-locality-aware computation)
– Sharing data between Spark and Deep Learning (HDFS, Spark-TF connector, Parquet/Petastorm)
• Job execution
– Gang scheduling – Refer to SPARK-24374
– Support for GPU (and other accelerators) – Refer to SPARK-24615
– Cluster sharing with other types of jobs (CPU-only cluster vs. CPU+GPU cluster)
– Quota management
– Support for Docker containers
– MPI vs. non-MPI
– Different GPU generations
• Node, GPU connectivity
– Infiniband, RDMA
– GPU Interconnect options
– Interconnect-aware scheduling, minimize distribution, repacking
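Gang scheduling (the SPARK-24374 item above) means a training job launches only when all of its tasks can be placed at once; otherwise partially placed workers would block at their first barrier while holding slots. A toy all-or-nothing scheduler sketch (pure Python, not Spark's barrier implementation; gangs are assumed to finish in launch order):

```python
# Toy all-or-nothing (gang) scheduler. A gang of size k launches only
# when k slots are free at the same time; when a gang finishes, it
# releases all of its slots together.
def gang_schedule(jobs, total_slots):
    """jobs: FIFO list of (name, slots_needed); returns launch order."""
    free = total_slots
    running = []                     # launched gangs, assumed to finish FIFO
    order = []
    for name, need in jobs:
        assert need <= total_slots, f"{name} can never be placed"
        while need > free:           # wait: never start a partial gang
            _, done_slots = running.pop(0)
            free += done_slots       # whole gang's slots come back at once
        free -= need
        running.append((name, need))
        order.append(name)
    return order

# 8-slot cluster: the two 4-slot gangs can run together; the 6-slot
# gang waits until whole earlier gangs drain, rather than starting
# only 2 of its 6 workers.
print(gang_schedule([("etl", 4), ("train-a", 4), ("train-b", 6)], 8))
# ['etl', 'train-a', 'train-b']
```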
29. ML Pipelines
• Using machine learning pipelines, data scientists, data engineers,
and IT professionals can collaborate on different steps/phases
• Enable use of best tech for different phases in ML/DL workflow
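The per-phase "best tech" idea can be sketched as a minimal pipeline abstraction (a pure-Python toy, not the Azure ML or Spark ML Pipelines API; the compute targets are just labels):

```python
# Toy pipeline: each step is (name, compute_target, fn). One step's
# output feeds the next, so data prep could run on a Spark cluster
# while training runs on a GPU cluster (targets here are only labels).
def run_pipeline(steps, data):
    log = []
    for name, target, fn in steps:
        data = fn(data)                      # hand result to the next phase
        log.append(f"{name} on {target}")
    return data, log

steps = [
    ("prep",  "spark-cluster", lambda xs: [x / 255 for x in xs]),   # normalize
    ("train", "gpu-cluster",   lambda xs: sum(xs) / len(xs)),       # stand-in "model"
    ("eval",  "cpu-node",      lambda m: round(m, 3)),
]

result, log = run_pipeline(steps, [0, 51, 255])
print(result, log)
```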
30. Demo – Azure ML
Pipelines & Databricks
32. Kaarthik Sivashanmugam, Wee Hyong Tok
Microsoft
Infrastructure for Deep Learning
in Apache Spark