ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it lets different deep learning frameworks be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk includes a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline, written in a Jupyter notebook on Hopsworks with ROCm.
2. Jim Dowling, CEO @ Logical Clocks
Ajit Mathews, VP Machine Learning @ AMD
ROCm and Distributed Deep Learning on Spark and TensorFlow
#UnifiedAnalytics #SparkAISummit
@jim_dowling
3. Great Hedge of India
• The East India Company was one of the industrial world's first monopolies.
• They assembled a thorny hedge (not a wall!) spanning India.
• You paid customs duty to bring salt over the wall (sorry, hedge).
In 2019, not all graphics cards are allowed to be used in a Data Center. Monopolies are not good for deep learning!
[Image from Wikipedia]
4. Nvidia™ 2080Ti vs AMD Radeon™ VII
ResNet-50 Benchmark
Nvidia™ 2080Ti: 11 GB memory, TensorFlow 1.12, CUDA 10.0.130, cuDNN 7.4.1
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~322
FP16 total images/sec: ~560
AMD Radeon™ VII: 16 GB memory, TensorFlow 1.13.1, ROCm 2.3
Model: ResNet-50, Dataset: ImageNet (synthetic)
------------------------------------------------------------
FP32 total images/sec: ~302
FP16 total images/sec: ~415
https://lambdalabs.com/blog/2080-ti-deep-learning-benchmarks/
https://www.phoronix.com/scan.php?page=article&item=nvidia-rtx2080ti-tensorflow&num=2
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173
5. AMD ML Software Strategy
[Stack diagram] Fully open-source ROCm platform, an open source foundation for machine learning and AMD ML & HPC solutions:
• Spark / Machine Learning Apps on top of the latest machine learning frameworks
• Middleware and libraries: MIOpen, RCCL, Eigen, BLAS, FFT, RNG (optimized math & communication libraries)
• Programming models: OpenMP, HIP, OpenCL™, Python
• Devices: GPU, CPU, APU, DLA
• Up-streamed for Linux kernel distributions; Docker and Kubernetes support
6. Distro: Upstream Linux Kernel Support
Linux Kernel 4.17: 700+ upstream ROCm driver commits since the 4.12 kernel
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver
9. Machine Learning Frameworks
TensorFlow: high-performance FP16/FP32 training with up to 8 GPUs per node. v1.13 is available today as a Docker container or as a Python pip wheel.
PyTorch: AMD support in the mainline repository, including initial multi-GPU support for Caffe2. Available today as a Docker container (or build from source).
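As a hedged sketch, pulling and launching the ROCm TensorFlow container might look like the following (the image tag is illustrative; check Docker Hub for current tags, and note that `/dev/kfd` and `/dev/dri` are the kernel interfaces the ROCm runtime needs inside the container):

```shell
# Pull the ROCm TensorFlow image (tag is an example)
docker pull rocm/tensorflow:latest

# Map the ROCm device nodes into the container
docker run -it --device=/dev/kfd --device=/dev/dri \
  --security-opt seccomp=unconfined --group-add video \
  rocm/tensorflow:latest
```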
10. ROCm Distributed Training
RCCL: optimized collective communication operations library
• Easy MPI integration
• Support for InfiniBand and RoCE high-speed network fabrics
• ROCm-enabled UCX
• ROCm w/ ROCnRDMA
[Chart] ResNet-50 multi-GPU scaling (PCIe, CPU parameter-server): 1 GPU = 1.00x, 2 GPUs = 1.99x, 4 GPUs = 3.98x, 8 GPUs = 7.64x
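The all-reduce collectives that RCCL optimizes are typically implemented as a ring: each of the N workers passes chunks to its right neighbour through a reduce-scatter phase and an all-gather phase, so every worker ends with the full sum while each link carries only about 2(N-1)/N of the data. A toy single-process simulation of that algorithm (pure Python, not the RCCL API):

```python
def ring_allreduce(worker_data):
    """Simulate ring all-reduce: every worker ends with the elementwise sum.

    worker_data: one equal-length list of numbers per worker; the vector
    length must be divisible by the number of workers.
    """
    n = len(worker_data)
    chunks = [list(v) for v in worker_data]   # each worker's local buffer
    clen = len(worker_data[0]) // n           # chunk length

    def sl(c):                                # slice covering chunk c
        return slice(c * clen, (c + 1) * clen)

    # Phase 1: reduce-scatter. Each step, every worker sends one chunk to its
    # right neighbour, which adds it in. Messages are snapshotted first to
    # mimic all transfers in a step happening simultaneously.
    for step in range(n - 1):
        msgs = [(r, (r - step) % n, chunks[r][sl((r - step) % n)])
                for r in range(n)]
        for r, c, payload in msgs:
            dst = (r + 1) % n
            chunks[dst][sl(c)] = [a + b
                                  for a, b in zip(chunks[dst][sl(c)], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. Completed chunks circulate around the ring.
    for step in range(n - 1):
        msgs = [(r, (r + 1 - step) % n, chunks[r][sl((r + 1 - step) % n)])
                for r in range(n)]
        for r, c, payload in msgs:
            chunks[(r + 1) % n][sl(c)] = payload

    return chunks
```

With 3 workers and 6-element vectors this completes in 2 + 2 steps, and every worker's buffer equals the elementwise sum of the inputs.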
11. GPU Acceleration & Distributed Training
Docker and Kubernetes support
o Docker: https://hub.docker.com/r/rocm/rocm-terminal/
o Kubernetes: https://hub.docker.com/r/rocm/k8s-device-plugin/
The ROCm-supported Kubernetes device plugin enables registration of GPUs in a container cluster for compute workloads.
Extensive language support, including Python
o Anaconda NUMBA: https://github.com/numba/roctools
o Conda: https://anaconda.org/rocm
o CLANG HIP: https://clang.llvm.org/doxygen/HIP_8h_source.html
With support for AMD's ROCm drivers, Numba lets you write parallel GPU algorithms entirely from Python.
Horovod support with RCCL
o Ring AllReduce: https://github.com/ROCmSoftwarePlatform/horovod
o TensorFlow CollectiveAllReduceStrategy
o HopsML CollectiveAllReduceStrategy
ROCm supports AllReduce distributed training using Horovod and HopsML.
12. Hopsworks Open-Source Data and AI
[Architecture diagram] Data sources feed Hopsworks services — Kafka, Airflow, Spark / Flink, HopsFS, Feature Store, Hive, Deep Learning, Notebooks, Elastic, and model serving with Kubernetes — alongside external BI tools & reporting. Runs on-premise, AWS, Azure, GCE.
13. Hopsworks Open-Source Data and AI
[Same architecture diagram, annotated by workload: BATCH ANALYTICS, STREAMING, ML & DEEP LEARNING]
14. Hopsworks Hides ML Complexity
MODEL TRAINING
Feature Store
HopsML API
[Diagram adapted from "technical debt of machine learning"]
https://www.logicalclocks.com/feature-store/
https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.html#
15. ROCm -> Spark / TensorFlow
• Spark / TensorFlow applications run unchanged on ROCm
• Hopsworks runs Spark/TensorFlow on YARN and Conda
16. YARN support for ROCm in Hops
A Container is a CGroup that isolates CPU, memory, and GPU resources and has a conda environment and TLS certs.
[Diagram] The Resource Manager schedules a Driver container and Executor containers across the Node Managers.
17. Distributed Deep Learning Spark/TF
[Diagram] Driver and Executors (each with a conda_env) read training data from, and write TensorBoard logs, checkpoints, and models to, HopsFS.

# RUNS ON THE EXECUTORS
def train(lr, dropout):
    def input_fn():  # return dataset
        ...
    optimizer = ...
    model = ...
    model.add(Conv2D(...))
    model.compile(...)
    model.fit(...)
    model.evaluate(...)

# RUNS ON THE DRIVER
hparams = {'lr': [0.001, 0.0001],
           'dropout': [0.25, 0.5, 0.75]}
experiment.grid_search(train, hparams)

More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
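The driver-side grid search fans the train function out over the cartesian product of the hyperparameter lists (6 runs for the slide's values). A minimal pure-Python sketch of that semantics — this is not the hops API, which additionally ships each run to a Spark executor and logs results to HopsFS; the toy `train` below is purely illustrative:

```python
import itertools

def grid_search(train_fn, hparams):
    """Run train_fn once per point in the cartesian product of hparams.

    Returns (best_args, best_metric), assuming train_fn returns a metric
    to maximize. The real HopsML version runs each point on an executor;
    here we simply loop sequentially.
    """
    names = sorted(hparams)
    results = {}
    for values in itertools.product(*(hparams[n] for n in names)):
        args = dict(zip(names, values))
        results[values] = train_fn(**args)
    best = max(results, key=results.get)
    return dict(zip(names, best)), results[best]

# Toy "training" function: pretend higher lr and mid dropout score best
def train(lr, dropout):
    return lr * 100 - abs(dropout - 0.5)

best_args, best_score = grid_search(
    train, {'lr': [0.001, 0.0001], 'dropout': [0.25, 0.5, 0.75]})
```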
18. Distributed Deep Learning Spark/TF
[Diagram] Driver and Executors (each with a conda_env) read training data from, and write TensorBoard logs, checkpoints, and models to, HopsFS.

# RUNS ON THE EXECUTORS
def train():
    def input_fn():  # return dataset
        ...
    model = ...
    optimizer = ...
    model.compile(...)
    rc = tf.estimator.RunConfig(
        'CollectiveAllReduceStrategy')
    keras_estimator = tf.keras.estimator.model_to_estimator(...)
    tf.estimator.train_and_evaluate(
        keras_estimator, input_fn)

# RUNS ON THE DRIVER
experiment.collective_all_reduce(train)

More details: Spark Summit Europe 2018 talk https://www.youtube.com/watch?v=tx6HyoUYGL0
https://github.com/logicalclocks/hops-examples
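With CollectiveAllReduceStrategy, every worker runs the same train function on a different shard of the input and gradients are averaged via all-reduce each step. A toy sketch of the per-worker input sharding only (pure Python; in TensorFlow this is what `dataset.shard(num_shards, index)` does inside `input_fn`):

```python
def shard(dataset, num_workers, worker_index):
    """Give worker_index every num_workers-th example, so the shards
    partition the dataset with no overlap and no duplication."""
    return dataset[worker_index::num_workers]

examples = list(range(10))
shards = [shard(examples, 3, i) for i in range(3)]
```

Each worker then computes gradients on its own shard, and the collective all-reduce keeps the model replicas in sync.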
19. Horizontally Scalable ML Pipelines with Hopsworks
[Pipeline diagram] Airflow orchestrates the stages: Ingest -> Data Prep -> Feature Store -> Experiment/Train -> Deploy -> Serving. Raw data and event data land in HopsFS, the serving side is monitored, and logs flow into the Metadata Store.
20. Pipelines from Jupyter Notebooks
[Diagram]
Interactive: a PySpark kernel talks to the Livy Server (materializing certs and ENV variables) on HopsYARN/HopsFS; the .ipynb contents live in HDFS, logs and results flow back to the notebook, and Experiments & TensorBoard let you view old notebooks, experiments, and visualizations.
Jobs Service: convert .ipynb to .py, run the .py or .jar (materializing certs and ENV variables), and schedule using the REST API or UI.
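The "convert .ipynb to .py" step relies on the notebook format being plain JSON: concatenating the code cells yields a runnable script. A minimal stdlib-only sketch (real converters such as `jupyter nbconvert --to script` also handle magics, markdown, and output stripping):

```python
import json

def notebook_to_script(ipynb_text):
    """Extract the code cells of a Jupyter notebook (JSON) into one .py string."""
    nb = json.loads(ipynb_text)
    parts = []
    for cell in nb.get('cells', []):
        if cell.get('cell_type') == 'code':
            # In nbformat, 'source' is a list of lines (or a single string)
            src = cell['source']
            parts.append(''.join(src) if isinstance(src, list) else src)
    return '\n\n'.join(parts) + '\n'

# Tiny two-cell notebook, inline for illustration
demo = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["# title"]},
        {"cell_type": "code", "source": ["x = 1\n", "y = 2\n"]},
        {"cell_type": "code", "source": ["print(x + y)"]},
    ],
})
script = notebook_to_script(demo)
```

Markdown cells are dropped and code cells are joined with blank lines, which is enough for the scheduled-job use case described above.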